CSIRO using serverless compute to analyse the human genome

The CSIRO is using AWS Lambda to allow analysis of the 20 exabytes of data coming from genomics every year.
Written by Asha Barbaschow, Contributor

By 2025, it is estimated that 50 percent of the world's population will have had their genome sequenced, which according to Commonwealth Scientific and Industrial Research Organisation (CSIRO) transformational bioinformatics team leader Dr Denis Bauer means that genomic data will be larger than the data held by Twitter, YouTube, and astronomy combined.

Genomics is the study of information encoded in an individual's DNA, allowing researchers to study how genes impact health and disease.

The genome holds the blueprint for every cell in an individual's body and with so much information encoded in the genome it comes as no surprise Australia's peak research organisation is investing heavily in exploring its possibilities.

Speaking at the AWS Public Sector Summit in Canberra on Wednesday, Bauer detailed how the CSIRO is using Amazon Web Services infrastructure to build a genomic application that just a couple of years ago would have seemed impossible.

She said genomics produces a staggering 20 exabytes of data per year, noting also that such large amounts of data bring about three main problems.

"One technical problem is that the large volumes of data is not trivial to get a hold of, specifically when we're talking about 40 gigabytes per genome, per individual," she explained.

"We also experience burstable workloads where clinicians might access this resource at the same time as 10,000 other clinicians, but at the next minute it might drop to nothing, so therefore we don't want to pay for a workload that can crunch that much data and then the next time there's nothing, it's just sitting around idling.

"Third problem is consolidating data from silos."
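To put the two figures Bauer quotes side by side, here is a rough back-of-the-envelope sketch. It assumes decimal (SI) units, and the function names are hypothetical; the burst calculation further assumes each of the 10,000 clinicians touches one full 40-gigabyte genome, which the article does not state.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the article.
# Assumption: decimal (SI) units; the article does not say which it uses.
GB = 10**9
TB = 10**12
EB = 10**18

BYTES_PER_GENOME = 40 * GB   # "40 gigabytes per genome, per individual"
BYTES_PER_YEAR = 20 * EB     # "20 exabytes of data per year"


def genomes_per_year(total_bytes: int, per_genome_bytes: int) -> int:
    """How many genomes of the given size fit into the yearly data volume."""
    return total_bytes // per_genome_bytes


def burst_bytes(clinicians: int, per_genome_bytes: int) -> int:
    """Data touched if every clinician reads one full genome at once.

    Assumption for illustration only: one whole genome per clinician.
    """
    return clinicians * per_genome_bytes


if __name__ == "__main__":
    print(genomes_per_year(BYTES_PER_YEAR, BYTES_PER_GENOME), "genomes per year")
    print(burst_bytes(10_000, BYTES_PER_GENOME) // TB, "TB in a 10,000-clinician burst")
```

On these assumptions, 20 exabytes a year corresponds to about 500 million genomes' worth of data, and a simultaneous burst from 10,000 clinicians would touch around 400 terabytes before dropping back to nothing.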

With privacy regulations differing between jurisdictions, Bauer said it is unlikely that the world's genomic data will be consolidated into one single entity. She said dealing with distributed systems will therefore have to be something those involved get used to.

The transformational bioinformatics team that Bauer leads has the charter to develop novel bioinformatics solutions for research and industry using the latest in cloud and BigData infrastructure.

It specifically focuses on population-scale analysis of genomics, transcriptomics, and methylomics, as well as genome engineering applications.

Working for the eHealth research program within CSIRO, which is the largest digital health agency in Australia, Bauer said the teams are focused on improving healthcare through using digital technologies and services.

The CSIRO released its Future of Health [PDF] report this week, which outlined the organisation's 15-year vision of healthcare in Australia.

As the title of the report explains, the CSIRO's main idea is shifting Australia's focus from illness treatment to health and wellbeing management.

"One of the biggest messages from this was that we need to stop being reactive, treating illnesses, to being preventative and catching illnesses before they actually become symptomatic and one of the key themes in there was digital health," Bauer added.

Another theme in the report was precision medicine, and to deliver on this vision Bauer said the CSIRO has developed VariantSpark, which is a Hadoop/Spark machine learning library for genomic data analysis.

"It's built on the Apache core and what you can do is you can spin up an Apache Spark cluster to analyse your data directly on AWS," she explained.

"Bringing the information that we've found in the genome into the actual clinical practice and making decisions on it is not trivial and for that we developed GenPhen-Insight, which is a tool that combines medical data with genomic data to improve, in real time, treatment diagnosis and treatment outcomes or recommendations.

"Specifically designed for scaling to the growing need of genomic data in the future."

VariantSpark uses AWS Lambda, an on-demand serverless computing service, and the CSIRO's genomic files are all located in a data lake on S3.
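For readers unfamiliar with the pattern, the sketch below shows roughly what a Lambda function reacting to files landing in an S3 bucket can look like. It is not the CSIRO's code: the handler name, bucket, and file key are hypothetical, and the article does not say the team triggers Lambda from S3 events. It relies only on the documented shape of an S3 event notification, in which object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Hypothetical Lambda entry point: list the S3 objects named in an event.

    A real function would go on to read and analyse each object; this sketch
    stops at extracting the bucket and key so it runs without AWS access.
    """
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        objects.append(
            {
                "bucket": s3["bucket"]["name"],
                # Keys in S3 event notifications are URL-encoded.
                "key": unquote_plus(s3["object"]["key"]),
                "size": s3["object"].get("size"),
            }
        )
    return {"count": len(objects), "objects": objects}


# Illustrative event only; the bucket and key names are made up.
SAMPLE_EVENT = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-genomics-lake"},
                "object": {"key": "cohort+a/sample1.vcf.gz", "size": 1024},
            }
        }
    ]
}

if __name__ == "__main__":
    print(handler(SAMPLE_EVENT, None))
```

Because the function holds no state and runs only when an event arrives, nothing sits idle between bursts, which is the billing property Bauer describes wanting.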

"We started out with one infrastructure and then tweaked it to get better performance and do the analysis we wanted to do," Bauer said. "I strongly believe that once you go serverless you never go back.

"The speed of innovation is incredible -- you can stand up a minimum viable product in a couple of seconds and with minimal cost and you don't have to think about the underlying infrastructure."
