There's a saying in the scientific community: all biology is computational biology. In other words, when you're dealing with whole genomes, or datasets compiled from thousands of records containing thousands more biomarkers, you need serious computational power – and techniques – to get the most out of the data. Now, cloud platforms built for biomedical research are providing the power that these projects need.
Three years ago, Verily, the life sciences arm of Google's parent company Alphabet, and the Broad Institute, the biomedical and genomic research centre of MIT and Harvard, set up Terra, an open-source cloud research platform for storing and analysing biomedical data.
The platform's creation, says Anthony Philippakis, chief data officer at the Broad Institute, was driven by two trends within the wider biomedical space: an explosion in the volume and types of data available – from genomics data to medical imaging and electronic health records – and a greater need for data sharing within biomedical science.
"No one organisation, whether it be a medical centre, or a university, or a company, has enough data in order to go after profound questions like the genetic basis of the disease," he says.
"For a long time, although we've embraced the idea of data sharing, especially in genomics, and in many other fields, the way we've operationalized it is a bit silly – which is to say that, in order to share data, we have to copy it: we put copies on servers and tell researchers to download it to their local environment," he explains.
That approach is both expensive, because every research institution ends up storing a copy of every dataset, and insecure, because once data has been downloaded it becomes difficult to track and audit who has touched it and for what purpose.
Terra aims to fix that, and now has over 15,500 users. These include the Accelerating Medicines Partnership in Parkinson's Disease, which is researching new biomarkers to help with diagnosing the disease, developing new treatments, or helping refine prognosis, and the National Institutes of Health's All of Us project, which is using longitudinal datasets from one million people to better prevent and manage disease. Verily's own Project Baseline, started in 2017 to track human health over time, is another of Terra's users.
Terra contains both publicly available and access-controlled datasets from nearly two million study participants, which scientists can store and analyse alongside their own data. The platform allows scientists to build workflows in Workflow Description Language (WDL) – commonly used in genomics research to create workflows that can scale along with dataset size – as well as analyse and visualise the results.
Some elements of interrogating data, such as segmenting out particular cohorts from larger datasets, can be done via a point-and-click interface in that dataset's workbench. To further interrogate the data, the cohorts can be loaded into a separate workspace, where a biomedical scientist can perform analysis using Jupyter notebooks pre-populated with basic characterization analysis. Technically minded scientists can then code in languages like R and Python, or use tools like Hail, to do further analysis on their cohorts. For scientists who are less hands-on, workspaces can be shared with others – such as a lab's data scientists – to code the analysis instead.
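To give a rough sense of what that notebook step might look like, here is a minimal Python sketch of segmenting a cohort out of a larger dataset and running a basic characterization on it. It is an illustration only and does not use Terra's own tooling: the `Participant` record, its field names, the `select_cohort` and `characterize` helpers, and the sample records are all hypothetical and made up for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Participant:
    # Hypothetical record layout; real datasets define their own schemas.
    participant_id: str
    age: int
    sex: str
    diagnosis: str


def select_cohort(participants, diagnosis, min_age):
    """Return participants with the given diagnosis who are at least min_age."""
    return [p for p in participants if p.diagnosis == diagnosis and p.age >= min_age]


def characterize(cohort):
    """Basic characterization: cohort size, mean age, and counts by sex."""
    counts = {}
    for p in cohort:
        counts[p.sex] = counts.get(p.sex, 0) + 1
    return {
        "size": len(cohort),
        "mean_age": mean(p.age for p in cohort) if cohort else None,
        "by_sex": counts,
    }


# Made-up illustrative records, not real study data.
participants = [
    Participant("P001", 67, "F", "parkinsons"),
    Participant("P002", 54, "M", "control"),
    Participant("P003", 72, "M", "parkinsons"),
    Participant("P004", 45, "F", "parkinsons"),
]

cohort = select_cohort(participants, diagnosis="parkinsons", min_age=60)
summary = characterize(cohort)
print(summary)
```

In practice a researcher would more likely reach for a dataframe library or a tool such as Hail for data at this scale; plain Python is used here only to keep the sketch self-contained.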
"I can do whatever amount of the hands-on technical bits I want, while collaborating with others who have the complementary skills to me, because science is a team sport," says David Glazer, CTO of Terra at Verily. "There's a spectrum of biomedical chops and a spectrum of technical data-science chops. The team will have people at all ranges and combinations of those skills. Terra lets them work together collaboratively in the cloud, to pull what they're good at with the tools and data that they need to find the insights that they're working on," he says.
The bulk of Terra's users are academic researchers at universities and biomedical science institutions, often following the Broad Institute's example and working on genomics analysis. However, the diversity of data is starting to increase, with customers trying to correlate health outcomes with environmental and lifestyle factors, as well as incorporating medical imaging and time-series data from devices like ECGs.
Unsurprisingly in the current environment, epidemiological research has also taken off, with Terra used to track the geographical spread of different COVID variants.
"Supporting infectious disease research, now more than ever, is a key area of focus within Terra. Right now, we're starting to see a lot more interest worldwide in performing pathogen sequencing in order to be able to identify new strains and track their spread through society," the Broad Institute's Philippakis says.
"We're starting to see now many other public health and state laboratories starting to use Terra to process COVID sequencing data. And as we go forward, being able to enable the construction of a pathogen weather map, if you will – being able to track the spread of pathogens across time and space – is a branch of research that we're very committed to."
Dr Danny Park, group leader for viral computational genomics at the Broad Institute, and his colleagues have been using Terra for COVID-19 research, and released a COVID-19 workspace in Terra early last year.
Among the work that Park and his fellow researchers have used Terra for was an analysis of thousands of patient samples to investigate the origins and spread of COVID-19 in Massachusetts, particularly around local 'super-spreader' events in Boston.
Terra, says Park, is really good for analysing very large datasets and keeping all of the results organised. "Terra's advantage is that it lets you get a little bit more under the hood, so if you are the kind of group that likes to modify your pipelines, your code, your analysis – what you're actually running – a little bit more, and you have a bit more direct access to that," he says.
Terra is also seeing a shift in the type of organisation using the platform: non-profits are now being joined by commercial organisations, such as pharmaceutical companies, which are using it to interrogate and cross-correlate their own datasets.
Terra's adoption by commercial companies is likely to grow, thanks to a recent agreement with Microsoft, which has become the third partner in Terra alongside Verily and the Broad Institute. Terra was previously based entirely on Google Cloud Platform services; now users can also choose to run on Azure. With healthcare organisations traditionally Microsoft shops, having Redmond as a partner could dramatically expand Terra's user base.
Following the deal with Microsoft, Terra customers will also be able to use Microsoft's suite of machine-learning and AI tools, including Azure Synapse Analytics, Azure Machine Learning and Azure Cognitive Services, as part of their analyses.
In the future, the three partners will work on developing services and strategy together, focusing on three areas: core cloud and data infrastructure, analytics and data science, and building "the commercial muscle that both organisations [Verily and Microsoft] bring to the table, because this started with academic science, but it's very quickly moving into translational science in wider commercial enterprises. Really gaining a critical mass of data and users is another very important thing that I expect the three organisations to focus on as we move forward," says Desney Tan, general manager of Microsoft's healthcare business.
The Broad Institute's Park also notes that having Microsoft on board brings additional geographical reach, including an African region that the organisation's partners in countries such as Nigeria, Senegal, and Sierra Leone will be able to use for pathogen analysis. "The geolocality of the data matters," he says.