Genomics England is using a non-relational database to power the data science behind its ambitious 100,000 Genomes Project.
The organisation, which is owned by the UK's Department of Health and Social Care, runs the project, which is sequencing 100,000 whole genomes from patients with rare diseases and their families, and from patients with common cancers.
The project has now reached its halfway point, with over 50,000 genomes sequenced. By the end of 2018, the 100,000 Genomes Project will be complete, with more than 20 petabytes of data stored on the project's infrastructure.
The aim is to harness the power of whole genome sequencing technology to transform the way people are cared for, through the development of new and more effective personalised treatments for patients.
Patients' genomic data is combined with their clinical data to enable interpretation and analysis. This data is also open to researchers and clinicians studying how best to use genomics in healthcare.
The amount of data being processed is enormous. According to MongoDB, the project is sequencing on average 1,000 genomes per week, which produces about 10TB of data per day. To manage this complex and sensitive data set, Genomics England uses MongoDB Enterprise Advanced as a component of its computing platform.
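To put those figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted in this article and assumes decimal units (1PB = 1,000TB, 1TB = 1,000GB); the function names are illustrative. It is a sketch, not an account of how Genomics England sizes its storage.

```python
# Figures quoted in the article
GENOMES_PER_WEEK = 1_000      # average sequencing rate
TB_PER_DAY = 10               # data produced per day
TARGET_GENOMES = 100_000      # project goal
TARGET_STORAGE_PB = 20        # expected storage at completion

# Assumption: decimal (SI) units
TB_PER_PB = 1_000
GB_PER_TB = 1_000


def gb_per_genome_from_throughput():
    """Data per genome implied by the weekly sequencing rate."""
    weekly_tb = TB_PER_DAY * 7
    return weekly_tb * GB_PER_TB / GENOMES_PER_WEEK


def gb_per_genome_from_target():
    """Data per genome implied by the 20PB end-of-project figure."""
    return TARGET_STORAGE_PB * TB_PER_PB * GB_PER_TB / TARGET_GENOMES


def weeks_to_sequence(remaining_genomes, per_week=GENOMES_PER_WEEK):
    """Weeks needed to sequence the remaining genomes at a steady rate."""
    return remaining_genomes / per_week


if __name__ == "__main__":
    print(gb_per_genome_from_throughput())   # GB per genome, from daily output
    print(gb_per_genome_from_target())       # GB per genome, from total storage
    print(weeks_to_sequence(50_000))         # weeks left from the halfway point
```

The two per-genome estimates come out differently (about 70GB from the daily rate and about 200GB from the 20PB total). The article does not explain the gap, so both should be read as order-of-magnitude figures.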
"Managing clinical and genomic data at this scale and complexity has presented interesting challenges," said Augusto Rendon, director of bioinformatics at Genomics England. "Adopting MongoDB has been vital to getting the 100,000 Genomes Project off the ground. It has provided us with great flexibility to store and analyse these complex data sets together."
According to MongoDB, the project's data platforms were built from the ground up. Why MongoDB? The document database was chosen for three core reasons, the company said. The first was the database's "native ability to handle a wide variety of data types, even those data structures that weren't considered at the beginning of the project."
This should make it simpler for developers and data scientists to evolve data models and develop software solutions.
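As a loose illustration of what that flexibility means in practice, the sketch below uses plain Python dictionaries to stand in for documents. It does not use MongoDB or its API, and every record, field name and helper function is made up for the example rather than taken from the project's schema.

```python
# Hypothetical records: the fields are illustrative, not the project's schema.
participants = [
    # A rare-disease participant, linked to a family
    {"participant_id": "P001", "programme": "rare_disease",
     "phenotypes": ["seizures"], "family_id": "F01"},
    # A cancer participant with nested tumour details
    {"participant_id": "P002", "programme": "cancer",
     "tumour": {"site": "lung", "stage": "II"}},
    # A later record carrying a field nobody planned for at the start
    {"participant_id": "P003", "programme": "cancer",
     "tumour": {"site": "breast"},
     "variants": [{"gene": "BRCA1", "classification": "pathogenic"}]},
]


def get_path(doc, path):
    """Follow a dotted path such as 'tumour.site'; return None if a step is missing."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def find(docs, path, value):
    """Return the documents whose value at `path` equals `value`."""
    return [d for d in docs if get_path(d, path) == value]


if __name__ == "__main__":
    for doc in find(participants, "tumour.site", "lung"):
        print(doc["participant_id"])
```

Records of different shapes sit side by side, and a query on a field that some records lack simply skips those records instead of failing. That is the property the company describes: new fields can appear without reworking the existing data.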
The second reason was performance at scale. It became clear that there would be a massive and constantly growing data set "that would need to seamlessly scale across underlying compute resources", the company said. Not only would the data set be big and complex, but researchers would also need to explore it easily, without long waits for simple queries.
The third driver was MongoDB's security features, such as end-to-end encryption and fine-grained control over access to data.
The 100,000 Genomes Project data set is sensitive: as well as the full genetic makeup of a patient, it includes their clinical features and lifetime health data. The data that is analysed is de-identified, and the analysis takes place within a secure, monitored environment. This sensitivity makes encryption and data security vital.