Genomics England is using a non-relational database to power the data science behind its ambitious 100,000 Genomes Project.
The organisation, owned by the UK's Department of Health and Social Care, is sequencing 100,000 whole genomes from patients with rare diseases and their families, as well as patients with common cancers.
The project has now reached its halfway point, with over 50,000 genomes sequenced. By the end of 2018, the 100,000 Genomes Project will be complete, with more than 20 petabytes of data stored on the project's infrastructure.
The aim is to harness the power of whole genome sequencing technology to transform the way people are cared for, through the development of new and more effective personalised treatments for patients.
Patients' genomic data is combined with their clinical data to enable interpretation and analysis. This data is also open to researchers and clinicians studying how best to use genomics in healthcare.
The amount of data being processed is enormous. According to MongoDB, the project is sequencing on average 1,000 genomes per week, producing about 10TB of data per day. To manage this complex and sensitive data set, Genomics England uses MongoDB Enterprise Advanced as a component of its computing platform.
"Managing clinical and genomic data at this scale and complexity has presented interesting challenges," said Augusto Rendon, director of bioinformatics at Genomics England. "Adopting MongoDB has been vital to getting the 100,000 Genomes Project off the ground. It has provided us with great flexibility to store and analyse these complex data sets together."
According to MongoDB the project's data platforms were built from the ground up. Why MongoDB? The document database was chosen for three core reasons, the company said. The first was the database's "native ability to handle a wide variety of data types, even those data structures that weren't considered at the beginning of the project."
This should make it simpler for developers and data scientists to evolve data models and develop software solutions.
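To illustrate what that schema flexibility means in practice, here is a minimal sketch in Python. The field names and records are hypothetical, not Genomics England's actual data model; the point is only that a document database can hold differently shaped records in the same collection without a schema migration.

```python
# Hypothetical records: illustrative only, not the project's real schema.
# In a document database, records in one collection need not share a
# fixed set of fields, so new data structures can be added later.
rare_disease_case = {
    "participant_id": "P-0001",
    "programme": "rare_disease",
    "phenotypes": ["HP:0001250", "HP:0001263"],  # HPO terms
    "family": {"proband": True, "relatives": ["P-0002", "P-0003"]},
}

cancer_case = {
    "participant_id": "P-0100",
    "programme": "cancer",
    "tumour_type": "lung_adenocarcinoma",
    "samples": [{"type": "tumour"}, {"type": "germline"}],  # absent above
}

def programme_of(record):
    # Code reads fields defensively, since shapes may differ per record.
    return record.get("programme", "unknown")
```

Both shapes can live side by side, and a record type that "wasn't considered at the beginning of the project" can simply be inserted alongside the others.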
The second reason was performance at scale. It became clear that there would be a massive and constantly growing dataset "that would need to seamlessly scale across underlying compute resources", the company said. Not only would the dataset be big and complex, but researchers would need to explore it easily, without long waits for simple queries.
The third driver was MongoDB's security features, such as end-to-end encryption and fine-grained access control.
The 100,000 Genomes Project dataset is highly sensitive: as well as a patient's full genetic makeup, it includes their clinical features and lifetime health data. Rather than being shared directly, de-identified data is analysed within a secure, monitored environment. Obviously, this makes encryption and data security vital.
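The de-identification step above can be sketched very roughly. This is a toy example with made-up field names, not the project's actual pipeline (real pseudonymisation involves key management, auditing and re-identification risk analysis), but it shows the basic idea of stripping direct identifiers before data enters the research environment.

```python
# Toy de-identification sketch; field names are assumptions, not the
# project's real schema. Direct identifiers are removed before analysis.
IDENTIFYING_FIELDS = {"name", "nhs_number", "date_of_birth", "postcode"}

def de_identify(record):
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}

patient = {
    "name": "Jane Doe",
    "nhs_number": "123 456 7890",
    "date_of_birth": "1970-01-01",
    "postcode": "AB1 2CD",
    "participant_id": "P-0001",   # stable research pseudonym
    "phenotypes": ["HP:0001250"],
}

clean = de_identify(patient)
# 'clean' retains only research-relevant fields such as participant_id.
```

The research pseudonym (`participant_id` here) is what lets analyses link a genome to clinical data without exposing who the patient is.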