This open source project is using Python, SQL and Docker to understand coronavirus health data

Django and Python developers working alongside clinicians and researchers have built a new analytics platform that looks at electronic health records for 24 million people.

As the largest health provider in the world, the NHS holds an unparalleled amount of health data, which scientists and researchers should be able to draw on to help them find ways to treat or prevent diseases.

In practice, NHS patient data hasn't always been as accessible to researchers as they would have wanted. 

But the urgent threat of coronavirus created an impetus to put the huge repository of data at researchers' disposal as soon as possible, in order to help them find answers to questions such as why some people are more likely to die from the disease, and whether the medications a patient takes can affect whether they develop severe symptoms or not. 

OpenSafely, a new open-source analytics platform, has made the NHS health records of tens of millions of people in the UK available for researchers to analyse in their fight against COVID-19. The records contain the full pseudonymised primary care data of 24 million people, with more to be added shortly. The analytic software is open for security review, scientific review, and re-use. The tools are built in Python, SQL, and Docker, with additional statistical analyses called from Stata and R; all the code and analyses are managed through GitHub.

OpenSafely was created in just five weeks by the University of Oxford, the London School of Hygiene and Tropical Medicine, and health records companies including TPP; NHS England is acting as the data controller. While the idea of creating an analytics platform like OpenSafely predated COVID, the threat of the disease and an understanding of the value of the data the NHS holds spurred the organisations to kickstart the project. At the same time, the COPI notice from NHS X, the health service's tech and digital unit, made information governance around patient data during coronavirus more straightforward.

"There was a need to access an unprecedented scale of data, but to do that, we had to come up with a model that was much more secure than anything that had gone before," says Dr Ben Goldacre, director of the University of Oxford's EBM Data Lab.

Issues around security and privacy have cast a shadow over projects looking to use NHS data for research in the past and, given the extreme sensitivity of health data, making sure that 'anonymised' or 'pseudonymised' records couldn't be reverse engineered into giving up sensitive data on an individual was key for OpenSafely.
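
As a rough illustration of what pseudonymisation means, and why it is not sufficient on its own, the sketch below replaces a patient identifier with a keyed hash. This is not a description of how OpenSafely or TPP pseudonymise records, which the article does not detail; the key, the field names and the `pseudonymise` helper are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data controller; not a real key.
SECRET_KEY = b"example-key-held-by-data-controller"


def pseudonymise(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


# A made-up record; the field names are assumptions for illustration.
record = {"patient_id": "example-patient-001", "age": 67, "diagnosis": "asthma"}
pseudonymised = {**record, "patient_id": pseudonymise(record["patient_id"])}
```

The identifier is no longer readable, but the age and diagnosis fields are untouched, and a rare enough combination of such fields could still point to one person. That residual risk is why the pseudonymised records are not handed over to researchers directly.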

To do this, OpenSafely uses a series of tiered tables, each giving up less and less information on individuals, and researchers cannot run database queries on the raw event-level patient data.

"They provide a description of what their analytic cohort should look like, in code, and then that runs remotely. They can't do a simple database query, which is where all of the security risks would reside," Goldacre says.

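The idea Goldacre describes can be sketched in a few lines of Python. This is a simplified illustration rather than OpenSafely's actual study definition format: `CohortSpec`, `run_on_server` and the clinical codes are hypothetical names, and the event rows are made up. The researcher writes only the description; the function that touches event-level rows runs where the data lives and hands back counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortSpec:
    """Hypothetical cohort description, written by the researcher."""
    min_age: int
    condition_codes: frozenset  # clinical codes that define the condition


def run_on_server(spec, events):
    """Runs where the records live; returns counts only, never rows."""
    matched = {
        e["patient"]
        for e in events
        if e["age"] >= spec.min_age and e["code"] in spec.condition_codes
    }
    all_patients = {e["patient"] for e in events}
    return {"cohort_size": len(matched), "population": len(all_patients)}


# Made-up event-level rows standing in for data the researcher never sees.
EVENTS = [
    {"patient": "p1", "age": 70, "code": "ASTHMA"},
    {"patient": "p1", "age": 70, "code": "DIABETES"},
    {"patient": "p2", "age": 45, "code": "ASTHMA"},
    {"patient": "p3", "age": 82, "code": "DIABETES"},
]

spec = CohortSpec(min_age=65, condition_codes=frozenset({"DIABETES"}))
result = run_on_server(spec, EVENTS)
```

Because the description is ordinary code, it can be reviewed and version-controlled like any other file before it is ever run against real records.
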
To keep NHS patients' data as secure as possible, OpenSafely has shifted from a model based on trust (where trusted researchers are approved to work on raw data) to one more based on proof.

"That's partly a concept that you inherit from working with software developers. You put tests in your code, you want proof that something works, you don't want to rely on trust," Goldacre says.

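For readers less familiar with the practice Goldacre is borrowing from, here is a minimal sketch of a test acting as proof, using Python's standard `unittest` module. The analysis function `deaths_by_sex`, the field names and the rows are invented for illustration and are not taken from the OpenSafely codebase.

```python
import unittest


def deaths_by_sex(rows):
    """Count deaths per sex; returns aggregate counts only."""
    counts = {}
    for row in rows:
        if row["died"]:
            counts[row["sex"]] = counts.get(row["sex"], 0) + 1
    return counts


# Made-up rows for the tests; no real patient data.
ROWS = [
    {"patient": "p1", "sex": "M", "died": True},
    {"patient": "p2", "sex": "F", "died": False},
    {"patient": "p3", "sex": "M", "died": True},
    {"patient": "p4", "sex": "F", "died": True},
]


class DeathsBySexTest(unittest.TestCase):
    def test_counts_match_known_input(self):
        self.assertEqual(deaths_by_sex(ROWS), {"M": 2, "F": 1})

    def test_output_contains_no_patient_ids(self):
        output = deaths_by_sex(ROWS)
        patient_ids = {row["patient"] for row in ROWS}
        self.assertTrue(patient_ids.isdisjoint(output.keys()))
        self.assertTrue(all(isinstance(n, int) for n in output.values()))


# Run the tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DeathsBySexTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The second test is the more interesting one here: it does not check that the analysis is right, but that its output cannot contain a patient identifier.
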
"I think it would have been unambiguously completely impossible and incredibly dangerous to analyse the primary care records of 40% of the population using the traditional model of large data extracts. That would have been unimaginably dangerous and I think even a general purpose trusted research environment would have been very, very risky."

Researchers will only be able to analyse the OpenSafely data inside the electronic health record company's datacentre. Rather than the usual model of exporting datasets that researchers work on locally (and so exposing them to all the local security risks), all the analysis takes place where the records reside and only summary tables can be extracted by researchers.
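
A summary table of this kind might be produced along the following lines. The sketch is an assumption-laden illustration, not OpenSafely's release process: the article does not say how outputs are checked, and the `summary_table` helper, the `age_band` column and the minimum cell count of 5 are all chosen here for the example. Suppressing small cells is a common disclosure-control step, since a count of one or two can effectively identify a person.

```python
from collections import Counter

# Assumed threshold for illustration; the article does not state one.
MIN_CELL_COUNT = 5


def summary_table(rows, column, min_count=MIN_CELL_COUNT):
    """Count rows per value of `column`, hiding cells below `min_count`."""
    counts = Counter(row[column] for row in rows)
    return {
        value: (n if n >= min_count else "suppressed")
        for value, n in counts.items()
    }


# Made-up patient-level rows that stay inside the datacentre.
ROWS = (
    [{"age_band": "60-69"}] * 12
    + [{"age_band": "70-79"}] * 7
    + [{"age_band": "90+"}] * 2
)

# Only this aggregated table would leave the secure environment.
table = summary_table(ROWS, "age_band")
```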

OpenSafely is also available under an open-source licence, with all code published on GitHub alongside the study definition for the first study run on the data.

Projects like OpenSafely could ultimately help push the research community to a more open, less proprietorial stance with their data and analysis. "In some respects, we have built OpenSafely to help and encourage epidemiologists to become better at sharing their work, not by hectoring them, but just by making it a completely normal part of the workflow," he says.

The system makes a feature of sharing your working out -- more openness than clinicians and researchers might traditionally have felt comfortable with.

The way the group have built OpenSafely aims to encourage researchers to share everything they do as they go. When users make a code list -- for example, the set of clinical codes that identify a particular condition -- or an analytic script, it's all shared on GitHub.

"Everything that you do is shared by design," Goldacre adds.

It hasn't taken long for OpenSafely to bear its first fruit: a study of 17 million records published last month found that people from Black and Asian backgrounds were more at risk of dying of COVID-19, even when their additional medical risk factors and any social deprivation had been accounted for. It also identified key risk factors for death from COVID, including being male, older age, severe asthma, and poorly controlled diabetes.

OpenSafely has been able to go from setup to first research in a matter of weeks by using a team that included 'developer-epidemiologists' who could understand and assist both the IT and software staff and the researchers working on the project.

"We have software developers -- proper, commercial grade, full stack, Django and Python developers -- working alongside clinicians and researchers, because that's the only way we were able to build OpenSafely with a small tight team. In our group over the course of the last five years, we've built a team where our software developers know a lot about how health data works, how clinical trials work, how the NHS works operationally, and how research works, but also we've got clinicians and researchers who know how software developers work. They know how to use GitHub, they know how to use Docker, they write Jupyter notebooks in Python. And so, that means that everyone can be much more fluent and creative around building new tools and services," Goldacre says.

"I don't think it's necessary for absolutely everybody to have those computational data science skills, but I think it's a really important skills gap that has been really neglected."

More research publications are likely to follow from OpenSafely -- as well as the team's own pipeline of research, there have been around 100 approaches from other researchers seeking to run their own analysis with the data. 

It's expected the data will be used to help answer questions on how effective proposed treatments for COVID-19 are, what the risk factors are for developing severe symptoms or needing ITU admission, how the disease might spread and affect healthcare needs within a given area, and how effective public health interventions like lockdowns have been. It could even be used to work out the impact of coronavirus aftershocks -- unexpected health impacts caused by the virus, such as delayed cancer referrals or vaccinations.

There are also plans for a 'phase two' of the project, which will move it beyond purely supporting urgent coronavirus-related research, and onto looking at how to enable wider health research on NHS primary care datasets.