Part of a ZDNet Special Feature: Managing AI and ML in the Enterprise

Healthcare and artificial intelligence: How Databricks uses Apache Spark to analyze huge data sets

Pharma companies use the open-source platform to build data lakes with internal and external data sets.


There is no shortage of big data sets in the healthcare world, encompassing everything from chest X-rays to drug research. Startups and established companies alike are using artificial intelligence (AI) and machine learning to analyze these data sets, then using the results to guide business strategy and treatment plans.


In the AI 100: The Artificial Intelligence Startups Redefining Industries, CBInsights reported that healthcare is the top industry for emerging uses of artificial intelligence. Thirteen of the 100 companies surveyed are focused on healthcare, including Subtle Medical, which uses AI to enhance radiology images; Viz.ai, which uses deep learning to identify blocked arteries and veins; and Butterfly Network, which is building a portable ultrasound device that uses an AI-assisted diagnostic tool. Butterfly is also applying its platform to COVID-19 patients by looking for infection patterns in the lungs that indicate illness.

Open source framework benefits healthcare IT firm

These companies specialize in particular conditions, but one healthcare IT firm is building on an open source framework to open up the analysis of all kinds of data sets.

Databricks was founded by the original creators of Apache Spark, an open-source distributed cluster-computing framework written largely in Scala. Databricks grew out of the AMPLab project at the University of California, Berkeley.

Frank Nothaft, technical director of healthcare and life sciences at Databricks, said that Apache Spark's distributed data processing engine is perfect for running complex queries at large scale, which is the kind of computational power required to analyze data sets related to drug development.


"Five years ago the largest table had three million rows; today the largest tables have up to 60 billion rows," he said.

Nothaft described the company as 'big data analytics and machine learning on top of cloud computing'. The company was founded in 2013, released its first product in 2015, and launched its healthcare group in 2017.

"We have launched a genomics product, we are working in medical imaging, and we are doing an increasingly large amount of work in the clinical and claims processing space," he said.

Building data lakes 

Nothaft said the company's first step in the product development process was to build a cloud management layer to make it easy for users to spin up clusters quickly. "This also helped on the admin side to manage cost, access, and compliance on the data side," he added.

The company's pharmaceutical clients use the platform for early research and drug discovery, clinical trials, and manufacturing. Databricks is best suited for data preparation and the extract, transform, and load (ETL) process, Nothaft said. 
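For readers unfamiliar with the term, the ETL pattern can be sketched in a few lines of plain Python. This is an illustration only, not Databricks' or Spark's API: the sample records, the column names, the `min_quality` threshold, and the `variants` table are all made up for the example, and an in-memory SQLite database stands in for a real data lake.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: one row per patient variant call (made-up data).
RAW_CSV = """patient_id,gene,variant,quality
p001,GENE_A,var_1,99
p002,gene_a,var_1,12
p003,GENE_B,var_2,87
p004,,var_2,90
"""

def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows, min_quality=30):
    """Transform: drop incomplete or low-quality rows, normalize gene names."""
    cleaned = []
    for row in rows:
        if not row["gene"]:
            continue  # missing gene name
        if int(row["quality"]) < min_quality:
            continue  # below the (assumed) quality threshold
        cleaned.append((row["patient_id"], row["gene"].upper(), row["variant"]))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS variants "
        "(patient_id TEXT, gene TEXT, variant TEXT)"
    )
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?)", records)
    conn.commit()

def run_pipeline(text=RAW_CSV):
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_pipeline()` returns a connection whose `variants` table holds only the rows that survived cleaning. At the scale Nothaft describes, the same three steps would run across a cluster rather than in a single process.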

Pharmaceutical company Novartis used the platform to build a research data lake. "We combined all of the genomic data that they have and the molecule data so that scientists could run queries on top of the genomic data to identify associations," said Nothaft.

Nothaft added that in the pharma industry there is often a skill set gap between data scientists and domain scientists who specialize in biology and chemistry. With one client, the ETL process took three weeks to ingest genetic sequencing data from one million patients. Once the ETL process is in place, internal teams can manage it.

"Our goal is to push data prep into the hands of the scientists," he said.

The importance of a knowledge graph

Nothaft said that most companies build a machine learning layer that aggregates all internal data for internal use. For example, AstraZeneca built a knowledge graph that combines internal data sets with data extracted from public sources, then built algorithms on top of that data.
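A knowledge graph of this kind can be pictured as a set of subject-relation-object triples drawn from several sources and merged into one structure. The Python sketch below is a toy illustration under that assumption, not AstraZeneca's or Databricks' implementation; the compound, protein, and disease names are placeholders, and the relation names `inhibits` and `associated_with` are invented for the example.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); all names are placeholders.
INTERNAL = [
    ("compound_A", "inhibits", "protein_X"),
    ("compound_B", "inhibits", "protein_Y"),
]
PUBLIC = [
    ("protein_X", "associated_with", "disease_1"),
    ("protein_Y", "associated_with", "disease_2"),
    ("protein_X", "associated_with", "disease_2"),
]

def build_graph(*sources):
    """Merge triples from every source into one adjacency map."""
    graph = defaultdict(set)
    for triples in sources:
        for subj, rel, obj in triples:
            graph[subj].add((rel, obj))
    return graph

def candidate_diseases(graph, compound):
    """Follow compound -> inhibits -> protein -> associated_with -> disease."""
    diseases = set()
    for rel, protein in graph.get(compound, ()):
        if rel != "inhibits":
            continue
        for rel2, disease in graph.get(protein, ()):
            if rel2 == "associated_with":
                diseases.add(disease)
    return sorted(diseases)
```

With both sources merged by `build_graph(INTERNAL, PUBLIC)`, a single traversal links an internal compound to diseases that appear only in the public data, which is the kind of cross-source query a shared graph makes possible.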

"This helps the researchers figure out which experiments to run, and which experiments not to run so they can spend more time on high-potential experiments," he said.

Nothaft also said that creating a knowledge graph can make it easier for divisions within a pharma company to collaborate. "If everyone's data is in one place I can run the query without talking to anyone and get it in 30 minutes," he said. 

However, one challenge is the fact that every data set contains personal health information, which comes with lots of compliance rules. Nothaft said that the Databricks platform has a governance layer built into it.


Turning genetic analysis into action

Michael Ortega, head of communications for healthcare and life sciences at Databricks, said that he sees more large healthcare organizations bringing this kind of big data analysis in-house. 

Databricks works with Sanford Health, a healthcare system that includes 44 hospitals, 1,400 physicians, and more than 200 senior care locations in 26 states and nine countries. Sanford also has a health insurance plan. 

Many of Sanford's clinics are located in the Dakotas and the upper plains. Some patients are Native Americans with distinct genetic profiles or people with specific environmental risk factors, including working in the oil and gas industry. If a doctor wants to do a genetic analysis for a patient, that usually requires using an external lab and giving up ownership of the data.  

"The best thing we can do to serve them is to help them bring this analysis in-house, which is a high-value service but also helps them keep costs down," Ortega said.

Ortega also said that Databricks has worked with clients to improve personalized medicine, such as predicting the progression of Alzheimer's and helping people make lifestyle adjustments. He said clients have combined genomic profiles and brain images to identify a new biomarker that can more precisely predict a person's risk of developing the disease.

"When people look at genetic reports, they really don't understand how to take the risk factors and turn those into behavioral changes," he said. "What are we doing to make sure people still have access to risk factors, but have more actionable information?"
