IBM offers open source notebooks for COVID-19 data analysis

Using developer-friendly Jupyter notebooks, IBM has built a toolkit designed to aggregate and clean up authoritative COVID-19 data.

Written by Stephanie Condon, Senior Writer June 25, 2020 at 6:00 a.m. PT

IBM on Thursday unveiled a new, open-source toolkit designed for developers and data scientists who want to help spot trends in the ongoing COVID-19 pandemic. Built on developer-friendly Jupyter notebooks, the toolkit is designed to kickstart in-depth analysis. For instance, a user could analyze county-level data in the US to find correlations between poverty levels and infection rates.
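As a rough illustration of that kind of analysis, the sketch below computes a Pearson correlation between poverty rate and per-capita case counts with Pandas. It is not taken from IBM's notebooks: the column names (`poverty_rate`, `confirmed_cases`, `population`) and every value in the table are made up for the example, and a real analysis would join census and case data on a county identifier first.

```python
import pandas as pd

# Illustrative, made-up county table; NOT real COVID-19 or census data.
county = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "population": [10_000, 50_000, 20_000, 80_000, 40_000],
    "poverty_rate": [0.08, 0.12, 0.15, 0.20, 0.25],
    "confirmed_cases": [50, 400, 220, 1200, 800],
})

# Normalize by population so large and small counties are comparable.
county["cases_per_100k"] = (
    county["confirmed_cases"] / county["population"] * 100_000
)

# Series.corr defaults to the Pearson correlation coefficient.
r = county["poverty_rate"].corr(county["cases_per_100k"])
print(f"Pearson r between poverty rate and cases per 100k: {r:.2f}")
```

A coefficient like this only measures linear association across counties; it says nothing on its own about cause.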

"IBM and our team believe in the importance of democratizing technology, activating developers with the most up-to-date datasets and tools, which can help policymakers make the most informed decisions for citizens' well-being," Frederick Reiss, chief architect for IBM's Center for Open Source Data and AI Technologies, wrote in a blog post.

The toolkit aggregates and cleans up COVID-19 data from authoritative sources, formatting it for analysis with tools like Pandas and Scikit-Learn. For county-level data from the US, IBM relies on the COVID-19 Data Repository, run by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. To supplement that information, the toolkit draws on The New York Times' Coronavirus (Covid-19) Data in the United States repository and on New York newspaper THE CITY's digest of the daily reports from the New York City Department of Health and Mental Hygiene. For other countries, the notebooks use the European Centre for Disease Prevention and Control's data on the geographic distribution of COVID-19 cases worldwide.
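To give a sense of what "cleaning up" such data for Pandas can involve, here is a minimal sketch that reshapes a wide table (one column per date) into a tidy long table and derives daily new cases from cumulative counts. The wide layout, the column names, the helper `to_long`, and all values are assumptions made for illustration; this is not IBM's actual cleaning code.

```python
import pandas as pd

# Toy stand-in for a wide, one-column-per-date file (values are made up).
wide = pd.DataFrame({
    "FIPS": [1001, 1003],
    "County": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 1],
    "1/24/20": [5, 4],
})

def to_long(wide_df, id_cols):
    """Reshape wide cumulative counts into one row per (county, date)."""
    long_df = wide_df.melt(
        id_vars=id_cols, var_name="date", value_name="cumulative_cases"
    )
    long_df["date"] = pd.to_datetime(long_df["date"], format="%m/%d/%y")
    long_df = long_df.sort_values(id_cols + ["date"]).reset_index(drop=True)
    # Daily new cases are the day-over-day difference within each county;
    # the first day has no predecessor, so it keeps its cumulative value.
    long_df["new_cases"] = (
        long_df.groupby(id_cols)["cumulative_cases"]
        .diff()
        .fillna(long_df["cumulative_cases"])
    )
    return long_df

long_cases = to_long(wide, ["FIPS", "County"])
print(long_cases)
```

The long format is usually easier to group, plot, and feed into Scikit-Learn than a table with one column per day.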

Because the data sets change daily, the notebooks download them each time they run. In addition, the license terms of the data sets prohibit commercial entities from redistributing the data.
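A download-at-run-time step might look something like the following sketch, which fetches a CSV over HTTP on every run instead of shipping a copy with the code. The function names `fetch_text` and `load_latest` and the URL are hypothetical placeholders, not the toolkit's real API or a real data location.

```python
import io
import urllib.request

import pandas as pd

def fetch_text(url, timeout=30):
    """Download a text resource and return its contents as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def load_latest(url, fetch=fetch_text):
    """Fetch the current CSV at `url` and parse it into a DataFrame.

    Nothing is written to disk, so each run sees the publisher's
    latest version and no copy of the data is bundled with the code.
    """
    return pd.read_csv(io.StringIO(fetch(url)))

# Hypothetical usage; replace the placeholder with a real source URL:
# cases = load_latest("https://example.org/covid/cases.csv")
```

Passing the fetch function as a parameter keeps the parsing step testable without network access.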

To help users keep their notebooks up to date with the latest information, IBM has also created data processing pipelines. For instance -- as illustrated in the image below -- a user could build a pipeline for county-level time series data for the United States. Each box represents a Jupyter notebook. A user can click on the arrow in the toolbar above the workflow to ship the entire set of notebooks to the cloud. From there, all the notebooks run on Kubeflow Pipelines, and the results are saved to the cloud provider's object storage.

"It's important to note that the underlying data for COVID-19 changes on a daily basis," Reiss wrote. "As you build your own analysis, you'll want to update the results of your own notebooks frequently."
