Part of a ZDNet Special Feature: Coronavirus: Business and technology in a pandemic

IBM offers open source notebooks for COVID-19 data analysis

Using developer-friendly Jupyter notebooks, IBM has built a toolkit designed to aggregate and clean up authoritative COVID-19 data.

IBM on Thursday unveiled a new, open-source toolkit for developers and data scientists who want to help spot trends in the ongoing COVID-19 pandemic. Built on developer-friendly Jupyter notebooks, the toolkit is designed as a way to kickstart in-depth analysis. For instance, a user could analyze county-level data in the US to find correlations between poverty levels and infection rates.
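As a rough illustration of that kind of county-level analysis, here is a minimal Pandas sketch. It is not taken from IBM's notebooks: the column names, the helper function and every number in the table are made up for demonstration, and a real analysis would join actual case counts with actual census poverty figures.

```python
import pandas as pd

# Made-up illustrative values, NOT real statistics. Column names are
# assumptions for this sketch, not the toolkit's actual schema.
counties = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "population": [100_000, 250_000, 50_000, 400_000, 80_000],
    "poverty_rate": [0.08, 0.12, 0.20, 0.10, 0.25],
    "confirmed_cases": [300, 1000, 400, 1400, 720],
})


def poverty_infection_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between poverty rate and cases per 100,000 residents."""
    # Normalize by population so large counties do not dominate the result.
    cases_per_100k = df["confirmed_cases"] / df["population"] * 100_000
    # Series.corr uses the Pearson method by default.
    return df["poverty_rate"].corr(cases_per_100k)


r = poverty_infection_correlation(counties)
print(f"Pearson r = {r:.3f}")
```

A correlation coefficient like this only describes association in the sample. It says nothing about cause, and factors such as testing availability or population density would need to be accounted for separately.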

"IBM and our team believe in the importance of democratizing technology, activating developers with the most up-to-date datasets and tools, which can help policymakers make the most informed decisions for citizens' well-being," Frederick Reiss, chief architect for IBM's Center for Open Source Data and AI Technologies, wrote in a blog post. 

The toolkit aggregates and cleans up COVID-19 data from authoritative sources, formatting it for analysis with tools like Pandas and Scikit-Learn. For county-level data from the US, the notebooks rely on the COVID-19 Data Repository, run by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. To supplement that information, the toolkit draws on The New York Times' Coronavirus (Covid-19) Data in the United States repository, as well as the digest that New York newspaper THE CITY compiles from the daily reports of the New York City Department of Health and Mental Hygiene. For other countries, the notebooks use the European Centre for Disease Prevention and Control's data on the geographic distribution of COVID-19 cases worldwide.
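To give a sense of what that cleanup can involve, the sketch below reshapes a small "wide" table, with one column per date, into the "long" layout that Pandas and Scikit-Learn workflows usually expect. It is a sketch under stated assumptions rather than IBM's actual code: the column names and date format are assumed, the helper `to_long` is hypothetical, and the counties and counts are placeholders.

```python
import pandas as pd

# Placeholder rows with fictional counties and counts. The layout (identifier
# columns followed by one cumulative-count column per date) is an assumption.
wide = pd.DataFrame({
    "FIPS": [99001, 99002],
    "county": ["Example County A", "Example County B"],
    "state": ["Example State", "Example State"],
    "3/30/20": [6, 18],
    "3/31/20": [7, 19],
    "4/1/20": [10, 23],
})


def to_long(df: pd.DataFrame, id_cols: list, key_col: str) -> pd.DataFrame:
    """Turn one-column-per-date data into one row per (county, date)."""
    # melt() stacks every non-identifier column into (date, confirmed) pairs.
    long = df.melt(id_vars=id_cols, var_name="date", value_name="confirmed")
    # Parse the assumed month/day/two-digit-year column labels into timestamps.
    long["date"] = pd.to_datetime(long["date"], format="%m/%d/%y")
    long = long.sort_values([key_col, "date"]).reset_index(drop=True)
    # Daily new cases are the day-over-day difference of the cumulative count;
    # the first date for each county has no previous day, so it is NaN.
    long["new_cases"] = long.groupby(key_col)["confirmed"].diff()
    return long


long = to_long(wide, id_cols=["FIPS", "county", "state"], key_col="FIPS")
print(long)
```

The long layout makes it straightforward to filter by date, group by county, or join against other county-level tables on the identifier column.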

Because the data sets change daily, the notebooks download them each time they run. Moreover, the license terms of the data sets prohibit commercial entities from redistributing the data.

To help users keep their notebooks up to date with the latest information, IBM has also created data processing pipelines. For instance -- as illustrated in the image below -- a user could build a pipeline for county-level time series data for the United States. Each box represents a Jupyter notebook. A user can click on the arrow in the toolbar above the workflow to ship the entire set of notebooks to the cloud. From there, all the notebooks run on Kubeflow Pipelines, and the results are saved to the cloud provider's object storage.   

"It's important to note that the underlying data for COVID-19 changes on a daily basis," Reiss wrote. "As you build your own analysis, you'll want to update the results of your own notebooks frequently."