AtScale, Databricks and others release advanced COVID-19 data resources
Tracking down various COVID-19 datasets and integrating them is hard. In response, AtScale is making a BI-ready COVID-19 dimensional model available; Databricks has added COVID-19 datasets to its platform, and is launching a related Hackathon; and Looker and others are making their own offers of COVID-19 data and compute resource availability.
A few weeks ago, I covered Tableau's release of a dataset, updated daily, with a simplified presentation of Johns Hopkins Center for Systems Science and Engineering's (JHU's) global COVID-19 dataset. That was an important step in democratizing the data so that people could connect to and analyze it on a self-service basis. I was one such person and performed a few simple analyses on that I shared in the post.
Meanwhile, people are hungry for more. Data enthusiasts and epidemic specialists alike want access to metrics beyond confirmed cases and deaths, as well as demographic data outside the scope of COVID-19 itself. There's a lot of public data out there, but tracking it down, cleaning it, blending it and modeling it isn't trivial. So now, various companies in the data space are working to address pain points and make it easier to work with this broader range of data.
OLAP for COVID
Let's start with AtScale, the San Mateo- and Boston-based company focused on OLAP over big data in the cloud. The company is announcing today its COVID-19 Cloud OLAP Model, ready for drill-down analysis. AtScale is hosting the model on its own platform and making it available for querying, free of charge. Datasets include the Starschema: COVID-19 Epidemiological Data, which is available through Snowflake's Data Exchange, and data from Boston Children's Hospital's COVIDNearYou.org. AtScale says their model is updated daily, as are the source datasets.
To gain access to the AtScale model, interested parties can request access here. AtScale will respond with an email providing login information and connection instructions. Attached to that email are fully developed Excel and Tableau workbooks based on the model (the Excel one is pictured above). Users can open those workbooks, plug in their unique user id and password, then start slicing, dicing and analyzing.
Databricks provides easy data access, launches hackathon
Meanwhile, Databricks, whose Spark-based platform serves as a workbench for data engineers and data scientists, is adding value to the COVID-19 data scene too. To begin with, Databricks has added various COVID-19 datasets to be available natively on its platform (on both the Amazon Web Services and Microsoft Azure clouds). Specifically, developers can find the data in the "/databricks-datasets/COVID/" folder built in the Databricks file system (DBFS), on either the paid service or the free Community Edition. In other words, spin up any Databricks cluster and COVID-19 data will be in its file system, automatically. The company has also created sample workbooks that demonstrate how to open up the data and analyze it -- details on the datasets and links to the notebooks are provided in a blog post by Databricks' Denny Lee.
In addition to the data availability, and in coordination with Databricks' upcoming Spark + AI Summit virtual event, Databricks is launching a related hackathon under the banner "Data Teams Unite!" Teams who enter the hackathon will be asked to focus on COVID-19, climate change or challenges in their own communities (using open data resources made available by national, regional, state and local governments). Since Databricks' event is a virtual one this year, and is free, the company expects a significant increase in attendance and hopes to see robust hackathon participation. Teams of up to 4 people may enter the hackathon. Three finalist teams will be selected and Databricks will make direct donations to charities of the teams' choice; the grand prize winner also receives free training and a ticket to a future Spark + AI event. The Hackathon begins today and entries are due June 12. Judging will take place between June 15-19.
There are a lot of resources out there that go well beyond CSV files. Specialists focused on the crisis have lots of choices; that should help them get to insights -- and, hopefully, sound policies and effective protocols -- faster. And if you're a non-specialist and find yourself home due to lockdown, in need a project to focus on, maybe you can take advantage of all these great COVID-19 data resources as well.
Updated April 22 at 12:40pm ET to revise Databricks hackathon due date and judging period from May 29 and June 1-5, respectively to June 12 and June 15-19, respectively.