White House leads effort to publish COVID-19 open research data set

In the span of a week, a unique collaborative effort between academic, government and industry researchers has produced a new COVID-19 dataset for the worldwide machine learning community.
Written by Stephanie Condon, Senior Writer

In the span of a week, a unique cooperative effort between academic, government and industry researchers has produced a new, structured dataset that the worldwide machine learning community can use to advance COVID-19 research. The COVID-19 Open Research Dataset (CORD-19), which comprises more than 24,000 scholarly articles (including more than 10,000 full-text articles) about the coronavirus family of viruses, goes live on Monday at SemanticScholar.org. It's the most extensive machine-readable coronavirus literature collection available for data and text mining to date.

Organized by the White House, the organizations that helped structure the data include the Allen Institute for AI, the Chan Zuckerberg Initiative, Georgetown University's Center for Security and Emerging Technology, Microsoft Research and the National Library of Medicine (NLM) of the National Institutes of Health (NIH). 

Now that the dataset is available, the White House Office of Science and Technology Policy, as well as the organizations involved, are issuing a call to action to the nation's AI experts to develop new text and data mining techniques that could help answer high-priority scientific questions related to COVID-19. 

The questions relate to the virus's incubation, treatment, symptoms and prevention, according to US CTO Michael Kratsios, and were developed in coordination with the World Health Organization (WHO) and the National Academies of Sciences, Engineering, and Medicine's Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats. The key questions are available on Kaggle, where researchers can submit their insights. 

This is a "truly all-hands-on-deck approach," Kratsios said Monday. 

In the face of a crisis like the COVID-19 pandemic, "the biggest challenge a researcher faces initially is understanding, 'Where can I contribute? What has already been done?'" the Allen Institute's Doug Raymond said to ZDNet. "Without resources like the core dataset we're releasing, that is a time-consuming problem." 

Indeed, research on the novel coronavirus and related viruses spans decades, across multiple institutions -- and it's evolving quickly. 

"To have it all structured and in one place... so you can understand the state of the art and what the science is, that is an immediate boost to current and future efforts," Raymond said. 

The COVID-19 coronavirus outbreak has had far-reaching ramifications across the globe. As of Monday, there were more than 179,000 confirmed cases of COVID-19 globally, including more than 7,000 deaths. Last week, the WHO officially declared the novel coronavirus a pandemic, while President Trump announced a ban on some travel from Europe to the US. States and municipalities are imposing limits on large gatherings, while many large companies like Google and Apple are urging employees to work from home. 

The first version of the full-text repository will be publicly available at the Allen Institute's Semantic Scholar site. Researchers will continue to update it as new insights are published in archival services (such as bioRxiv, medRxiv and others) and peer-reviewed publications.  

To build the dataset, Microsoft used its web-scale literature curation tools to pull together global scientific efforts and results. The NLM provided access to literature content, while the Allen Institute transformed the content into machine-readable form.

The articles included are typically published in a PDF format, so the first step to making them machine-readable is to extract the text. That entails identifying information such as citations, authors and dates -- essentially taking a static file and putting it in a format that can be referenced and processed by other applications. The researchers have put the articles in a JSON format.
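To illustrate what a structured article record like this enables, here is a minimal Python sketch that parses one such JSON record and pulls out the referenceable fields. The field names used (`paper_id`, `metadata`, `body_text`) are assumptions for illustration, not the dataset's confirmed schema:

```python
import json

# Illustrative article record. The field names below are assumed for the
# example; consult the released dataset for the actual schema.
record = json.loads("""
{
  "paper_id": "abc123",
  "metadata": {
    "title": "A survey of coronavirus research",
    "authors": [{"first": "Jane", "last": "Doe"}]
  },
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses are a family of viruses."}
  ]
}
""")

def summarize(rec):
    """Extract the fields other applications would reference."""
    meta = rec["metadata"]
    authors = ", ".join(f"{a['first']} {a['last']}" for a in meta["authors"])
    sections = [para["section"] for para in rec["body_text"]]
    return {"id": rec["paper_id"], "title": meta["title"],
            "authors": authors, "sections": sections}

print(summarize(record))
```

Because every article shares one machine-readable layout, a text-mining pipeline can iterate over thousands of such records without any PDF parsing of its own.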

"We've done the heavy lifting in terms of making it available for future processing," Raymond said. "Researchers should be equipped to incorporate it in whatever projects they have planned with relative ease."

The project, Raymond said, is "a great example of how AI can help address common problems." It also represents a fairly unique level of collaboration across public and private sector organizations. 

"These are partners that don't often work together to make scientific research more openly accessible," Raymond said. "That's great precedent."

In addition to the joint project, the Allen Institute for AI is launching an adaptive feed of COVID-19 research to help researchers and the general public stay up to speed on the latest research that's relevant for them. 

The feed uses AI to understand connections between papers, Raymond explained. The initial feed of articles is based on the Allen Institute's sense of relevance, and it adjusts based on which articles a user chooses to read. 
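The Allen Institute has not published the feed's model, but the adjustment it describes resembles classic relevance feedback. A toy sketch, with invented papers and keyword sets, of boosting unread papers that share terms with what a user has already read:

```python
from collections import Counter

# Toy corpus: each paper is represented as a bag of keywords. This is only
# an illustration of relevance feedback, not the Allen Institute's model.
papers = {
    "p1": {"incubation", "period", "covid"},
    "p2": {"vaccine", "trial", "covid"},
    "p3": {"incubation", "period", "symptoms"},
}

def rank_feed(papers, read_ids):
    """Rank unread papers by term overlap with the user's reading history."""
    profile = Counter()
    for pid in read_ids:
        profile.update(papers[pid])
    def score(pid):
        return sum(profile[term] for term in papers[pid])
    unread = [pid for pid in papers if pid not in read_ids]
    return sorted(unread, key=score, reverse=True)

# After reading p1 (about incubation), the feed favors the similar p3.
print(rank_feed(papers, ["p1"]))  # → ['p3', 'p2']
```

Each click shifts the keyword profile, so the ranking keeps adjusting as the user reads, which is the behavior the article describes.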

The feed is available to anyone using the Allen Institute's Semantic Scholar.
