Verizon introduces open-source, big data coronavirus search engine

So much sickness, so much data, so little time. To help make sense of coronavirus research, Verizon Media has built CORD-19 Search on Vespa, its open-source big data search engine.
Written by Steven Vaughan-Nichols, Senior Contributing Editor

As we struggle to get a grip on exactly how COVID-19 makes us ill and what we can do about it, researchers have published over 50,000 articles. That's a lot of information! So, how do you make sense of it all? Verizon Media is doing it with Vespa, an open-source big data processing engine, which it has used to create a coronavirus academic research search engine: CORD-19 Search.

This engine works on top of the COVID-19 Open Research Dataset (CORD-19). This dataset should help medical researchers find and create new insights in the fight against SARS-CoV-2. The documents within it are updated weekly as new research is published in peer-reviewed publications and in archival services such as bioRxiv (biological sciences preprints) and medRxiv (health sciences preprints). It also includes document links to PubMed, Microsoft Academic, and the WHO COVID-19 database of publications.

What sets it apart from other search engines is that it combines several methods to find the best answers. Vespa combines text and structured search with exploration by semantic similarity using the scibert-nli model, a pre-trained language model for efficiently searching scientific text.
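To make the idea of combining text search with semantic similarity concrete, here is a minimal sketch in Python. It is not Vespa's ranking code and does not use scibert-nli: the term-overlap score, the three-dimensional toy vectors, the `alpha` weight, and every function name are assumptions made purely for illustration.

```python
import math

def keyword_score(query, text):
    """Fraction of query terms found in the text (toy stand-in for text ranking)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & t_terms) / len(q_terms)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank docs by a weighted sum of keyword match and embedding similarity."""
    scored = []
    for doc in docs:
        score = (alpha * keyword_score(query, doc["title"])
                 + (1 - alpha) * cosine(query_vec, doc["vec"]))
        scored.append((score, doc["title"]))
    return sorted(scored, reverse=True)

# Hypothetical documents with made-up embeddings (a real system would
# compute these with a language model such as scibert-nli).
DOCS = [
    {"title": "coronavirus transmission in hospitals", "vec": [0.9, 0.1, 0.0]},
    {"title": "airborne spread of respiratory pathogens", "vec": [0.8, 0.2, 0.1]},
    {"title": "crop yields under drought", "vec": [0.0, 0.1, 0.9]},
]

ranked = hybrid_rank("coronavirus transmission", [1.0, 0.0, 0.0], DOCS)
for score, title in ranked:
    print(f"{score:.3f}  {title}")
```

In this toy data the second document shares no words with the query, yet it still outranks the unrelated one because its vector points in nearly the same direction as the query's. That is the kind of result a pure keyword search would miss.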

Usually, Verizon uses Vespa for applications such as article recommendations, user personalization, and ad targeting. Now, by keyword indexing COVID-19 articles, it makes searching that flood of research much easier for researchers.

More technically advanced researchers can access the data via the CORD-19 application programming interface (API). If you want, you can even download the code and run the application on your own server.
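For readers curious what an API request might look like, here is a hedged sketch that only builds a query URL and sends nothing over the network. Vespa's HTTP query interface accepts `query` and `hits` parameters on a `/search/` path, but whether the CORD-19 deployment exposes exactly this form is an assumption, and the host name below is a made-up placeholder, not the service's real address.

```python
from urllib.parse import urlencode

# Hypothetical placeholder host; the real API address is not given here.
BASE_URL = "https://cord19-search.example"

def build_search_url(query, hits=10):
    """Build a Vespa-style search URL from a free-text query and a result count."""
    params = {"query": query, "hits": hits}
    # urlencode escapes spaces and special characters in the query text.
    return f"{BASE_URL}/search/?{urlencode(params)}"

url = build_search_url("coronavirus incubation period", hits=5)
print(url)
```

If you run the application on your own server, you would point `BASE_URL` at that machine instead.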

This is very much a work in progress. You can expect daily updates to the documentation and query features. Verizon welcomes your help on both the code and the data. Check out its contributing guide for how you can help. You can also reach the project's developers on Twitter at @vespaengine.