As we struggle to get a grip on exactly how COVID-19 makes us ill and what we can do about it, researchers have published over 50,000 articles. That's a lot of information! So, how do you make sense of it all? Verizon Media is doing it by using Vespa, an open-source big data processing engine, to create a coronavirus academic research search engine: CORD-19 Search.
This engine works on top of the COVID-19 Open Research Dataset (CORD-19). This dataset should help medical researchers find and create new insights in the fight against SARS-CoV-2. The documents within it are updated weekly as new research is published in peer-reviewed publications and archival services such as bioRxiv (biological sciences preprints) and medRxiv (health science preprints). It also includes document links to PubMed, Microsoft Academic, and the WHO COVID-19 database of publications.
What sets it apart from other search engines is that it combines several methods to find the best answers. Vespa pairs text and structured search with exploration by semantic similarity, using the scibert-nli model, a pre-trained language model for efficiently searching scientific text.
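To give a rough feel for what combining text search with semantic similarity means, here is a minimal Python sketch. It is not Vespa's ranking code: the function names, the tiny three-number "embeddings," the sample titles, and the 50/50 weighting (`alpha`) are all assumptions made up for illustration. A real system would get its vectors from a model such as scibert-nli.

```python
import math
from collections import Counter

def keyword_score(query, text):
    """Fraction of query terms found in the text (a crude stand-in for text ranking)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = Counter(text.lower().split())
    return sum(1 for term in terms if words[term] > 0) / len(terms)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Blend keyword and semantic scores; alpha is a hypothetical weighting."""
    scored = []
    for doc in docs:
        score = (alpha * keyword_score(query, doc["title"])
                 + (1 - alpha) * cosine(query_vec, doc["embedding"]))
        scored.append((score, doc["title"]))
    return sorted(scored, reverse=True)

# Toy documents with invented embeddings, for illustration only.
DOCS = [
    {"title": "Coronavirus transmission in hospitals", "embedding": [0.9, 0.1, 0.0]},
    {"title": "Airborne spread of respiratory pathogens", "embedding": [0.8, 0.2, 0.1]},
    {"title": "Crop yields under drought", "embedding": [0.0, 0.1, 0.9]},
]

for score, title in hybrid_rank("coronavirus transmission", [1.0, 0.0, 0.0], DOCS):
    print(f"{score:.3f}  {title}")
```

In this toy example the second title shares no words with the query, yet it still ranks above the unrelated one because its vector points in a similar direction. That is the kind of result a keyword-only search would miss.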
Usually, Verizon uses Vespa for applications such as article recommendations, user personalization, and ad targeting. Now, by keyword indexing the flood of COVID-19 articles, it makes searching them much easier for researchers.
More technically advanced researchers can access the data via the CORD-19 application programming interface (API). If you want, you can even download the code and run the application on your own server.
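As a hedged illustration of what programmatic access might look like, the Python sketch below only builds a search URL and does not contact any server. The host name is a placeholder, and the `query` and `hits` parameters are assumptions modeled on Vespa's general query API; check the project's own documentation for the actual endpoint and supported parameters.

```python
from urllib.parse import urlencode

# Placeholder endpoint, not the real service address.
BASE_URL = "https://cord19-search.example/search/"

def build_query_url(text, hits=5, base_url=BASE_URL):
    """Return a search URL for the given free-text query.

    The parameter names are assumed for illustration.
    """
    params = {"query": text, "hits": hits}
    return base_url + "?" + urlencode(params)

print(build_query_url("coronavirus transmission"))
```

Fetching that URL with any HTTP client would then, under these assumptions, return the matching documents in a machine-readable format.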
This is very much a work in progress. You can expect daily updates to the documentation and query features. Verizon welcomes your help on both the code and the data. Check out its contributing guide to see how you can help. You can also reach the project's developers on Twitter at @vespaengine.