Science is one of the pillars upon which modern society is built. The scientific method is what underlies many achievements, including technology and data-driven decision making. This, however, does not mean science is without its own issues.
Producing results in the fight against SARS-CoV-2 (the coronavirus) is one of the most pressing issues today, and it has brought the entire scientific community together. Addressing issues related to scientific research can help produce results under pressure.
ZDNet connected with two prominent researchers to discuss how they are using the state of the art in analytics and AI, namely graph analytics and knowledge graphs, to facilitate scientific research for the COVID-19 pandemic.
Scientific data is unFAIR, and that hinders COVID-19 research
"Especially in life science, we have highly connected data, very heterogeneous data, and the entities are connected in a very complex way. And GDPR regulations make working with data a bit more complicated," said Dr. Jarasch.
Jarasch pointed out that the coronavirus causes an infectious disease, which makes it especially complex. Each virus has its own strategy for getting into a cell, reproducing, and infecting other cells. Research has to go on, as not enough experiments are available yet. Many events in this disease are still unknown because there is not enough data. Because of the way the virus replicates and mutates, developing a vaccine can be really complicated:
"There is no one drug that will likely save us from everything. There are many different drugs on many different patient groups that respond to one or the other treatment. I wouldn't recommend blindly running any algorithm on any data. The number of data points and dependencies between data points is too high for humans to cope with.
That's why you need computer assisted analysis or AI or other machine learning algorithms in order to analyze the data. Graph enables a new dimension of data analysis by helping us connect highly heterogeneous data from various disciplines. We need to identify connections in our graph to get new hypotheses and new evidence for one or the other problem."
Dr. Jarasch is involved in the COVID GRAPH project. This is a voluntary initiative of graph enthusiasts and companies aiming to build a knowledge graph with relevant information about COVID-19 and the SARS-CoV-2 virus. As he pointed out, it includes about 44,000 publications, mostly from pre-print servers:
"This is a good example, because nobody can ever read all these papers, understand them, analyze them, and bring them together in a way that makes sense. Then we have coronavirus relevant patents, case studies, genes, functions, molecular data, and each and every day there are more data sources to be integrated."
COVID GRAPH brings together a diverse team of scientists, developers, data scientists, as well as more than seven companies. It's mainly intended for scientists in healthcare or life science, but it can also be of interest to others. It's publicly available, free of charge, and soon, it could also help scientists studying other diseases potentially linked to the coronavirus.
The goal is to provide sources of information that are connected via the fundamental entities in the biomedical domain: genes, proteins, and their functions. Bringing siloed data together can uncover previously unnoticed connections, and this is where knowledge graphs offer advantages.
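To make the idea concrete, here is a minimal sketch of how records from separate silos can be linked once they refer to the same fundamental entities. It is not COVID GRAPH's implementation: the record IDs, gene names, and the `mentions_genes` field are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical records from three separate sources; every identifier is made up.
papers = [
    {"id": "paper:1", "mentions_genes": ["GENE_A", "GENE_B"]},
    {"id": "paper:2", "mentions_genes": ["GENE_C"]},
]
patents = [
    {"id": "patent:7", "mentions_genes": ["GENE_B"]},
]
trials = [
    {"id": "trial:3", "mentions_genes": ["GENE_A", "GENE_C"]},
]


def build_gene_index(*sources):
    """Map each gene to the set of records, from any source, that mention it."""
    index = defaultdict(set)
    for source in sources:
        for record in source:
            for gene in record["mentions_genes"]:
                index[gene].add(record["id"])
    return index


def linked_records(record_id, index):
    """Return records sharing at least one gene with record_id, plus the shared genes."""
    links = defaultdict(set)
    for gene, record_ids in index.items():
        if record_id in record_ids:
            for other in record_ids - {record_id}:
                links[other].add(gene)
    return dict(links)


gene_index = build_gene_index(papers, patents, trials)
print(linked_records("paper:1", gene_index))
```

The paper, the patent, and the trial never reference each other directly. The connection only appears because all three point to shared gene entities, which is the kind of link that stays hidden while the sources remain in separate silos.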
Making data FAIR with Knowledge Graphs
Making data FAIR is key in facilitating scientific research in general, and coronavirus research in particular. This is also a key goal of the Open Research Knowledge Graph (ORKG) project. ORKG aims to describe research papers in a structured manner, making them easier to find and compare.
Dr. Auer identified two key issues in scientific research. First, integrating and semantically representing heterogeneous data about patients, diseases, drugs, clinical trials, etc. Second, representing the state-of-the-art from papers in a more comparable and reproducible fashion.
Once data and findings are represented this way, the effort required for preparing and integrating data to answer specific research questions is dramatically reduced, and AI techniques can be readily applied. ORKG focuses on representing scientific contributions from papers semantically. This makes it easier to compare the differences and similarities of different approaches, by juxtaposing them in tabular views or domain-specific visualizations.
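A rough sketch of what such a tabular view involves: once contributions are described as structured properties rather than free text, lining them up is mostly a matter of pivoting. The paper names and properties below are invented, and the function is an illustration, not ORKG's actual comparison feature.

```python
# Hypothetical structured descriptions of two papers' contributions.
contributions = [
    {"paper": "Paper A", "method": "cohort study", "sample_size": 120},
    {"paper": "Paper B", "method": "simulation", "region": "Europe"},
]


def comparison_table(items, key="paper"):
    """Return (header, rows): one row per property, one column per paper."""
    # Collect properties in first-seen order so the table layout is stable.
    properties = []
    for item in items:
        for prop in item:
            if prop != key and prop not in properties:
                properties.append(prop)
    header = ["property"] + [item[key] for item in items]
    # A "-" marks a property that a paper does not describe.
    rows = [[prop] + [str(item.get(prop, "-")) for item in items] for prop in properties]
    return header, rows


header, rows = comparison_table(contributions)
for line in [header] + rows:
    print(" | ".join(line))
```

Gaps show up immediately in this layout: a property described for one paper but missing for another becomes an empty cell, which is hard to spot when reading the papers one by one.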
In COVID GRAPH, too, there are two aspects. One is the database itself, which stores the data that is connected. There is also a GUI through which users can query and investigate data. Having the result from a query is just the beginning for interactive browsing and discovering new things that are connected with the result.
Knowledge graphs can be stored in any back end, from files to relational databases or document stores. But since they are, well, graphs, it does make sense to store them in a graph database. This greatly facilitates storage and retrieval, as graph databases offer specialized structures, APIs, and query languages tailored for graphs.
Graph databases come in two main flavors, depending on which graph model they support: property graph and RDF. In general, RDF graph databases emphasize semantics and interoperability, while property graph databases emphasize ease of use and performance.
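The difference between the two models is easier to see with the same fact written both ways. The sketch below uses plain Python structures rather than a real graph database; the node IDs, property names, and the `http://example.org/` namespace are placeholders chosen for illustration.

```python
# Property graph style: nodes and edges carry key-value properties directly.
nodes = {
    "g1": {"label": "Gene", "symbol": "GENE_A"},
    "p1": {"label": "Protein", "name": "Protein A"},
}
edges = [
    {"source": "g1", "target": "p1", "type": "CODES_FOR", "evidence": "curated"},
]

# RDF style: every statement is a (subject, predicate, object) triple,
# and things are named with URIs so that other datasets can refer to them.
EX = "http://example.org/"  # placeholder namespace, not a real vocabulary
triples = {
    (EX + "g1", EX + "type", EX + "Gene"),
    (EX + "g1", EX + "symbol", "GENE_A"),
    (EX + "p1", EX + "type", EX + "Protein"),
    (EX + "p1", EX + "name", "Protein A"),
    (EX + "g1", EX + "codesFor", EX + "p1"),
}


def edges_where(edge_list, **conditions):
    """Property graph query: keep edges whose fields match all conditions."""
    return [e for e in edge_list if all(e.get(k) == v for k, v in conditions.items())]


def match(triple_set, s=None, p=None, o=None):
    """RDF-style query: match a triple pattern, with None acting as a wildcard."""
    return sorted(
        t for t in triple_set
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


curated = edges_where(edges, type="CODES_FOR", evidence="curated")
coded = match(triples, s=EX + "g1", p=EX + "codesFor")
```

Note that the `evidence` property sits naturally on the edge in the property graph version, while the plain triple set has no slot for it; attaching a statement about a relationship in RDF takes additional modeling. In exchange, the URI-based names in the triples are what make the data straightforward to link with other sources.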
Auer and Jarasch not only eagerly agreed to provide an overview of their efforts, but they are also making a joint appearance in an online Meetup to elaborate further. There is a common goal (facilitating scientific research for COVID-19) and a common approach (using graph analytics and knowledge graphs). The focus is on describing and structuring publications semantically.
As Dr. Jarasch noted, a property graph is a little bit different from a knowledge graph, in the sense that in a property graph you store properties on nodes and edges that you can query. In a knowledge graph, you can integrate more knowledge by creating new relationships between nodes that have specific evidence attached to them.
He went on to say:
"COVID GRAPH is, I would say, a little bit of both. It's more a knowledge graph than a property graph, but since we are integrating fundamental entities like genes, proteins and transcripts and clinical trials, I would also say this is part of a property graph. I would say that the answer is both depending on what you query.
We have the publications and the patents, and some text extracts from different sources. They have to be structured in a way so that you connect the elements that belong together. On the other hand, you divide bigger text chunks into parts that make sense, and then step by step you analyze the texts semantically, annotate them, and connect them to the different entities."
Dr. Auer noted that property graph technology can be a basis for building knowledge graphs:
"We use a property graph as a basis, but equip it with unique URI identifiers, vocabularies as well as RDF export and SPARQL querying facilities. In order to facilitate large scale distributed knowledge integration, we need to build on the W3C semantic technology standards like URIs, RDF, OWL, SPARQL, etc."
ORKG is looking for partners to help develop domain-specific showcases, in particular for virology and epidemiology. The plan is to create domain-specific knowledge observatories, which represent the state of the art in a certain field and allow researchers to get a quick overview. ORKG is open source, open data, and open knowledge, and Dr. Auer noted they are happy to engage in collaborations.
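Dr. Auer's description of a property graph equipped with URI identifiers and RDF export can be illustrated with a small sketch. This is not ORKG's code: the namespace, node IDs, and property names are assumptions, and the function only covers the simplest case of serializing nodes and relationships as N-Triples lines.

```python
BASE = "http://example.org/"  # placeholder namespace (assumption)

# A tiny property graph: node properties plus (source, relation, target) edges.
nodes = {
    "paper1": {"title": "Example paper"},
    "gene1": {"symbol": "GENE_A"},
}
edges = [("paper1", "mentions", "gene1")]


def uri(local_name):
    """Mint a URI reference for a node ID or property name."""
    return f"<{BASE}{local_name}>"


def literal(value):
    """Serialize a value as a plain string literal, escaping special characters."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(nodes, edges):
    """Turn node properties and edges into N-Triples lines."""
    lines = []
    for node_id, props in nodes.items():
        for key, value in props.items():
            lines.append(f"{uri(node_id)} {uri(key)} {literal(value)} .")
    for source, relation, target in edges:
        lines.append(f"{uri(source)} {uri(relation)} {uri(target)} .")
    return lines


for line in to_ntriples(nodes, edges):
    print(line)
```

Once every node and relationship has a stable URI, the exported triples can be loaded into any store that speaks the W3C standards and queried with SPARQL alongside data from other projects. A fuller export would map property names to shared vocabularies instead of minting them from a single base namespace, and would also need a strategy for properties stored on edges, which this sketch leaves out.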
COVID GRAPH is currently integrating more data sources like clinical trials, and connecting entities from potentially related diseases like diabetes, cancer or lung diseases. Other action points are running pattern finding algorithms to find new patterns or relationships, and working more on the GUI and user experience side. There is a public chat forum where you can get involved or contact the team.