Data meets science: Open access, code, datasets, and knowledge graphs for machine learning research and beyond

A new interconnected ecosystem for research is shaping up, and machine learning is just the tip of the iceberg.
Written by George Anadiotis, Contributor

Science and data are interwoven in many ways. The scientific method has lent a good part of its overall approach and practices to data-driven analytics, software development, and data science. Now data science and software lend some tools to scientific research.

Science, data, and data science

"To succeed at becoming a data-driven organization, your employees should always use data to start, continue, or conclude every single business decision, no matter how major or minor".

That quote belongs to Ashish Thusoo, author of the DataOps book, founder of Qubole, and one of the people who built the data-driven culture at Facebook as early as 2007.

As we noted in our 2017 coverage of DataOps, in conversation with Thusoo, this should sound familiar to anyone with a science background. It's the quintessence of the scientific method: developing hypotheses and putting them to the test with data.

It's clear how data-driven culture, and even software practices like agile, which is all about iterative development, have borrowed from science. Now an emergent ecosystem of solutions centered around scientific research and publication may be about to repay the loan.


The interplay between science and data is a long-standing one. Now it's time data repays its debt to science. (Photo by Annie Spratt on Unsplash)

Traditionally, scientific research has relied on peer review. The peer-review and publication process can take anywhere from a few months to a few years to complete. In addition, the business model of many scientific publishers does not make research accessible to everyone.

To make research readily available to as many people as possible as soon as possible, many researchers choose to publish their work on pre-print repositories like Arxiv or Zenodo. Pre-prints solve the open access issues, as they are immediately accessible for free.

The reproducibility crisis and artificial intelligence

Most pre-prints will be revised, in minor or major ways, while others may not be published at all. But even for the ones that do go through the review and publication process successfully, an equally important issue remains: Reproducibility.

Reproducibility is a major principle of the scientific method. It means that a result obtained by an experiment or observational study should be achieved again with a high degree of agreement when the study is replicated with the same methodology by different researchers.

According to a 2016 Nature survey, more than 70% of researchers have tried and failed to reproduce another scientist's experiments, and more than half have failed to reproduce their own experiments.

This so-called reproducibility or replication crisis has not spared artificial intelligence either. Although the writing has been on the wall for a while, 2020 may have been a watershed moment.

That was when Nature published a damning response written by 31 scientists to a study from Google Health that had appeared in the journal earlier.

Critics argued that the Google team provided so little information about its code and how it was tested that the study amounted to nothing more than a promotion of proprietary tech.

As opposed to sometimes obscure research, AI has the public's attention and is backed and capitalized by the likes of Google. Plus, AI's machine learning subdomain with its black box models makes the issue especially pertinent. Hence, this incident was widely reported on and brought reproducibility to the fore.


Reproducible research, code, data, and graphs

Enter Papers with Code, another repository for research. Its stated mission is to create a free and open resource with machine learning papers, code, and evaluation tables. It highlights trending machine learning research and the code to implement it.

Papers with Code was founded by Robert Stojnic and Ross Taylor in 2018. Stojnic and Taylor joined Facebook AI in 2019. Since then, the team has grown, partnered with Arxiv, and expanded to more disciplines.

The latest addition to Papers with Code's arsenal is data. The repository now indexes 3,000+ research datasets from machine learning. Users can now find datasets by task and modality, compare usage over time, and browse benchmarks.

Also, integration with schema.org, and therefore wider discoverability and availability of those datasets via Google's dataset search, seems to be on the roadmap.
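To make the schema.org point concrete, here is a minimal sketch of what a machine-readable dataset description can look like. It is not Papers with Code's implementation. The dataset name, description, and URL are hypothetical placeholders, and the helper function dataset_jsonld is our own illustrative name. Only the schema.org "Dataset" type and its name, description, url, and keywords properties come from the schema.org vocabulary.

```python
import json


def dataset_jsonld(name, description, url, keywords):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
    }


# All values below are made up for illustration.
record = dataset_jsonld(
    name="Example Image Benchmark",
    description="A made-up image classification dataset used for illustration.",
    url="https://example.org/datasets/example-image-benchmark",
    keywords=["image classification", "benchmark"],
)

# Markup like this, embedded in a dataset's web page, is what dataset
# search engines can pick up.
print(json.dumps(record, indent=2))
```

A page that embeds this kind of markup describes its dataset in terms a crawler can parse, which is the route by which schema.org support could lead to wider discoverability.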

As far as reproducible research goes, we should also mention open-source technology by eLife that lets authors publish Executable Research Articles, treating live code and data as first-class citizens. And the good news doesn't end there.

Connected Papers is the latest addition to an emerging ecosystem for research

Another significant boost to research in any domain comes from the ability to find and explore relevant work. We have seen, for example, how knowledge graphs have been used to do precisely that for COVID-19 related research.

Connected Papers is a free visual tool that helps researchers and applied scientists find and explore papers relevant to their field of work, in any domain. It creates a graph for each paper in its repository, by analyzing about 50,000 papers and selecting the few dozen with the strongest connections to the origin paper.

On Feb. 3, Connected Papers also announced a partnership with Arxiv. Now every paper page on Arxiv will link to a Connected Papers graph. Interestingly, Connected Papers arranges papers according to their similarity. That means that even papers that do not directly cite each other can be strongly connected and very closely positioned.
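How can two papers be closely connected without citing each other? One common way to measure that kind of similarity is bibliographic coupling: two papers count as related when their reference lists overlap. The sketch below illustrates that idea only. It is not a description of Connected Papers' actual algorithm. The paper identifiers, the toy data, and the function names coupling_similarity and strongest_connections are all assumptions made for illustration.

```python
def coupling_similarity(refs_a, refs_b):
    """Jaccard overlap of two reference lists (bibliographic coupling)."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def strongest_connections(origin, references, top_k):
    """Rank all other papers by similarity to the origin; keep the top_k."""
    scores = {
        paper: coupling_similarity(references[origin], refs)
        for paper, refs in references.items()
        if paper != origin
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy data: paper -> the works it cites (all identifiers are made up).
references = {
    "A": ["r1", "r2", "r3", "r4"],
    "B": ["r2", "r3", "r4", "r5"],  # never cites A, but shares three references
    "C": ["A", "r9"],               # cites A directly, shares no references
    "D": ["r1", "r7", "r8"],
}

top = strongest_connections("A", references, top_k=2)
print(top)
```

In this toy example, paper B ranks closest to A even though neither cites the other, while paper C, which cites A directly, scores zero on this particular measure. A real system would likely combine several signals and work at a far larger scale.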

The COVID GRAPH and Open Research Knowledge Graph (ORKG) teams have focused on COVID-19, and emphasized annotation and structure, respectively. Connected Papers seems to expand coverage, and emphasize algorithmic similarity.

Towards a better research ecosystem

Open access, discoverability, reproducibility, code, datasets, and knowledge graphs. This is all good news for research, and machine learning research too, obviously. It seems like steps towards a healthier, more productive research ecosystem are being taken.

This is especially true considering how many of these initiatives are either already connected, or can easily be connected. However, there's also one major issue we see connecting all those otherwise commendable efforts: Sustainability. Let's do a quick recap.

Arxiv, which is in many ways a vital hub in this ecosystem, is a community of volunteers supported by staff at Cornell University. Papers with Code is now part of Facebook AI, and the tension between open research and commercial interests is a well-known issue.

Connected Papers started as a weekend side project between friends, and then it got traction. Today, it is self-funded and free to use, with one sponsor that we know of and a call for more sponsors. COVID GRAPH is a volunteer effort, and ORKG is a publicly funded research project.

Those are different ways different teams have found towards what seems like a common goal: A better research ecosystem. Essentially, they are all trying to grapple with the dilemma of how to produce public goods that belong in the Commons, in a challenging, commercially oriented environment.

In principle, that's not very far off from the dilemma open source creators are facing. Significant differences do exist, of course -- we don't expect to see anyone from the research ecosystem getting venture capital funding anytime soon, for example. We do, however, hope to see them live long and prosper.
