Google Dataset Search, a tool originally designed to help researchers locate freely available data online, is now out of beta and has gained new features, the company announced today.
The search feature launched in 2018 as an attempt to aggregate online open-access data, and has now indexed 25 million datasets, according to Natasha Noy, research scientist at Google Research. The content covers information ranging from penguin populations to medical data, and can be used by researchers to test hypotheses, or by scientists to train machine-learning algorithms.
Of course, the tool is also open to casual users. Type in "skiing", for example, and you will find datasets showing the speeds of the fastest skiers, or the revenues of ski resorts.
The new features announced by the company today are mostly intended to simplify the research process for users. Results can now be filtered based on the type of dataset required, such as tables, images or text, or on whether the dataset is free to use. The search engine can also now be used as a mobile application.
Noy highlighted that it is possible – and encouraged – for anyone who holds a dataset to make it discoverable through Google's tool by describing the dataset's properties on their web page using an open standard called schema.org.
When Dataset Search was launched, Google's team had already identified one challenge: finding a simple way to get existing data repositories into the search engine's catalogue, so that users could actually find the data.
At the time, the company put forward schema.org as the solution, describing it as a standard that could be added to any page containing a dataset so that Google could link the page to the Dataset Search engine.
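For publishers wondering what such a description might look like in practice, here is a minimal sketch in Python that builds a schema.org "Dataset" record and wraps it in the JSON-LD script element commonly used to embed this kind of metadata in a web page. The dataset name, description and example.org URLs are invented for illustration, as are the helper function names; the set of properties shown is an assumption about a reasonable minimum, not a statement of what Google requires.

```python
import json


def dataset_jsonld(name, description, url, license_url, download_url, encoding_format):
    """Build a schema.org Dataset description as a plain dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
        "isAccessibleForFree": True,
        # Each distribution describes one downloadable form of the data.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": download_url,
            }
        ],
    }


def as_script_tag(metadata):
    """Wrap the metadata in a JSON-LD script element for pasting into HTML."""
    body = json.dumps(metadata, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset and URLs, for illustration only.
example = dataset_jsonld(
    name="Penguin colony counts",
    description="Yearly counts of penguin colonies (made-up example).",
    url="https://example.org/datasets/penguins",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    download_url="https://example.org/datasets/penguins.csv",
    encoding_format="text/csv",
)

print(as_script_tag(example))
```

The printed snippet would go in the HTML of the page that hosts the dataset. Whether and when a page then appears in Dataset Search is up to Google's crawler.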
"Our ultimate goal is to help foster an ecosystem for publishing, consuming and discovering datasets," said Google.
Although the research team did not disclose how many users had tested the tool, they provided some insights into the type of data that people have been after since 2018. The most common queries, according to Noy, include "education", "weather", "cancer", "crime", "soccer" and… "dogs".
Most of the data linked to the search engine relates to geosciences, biology and agriculture, Noy added. Helpfully, most of the world's governments already use the schema.org standard when publishing open data; the US government alone accounts for two million datasets.
Although Dataset Search is out of beta, Noy said that Google will keep updating the tool. She suggested taking it "for a spin" if you haven't tried it yet – that is, if you aren't already looking up "dogs".