With data science and analytics on the rise and well on their way to being democratized, the ability to find the right data to investigate hypotheses and derive insights is paramount.
What used to be the realm of researchers and geeks is now the bread and butter of an ever-growing array of professionals, organizations, and tools, not to mention self-service enthusiasts.
Even for the most well-organized and data-rich out there, there comes a time when you need to utilize data from sources other than your own. Weather and environmental data is the archetypal example.
Suppose you want to correlate farming data with weather phenomena to predict crop yields, or to study the effect of weather on some phenomenon over a historical period. That kind of historical weather data, almost impossible for any single organization to accumulate and curate, is very likely to be readily available from the likes of NOAA and NASA.
Those organizations curate and publish their data regularly through dedicated data portals. So, if you need their data on an ongoing basis, you are probably familiar with the process of locating it via those portals. Still, you will have to look at both NOAA and NASA, and potentially other sources, too.
And it gets worse if you need more than weather data. You have to locate the right sources, and then the right data at those sources. Wouldn't it be much easier if you could use a single search interface and find everything out there, just like when you Google something on the web? It sure would, and now you can Google your data, too.
Schema.org, metadata, and semantics for the win
That did not come about out of the blue. Google's love affair with structured data and semantics has been an ongoing one. Some landmarks on this path have been the incorporation of Google's knowledge graph via the acquisition of Metaweb, and support for structured metadata via schema.org.
Anyone doing SEO will tell you just how much this has transformed the quality of Google's search and the options content publishers now have available. The ability to mark up content using the schema.org vocabulary, apart from making possible things such as viewing ratings and the like in web search results, is the closest we have to a mass-scale web of data.
This is exactly how it works for dataset discovery, as well. In a research note published in early 2017 by Google's Natasha Noy and Dan Brickley, who also happen to be among the semantic web community's most prominent members, the development was outlined. The challenges were laid out, and a call to action was issued. The key element is, once more, schema.org.
Schema.org is a controlled vocabulary that describes entities in the real world and their properties. When something described in schema.org is used to annotate content on the web, it lets search engines know what that content is, as well as its properties. So what happened here is that Google turned on support for dataset entities in schema.org, officially available as of today.
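To make this concrete, here is a minimal sketch of what such an annotation can look like. Publishers typically embed schema.org/Dataset metadata as JSON-LD in a script tag on the dataset's landing page; the snippet below builds such a block in Python. The dataset name, URLs, and organization are invented placeholders for illustration, not a real published dataset.

```python
import json

# Hypothetical schema.org/Dataset description; all names and URLs are placeholders.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example Daily Weather Summaries",
    "description": "Daily temperature and precipitation summaries.",
    "url": "https://example.org/datasets/daily-weather",
    "keywords": ["weather", "temperature", "precipitation"],
    "creator": {"@type": "Organization", "name": "Example Agency"},
    "temporalCoverage": "1950-01-01/2018-12-31",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    # Each distribution points at an actual downloadable form of the data.
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/datasets/daily-weather.csv",
    }],
}

# Serialize to the JSON-LD a crawler would read from the page.
jsonld = json.dumps(dataset, indent=2)
print(jsonld)
```

A publisher would place this output inside a `<script type="application/ld+json">` element on the dataset's page, which is how a crawler associates the metadata with the content it describes.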
The first step was to make it easier to discover tabular data in search, which uses this same metadata along with the linked tabular data to provide answers to queries directly in search results. This has been available for a while, and now full support for dataset indexing is here.
But is there anything out there to be discovered? How was Google's open call to dataset providers received? ZDNet had a Q&A with Natasha Noy from Google Research about this:
"We were pleasantly surprised by the reception that our call to action found. Perhaps, because we have many examples of other verticals at Google using the schema.org markup (think of jobs, events, and recipes), people trusted that providing this information would be useful.
Furthermore, because the standard is open and used by other companies, we know that many felt that they are doing it because it is 'the right thing to do.' While we reached out to a number of partners to encourage them to provide the markup, we were surprised to find schema.org/dataset on hundreds, if not thousands, of sites.
So, at launch, we already have millions of datasets, although we estimate it is only a fraction of what is out there. Most just marked up their data without ever letting us know."
NOAA's CDO, Ed Kearns, for example, is a strong supporter of this project and helped NOAA make many of its datasets searchable in this tool. "This type of search has long been the dream for many researchers in the open data and science communities," he said. "And for NOAA, whose mission includes the sharing of our data with others, this tool is key to making our data more accessible to an even wider community of users."
Under the hood
In other words, it's quite likely you may find what you are looking for already, and it will be increasingly likely going forward. You can already find data from NASA and NOAA, as well as from academic repositories such as Harvard's Dataverse and the Inter-university Consortium for Political and Social Research (ICPSR), and data provided by news organizations such as ProPublica.
But there are a few gotchas here, as datasets are different from regular web content that you -- and Google -- can read.
To begin with, what exactly is a dataset? Is a single table a dataset? What about a collection of related tables? What about a protein sequence? A set of images? An API that provides access to data? That was challenge No. 1 set out in Google's research note.
Those fundamental questions -- "what is topic X" and "what is the scope of the system" -- are faced by any vocabulary curator and system architect respectively, and Noy said they decided to take a shortcut rather than get lost in semantics:
"We are basically treating anything that data providers mark up with schema.org/dataset as a dataset. What constitutes a dataset varies widely by discipline, and at this point we found it useful to be open-minded about the definition."
That is a pragmatic way to deal with the question, but what are its implications? Google has developed guidelines for dataset providers to describe their data, but what happens if a publisher mis-characterizes content as being a dataset? Will Google be able to tell it's not a dataset and not list it as such, or at least penalize its ranking?
Noy said this is the case: "While the process is not fool-proof, we hope to improve as we gain more experience once users start using the tool. We work very hard to improve the quality of our results."
Speaking of ranking, how do you actually rank datasets? For documents, it's a combination of content (frequency and position of keywords and other such metrics) and network (authority of the source, links, etc). But what would apply to datasets? And, crucially, how would it even apply?
"We use a combination of web ranking for the pages where datasets come from (which, in turn, uses a variety of signals) and combine it with dataset-specific signals such as quality of metadata, citations, etc," Noy said.
So, it seems dataset content is not really inspected at this point. Besides the fact that this is an open challenge, there is another reason: Not all datasets discovered will be open, and therefore available for inspection.
"The metadata needs to be open, the dataset itself does not need to be. For an analogy, think of a search you do on Google Scholar: It may well take you to a publisher's web site where the article is behind a paywall. Our goal is to help users discover where the data is and then access it directly from the provider," Noy said.
First research, then the world?
And what about the rest of the challenges laid out early on in this effort, and the way forward? Noy noted that while they started addressing some, the challenges in that note set a long-term agenda. Hopefully, she added, this work is the first step in that direction.
Identifying datasets, relating them, and propagating metadata among them was a related set of challenges. "You will see," Noy said, "that for many datasets, we list multiple repositories -- this information comes from a number of signals that we use to find replicas of the same dataset across repositories. We do not currently identify other relationships between datasets."
Indeed, when searching for a dataset, if it happens to be found in more than one location, all its instances will be listed. But there is also something else, uniquely applicable to datasets -- at least at first sight. A dataset can be related to a publication, as many datasets come from scientific work. A publication may also come with the dataset it produced, so is there a way of correlating the two?
Noy said some initial steps were taken: "You will see that if a dataset directly corresponds to a publication, there is a link to the publication right next to the dataset name. We also give an approximate number of publications that reference the dataset. This is an area where we still need to do more research to understand when exactly a publication references a dataset."
If you think about it, however, is this really only applicable to science? If you collect data from your sales pipeline, and use them to derive insights and produce periodic reports, for example, isn't that conceptually similar to a scientific publication and its supporting dataset?
If data-driven decision making bears many similarities to the scientific process, and data discovery is a key part of this, could we perhaps see this as a first step of Google moving into this realm for commercial purposes as well?
When asked, Noy noted that Google sees scientists, researchers, data journalists, and others who are interested in working with data as the primary audience for this tool. She also added, however, that as Google's other recent initiatives indicate, Google sees these kinds of datasets becoming more prominent throughout Google products.
Either way, this is an important development for anyone interested in finding data out in the wild, and we expect Google to keep raising the bar in data search in the period to come. First research, then the world?