Exploring data using natural language ("plain English") query expressions isn't a new concept, but it has become more relevant and more feasible lately. People are used to search engines and like the metaphor as a data querying experience. Products like ThoughtSpot and AnswerRocket specialize in this teaming of search and data discovery. And the Q&A feature of Microsoft Power BI enables this, both for ad hoc queries in dashboards and even for use as an authoring tool when designing reports.
Many natural language analytics products, however, require data to be moved into their own repositories or index structures. But today, Arcadia Data is announcing a new Search feature, in the latest release of its Arcadia Enterprise product, that adapts the natural language query paradigm to work directly on top of data lakes.
The lowdown
In a phone briefing with Sushil Thomas, Arcadia Data's Founder & CEO, and Steve Wooledge, the company's VP of Marketing, I learned that the Arcadia Data Search feature works on top of Hadoop-based data lakes as well as cloud data lakes that reside in Amazon S3 and Microsoft's Azure Data Lake Store (ADLS).
Once Arcadia is connected to the lake, users can type in search expressions like "show me the states with the highest population in 1910" and get results back in the form of data visualizations. This works both for individual searches and within dashboards (as shown in the figure at the top of this post).
Covering the edge cases
Executing such queries over data lakes requires graceful handling of certain ambiguities:
- The same query may apply to more than one data set in the data lake. In this case, Arcadia Data will apply its own scoring algorithm, querying the data set it feels is most applicable, but listing clickable options for the others (see figure below). Users who pick one of the alternate data sets will implicitly influence the scoring algorithm to favor that data set more in subsequent searches.
- Certain data sets or columns within a data set may not be appropriate for search-based query. To mitigate these difficulties, Arcadia Data allows administrators to specify which tables, and which columns within them, are searchable.
- For those columns that are searchable, the words used in a natural language query may not match those columns' names verbatim. To handle this quandary, Arcadia Data allows a list of synonyms to be entered for each searchable column.
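To make the three ideas above concrete, here is a minimal Python sketch of how data set scoring, searchable-column flags, and per-column synonyms could fit together. It is an illustration only, not Arcadia Data's actual algorithm: the `Column`, `DataSet`, `rank`, and `record_choice` names, the hit-counting score, and the `boost` value are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    searchable: bool = True  # hypothetical flag an administrator would set
    synonyms: set = field(default_factory=set)

    def matches(self, token: str) -> bool:
        # Only searchable columns can match a query word,
        # either by exact name or by one of the listed synonyms.
        return self.searchable and token in ({self.name} | self.synonyms)


@dataclass
class DataSet:
    name: str
    columns: list
    boost: float = 0.0  # grows when users pick this data set manually

    def score(self, tokens: list) -> float:
        # Assumed scoring rule: count query words that hit a column, plus boost.
        hits = sum(any(c.matches(t) for c in self.columns) for t in tokens)
        return hits + self.boost


def rank(datasets: list, query: str) -> list:
    tokens = query.lower().split()
    # Best candidate first; the rest would be offered as clickable alternatives.
    return sorted(datasets, key=lambda d: d.score(tokens), reverse=True)


def record_choice(dataset: DataSet, step: float = 0.5) -> None:
    # Implicit feedback: favor the data set the user picked in later searches.
    dataset.boost += step


census = DataSet("census", [Column("state"),
                            Column("population", synonyms={"people"}),
                            Column("year")])
sales = DataSet("sales", [Column("state"),
                          Column("revenue"),
                          Column("ssn", searchable=False)])

ranked = rank([census, sales], "state population")
```

In this toy setup, `census` ranks first for "state population" because two of its searchable columns match, while `sales` matches only one. Calling `record_choice(sales, step=1.5)` would then push `sales` ahead on the same query, which mimics the feedback effect described in the first bullet.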
Arcadia's Search feature provides other niceties. For example, as query expressions are entered, auto-complete suggestions are provided (this may include entire search expressions presented as suggestions after only a single word is entered in the search box). Results are rendered using what Arcadia Data determines to be the most appropriate visualization type, but users may specify the viz type they'd like within the search expression itself.
It's probably important to point out that although Arcadia has named this new feature "Search," it does not rely on special search indexes, and it doesn't use technologies like Solr/Lucene or Elasticsearch. Instead, Arcadia is really providing a natural language abstraction layer that converts the entered expression into the corresponding query in SQL, or another native language (depending on the data set's origin and format). Although Arcadia does create its own style of OLAP cube under the hood to accelerate some queries, the data in the lake is being queried natively, and no indexing or ELT is required.
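As a rough illustration of what "converting the entered expression into SQL" can mean, the Python sketch below maps the example search from earlier onto a SQL string using simple keyword rules. It is a toy under stated assumptions, not how Arcadia's layer works: the `census` table, the `year` column, the `SYNONYMS` map, and the `to_sql` function are all hypothetical.

```python
import re

# Hypothetical schema; table and column names are assumptions for illustration.
TABLE = "census"
SYNONYMS = {"states": "state", "state": "state", "population": "population"}


def to_sql(query: str, limit: int = 10) -> str:
    # Split the search expression into lowercase words and four-digit numbers.
    words = re.findall(r"[a-z]+|\d{4}", query.lower())

    # Map words to known column names, keeping first-seen order, no duplicates.
    columns = []
    for word in words:
        col = SYNONYMS.get(word)
        if col and col not in columns:
            columns.append(col)
    if not columns:
        raise ValueError("no searchable column matched the query")

    sql = f"SELECT {', '.join(columns)} FROM {TABLE}"

    # Treat the first number found as a filter on the assumed 'year' column.
    year = next((w for w in words if w.isdigit()), None)
    if year:
        sql += f" WHERE year = {int(year)}"

    # "highest" is read as: sort descending on the last matched column.
    if "highest" in words and len(columns) > 1:
        sql += f" ORDER BY {columns[-1]} DESC"

    return sql + f" LIMIT {limit}"


sql = to_sql("show me the states with the highest population in 1910")
# sql == "SELECT state, population FROM census WHERE year = 1910 ORDER BY population DESC LIMIT 10"
```

A real natural language layer would need proper parsing, schema awareness, and a different target dialect per data source. The point here is only that the lake is queried with an ordinary generated query rather than through a separate search index.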
Search engine as data catalog
If you think about it, Arcadia Data's Search feature addresses many of the same data lake use cases as do data catalog-driven query tools. The idea in both cases is to make data in the lake more discoverable, providing a self-service query experience for business users not familiar with each data set and its schema.
The data catalog approach works in a top-down fashion: first find the data set you need and then craft the query against it. Arcadia Data's Search feature is more bottom-up: say what you want to see, and the data set will be selected and the query crafted for you. Both approaches are valid and either one may be preferable, depending on circumstances.
But sometimes an imperative command is faster and easier than a browsing experience. For business users looking to get off the "blank page" and start getting real use out of their data lakes, Arcadia Data has a great solution. Once users have their bearings, they may wish to use a data catalog to help them explore their data lakes more comprehensively. There are strong synergies in using both.