
Data lakes going the way of the visual spreadsheet?

If you're a spreadsheet kind of person with a ton of data sitting in a data lake, Datameer's new visual exploration feature may be your thing.
Written by George Anadiotis, Contributor


Self-service analytics comes in different shapes and sizes, and so do data lakes. Both are widely popular concepts that have been shaping the big data world, so it's no wonder that a flurry of approaches and tools exists there.

There is also a fair amount of overlap between the two. Hadoop-based data lakes are rather common these days, but that does not make them easy to work with for non-data-science types. So self-service analytics tools make a point of trying to support them as data sources their users can connect to.

This happens through a layer of mediation, typically SQL-based. There are various SQL-on-Hadoop engines around, ranging from proprietary to open source, and each distribution comes with its own.

So, depending on how fast your SQL-on-Hadoop engine is and how big your big data lake is, your mileage on the self-service tool side will vary. Typically, such tools also try to facilitate things on their side, by supporting as many engines as possible, applying smart connection techniques, and so on.

In any case, the whole point of self-service analytics, as opposed to traditional data warehouses, is to skip the data mediation process. That process requires things such as dimension definition and data cube preparation, and therefore a team of people whose job is to work on them.


Accessing Hadoop data lakes as a spreadsheet is an interesting idea -- now taking a visual turn.

Deviant Datameer

The idea in self-service analytics is to let users explore data sources on their own, on the fly and using visual paradigms. There is a wide range of tools in that category, each with its own approach and strengths, and then there are some deviants, too.

Datameer is one of those deviants. Its paradigm for exploration is the spreadsheet. You may argue that the point of using visual tools is to avoid having to go through endless rows and columns, and that this prospect only gets scarier when you are dealing with data at data lake scale.

However, there obviously is a market segment for which this paradigm is useful. Spreadsheets have been around for a long time, and many people have spreadsheet skills. In essence, Datameer's platform gives them a way to not stray too much out of their comfort zone, while offering an alternative to SQL-on-Hadoop.

Datameer lets users connect to a variety of Hadoop distributions on premises or in the cloud, and provides a mechanism for entering declarative spreadsheet formulas that are translated to fully optimized Hadoop jobs.
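To give a feel for what translating a declarative formula into a distributed job can mean, here is a minimal sketch in plain Python. It is not Datameer's formula syntax or engine, neither of which is described here. The function names, the sample data, and the choice of a "sum of sales per region" formula are all assumptions made for illustration. The point is only that a spreadsheet-style group-by sum decomposes naturally into a map step that runs on each partition and a reduce step that combines the results.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition, key_col, value_col):
    """Emit (key, value) pairs for every row in one data partition."""
    for row in partition:
        yield row[key_col], row[value_col]


def reduce_phase(pairs):
    """Sum all values that share the same key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_groupby_sum(partitions, key_col, value_col):
    """Hypothetical 'job' for a formula like: sum of value_col per key_col."""
    mapped = chain.from_iterable(
        map_phase(p, key_col, value_col) for p in partitions
    )
    return reduce_phase(mapped)


# Two partitions stand in for blocks of a file spread across a cluster.
partitions = [
    [{"region": "EU", "sales": 10.0}, {"region": "US", "sales": 5.0}],
    [{"region": "EU", "sales": 2.5}],
]
result = run_groupby_sum(partitions, "region", "sales")
print(result)
```

In a real cluster the map step would run in parallel next to the data and the pairs would be shuffled by key before reduction. The sketch runs everything in one process to keep the shape of the computation visible.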

Datameer also supports ETL and visualization features, and you can export your Datameer spreadsheets to work with CSV, Apache Avro, Parquet and Tableau formats. Now Datameer is adding another feature to its arsenal, called visual exploration.

Visual exploration -- all about fast indexing

This is an interesting move in keeping with the times. It does not give up on the spreadsheet paradigm, but it gives users the ability to visually go through charts summarizing their endless rows and columns.

Users can choose which fields from their datasets they are interested in, and the Visual Explorer will summarize them in charts, offering the option to drill down as well. Then users can decide whether that's an interesting slice of their data for further analysis.

The way it works is by building indexes on the fly, which are then used to calculate a distribution for data points and render them. This is a patent-pending technology from Datameer, and although the specifics were not discussed, some observations can be made.

Datameer emphasizes the hard work that has gone into building this on-the-fly indexing, and for good reason. Indeed, indexing is a key technique for accessing data at this scale efficiently. Indexing also costs a lot in compute and storage, and pre-calculating indexes for every possible exploration a priori is not viable.
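Since Datameer has not disclosed how its indexing works, the following is only a toy sketch of the general idea, not the company's method. It builds an inverted index for a field the first time that field is explored, caches it, and reuses it to compute a value distribution and a drill-down. The class name `OnTheFlyIndex`, its methods, and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict


class OnTheFlyIndex:
    """Toy lazy index: nothing is precomputed until a field is explored."""

    def __init__(self, rows):
        self.rows = rows
        self._indexes = {}  # field name -> {value: [row positions]}

    def index_for(self, field):
        """Build the index for a field on first use, then serve it from cache."""
        if field not in self._indexes:
            idx = defaultdict(list)
            for pos, row in enumerate(self.rows):
                idx[row[field]].append(pos)
            self._indexes[field] = dict(idx)
        return self._indexes[field]

    def distribution(self, field):
        """Count rows per distinct value: the numbers behind a bar chart."""
        return {v: len(pos) for v, pos in self.index_for(field).items()}

    def drill_down(self, field, value, by):
        """Within rows where field == value, count rows per value of `by`."""
        positions = self.index_for(field).get(value, [])
        return dict(Counter(self.rows[p][by] for p in positions))


rows = [
    {"country": "DE", "device": "mobile"},
    {"country": "DE", "device": "desktop"},
    {"country": "US", "device": "mobile"},
    {"country": "DE", "device": "mobile"},
]
explorer = OnTheFlyIndex(rows)
by_country = explorer.distribution("country")
de_devices = explorer.drill_down("country", "DE", by="device")
print(by_country, de_devices)
```

The sketch also shows the trade-off mentioned above: each index costs a full pass over the data plus memory to hold it, which is why building one only for the fields a user actually picks is more practical than indexing everything up front.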


According to Datameer's benchmarks, the new index-based visual paradigm outperforms SQL-on-Hadoop approaches. (Image: Datameer)

Datameer published some results comparing its approach to access via Hive, Spark SQL, Presto, and Amazon Redshift Spectrum, which show Datameer performing and scaling better.

Vendor results should typically be taken with a pinch of salt, and this is no exception. In addition, this announcement is for the beta, which only supports a few chart types.

Going the way of the visual spreadsheet?

Datameer says more chart types will be added by general availability, expected in early 2018. In a discussion, Datameer VP of Product Raghu Thiagarajan pointed out that what is holding Datameer back is not the need to refine its indexing or develop new index types for new charts, but rather the visual representation part.

Indeed, developing self-tuning charts for millions or billions of data points must be hard. But assuming more charts will eventually be there, and the performance gain will indeed be significant, this poses an interesting question.


For the time being, it's a whole lot of bar charts, but Datameer says there are more chart types coming when the visual explorer is made generally available. (Image: Datameer)

If you are a Datameer client, you clearly stand to benefit from the new feature. What's not to like in having a new, clearly more intuitive, and apparently faster way of accessing your data in the environment and paradigm you already use?

The question is: If you are not a Datameer client, is this important enough to make you jump the fence? Chances are, if you have a Hadoop data lake, you also have some way of giving analysts a familiar interface to work with that data.

Whether that is any flavor of SQL-on-Hadoop, or maybe your good old data cubes reinvented, would you give that up to go the way of the visual spreadsheet?

Performance gain and ease of use through a shift toward a more visual paradigm sound tempting. But are they tempting enough to make people give up on SQL? Would they rather keep both side by side, or maybe just wait and push for their SQL-on-Hadoop to catch up?

The answer will be different depending on whether we are talking about greenfield deployments or existing users, how dire their need for speed is, and their existing skills, infrastructure, contracts, budget, and strategy. Taking SQL out of the equation altogether in favor of a visual paradigm may sound interesting, but will this be good enough to sway an entire community?

Won't a body of existing knowledge on SQL indexing and a ton of combined resources eventually enable visual paradigms over SQL to catch up?

It will be interesting to see how well this works for Datameer, and whether the deviant continues to challenge the mainstream.
