Out of the Hadoop box: SQL everywhere and AtScale

AtScale has made a name for itself by providing an access layer on top of Hadoop that enables it to be used directly as a data warehouse. AtScale is now announcing support for Teradata DW and Google Dataproc and BigQuery, offering what it calls a Unified Analytics Platform. Why this move now, how does it work and what does it mean?
Written by George Anadiotis, Contributor

You may not realize it, but Hadoop has already been around for 10 years. Even now, with most organizations having in one way or another adopted it, not everything about it is obvious and clear. But when it first came out from Yahoo in 2006, Dave Mariani, AtScale's co-founder and CEO, was one of the first to use it and realize its potential.

He was in the right place at the right time: Mariani was doing analytics at Yahoo, delivering data to drive business insights and advertising on the company's assets. Data warehouses (DW) and cubes were pretty much the only game in town for analytics then, and a big game too. Mariani, a data cube veteran with numerous implementations under his belt, mentioned that "a single one of these cubes at Yahoo could drive revenue in the area of 50 million dollars".

Mariani, like most industry experts today, realized that Hadoop could revolutionize the data industry thanks to three properties: a shared-nothing architecture that lets it scale out in a seamless, cost-effective way; a framework on which ETL and processing jobs can run; and late binding, also known as schema on read. He realized this earlier than most, or at least he acted upon it earlier.

At Yahoo, as well as at Klout, which Mariani joined after Yahoo, Hadoop was heavily used, but the BI landscape was what it had always been: fragmented, with a plethora of tools ranging from Excel to MicroStrategy. At that time, the only way for those tools to use the data stored in Hadoop was to take the data out of Hadoop and store it in a DW. Then SQL-on-Hadoop came along, Cloudera set out to release Impala, Mariani was recruited, and the rest is history.

Eventually, Mariani set out to implement his own vision: to let users access data in Hadoop as painlessly as possible. The vehicle was AtScale, with Yahoo and Cloudera onboard as investors and clients. AtScale deliberately refrained from offering a data navigation and visualization layer. The company's thinking was that it could not, and would not try to, displace tools already used for this purpose. Instead, it chose to act as vendor-neutral middleware that facilitates access to data stored in Hadoop over SQL and MDX. This architecture is based on three pillars.


AtScale's architecture is designed to let users access data in back-end systems seamlessly, using their BI tool of choice. Image: AtScale

Design, Cache, Query

First, the Design Center. AtScale describes this as the canvas for painting virtual cubes. This tool lets users navigate data stored in Hadoop and define metadata that can in turn be used to define dimensions for virtual OLAP cubes. It's a collaborative, multi-user tool, so users can complement each other's knowledge.

In addition to effectively acting as a schema definition mechanism, it also supports data governance by means of access rules and security. AtScale calls this a Universal Semantic Layer, in which business logic can be defined centrally and deployed instantly, regardless of which BI tools people use.
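To make the idea of a virtual cube more concrete, here is a minimal Python sketch. It is not AtScale's API or data model: the `VirtualCube` class, the `to_sql` function, and the `sales_events` table with its columns are all hypothetical, made up for illustration. The sketch only shows the general principle described above: the cube is metadata that maps business names to columns and aggregates, and a request for dimensions and measures is translated into ordinary SQL at query time instead of being answered from a pre-built cube.

```python
from dataclasses import dataclass


@dataclass
class VirtualCube:
    """Hypothetical cube metadata: nothing is copied or pre-aggregated."""
    fact_table: str              # underlying table, e.g. one stored in Hadoop
    dimensions: dict[str, str]   # business name -> column expression
    measures: dict[str, str]     # business name -> aggregate expression


def to_sql(cube: VirtualCube, dimensions: list[str], measures: list[str]) -> str:
    """Translate a cube-style request into a plain SQL aggregate query."""
    unknown = [n for n in dimensions if n not in cube.dimensions]
    unknown += [n for n in measures if n not in cube.measures]
    if unknown:
        raise KeyError(f"not defined in the cube: {unknown}")
    if not measures:
        raise ValueError("at least one measure is required")

    dim_cols = [f'{cube.dimensions[n]} AS "{n}"' for n in dimensions]
    measure_cols = [f'{cube.measures[n]} AS "{n}"' for n in measures]
    sql = f"SELECT {', '.join(dim_cols + measure_cols)} FROM {cube.fact_table}"
    if dimensions:
        # Group by the underlying column expressions, not the aliases.
        sql += " GROUP BY " + ", ".join(cube.dimensions[n] for n in dimensions)
    return sql


# Illustrative definition only; table and column names are assumptions.
sales = VirtualCube(
    fact_table="sales_events",
    dimensions={"Region": "region", "Year": "year(sold_at)"},
    measures={"Revenue": "SUM(amount)", "Orders": "COUNT(*)"},
)

print(to_sql(sales, ["Region"], ["Revenue"]))
```

Because the business definitions live in one place, changing what "Revenue" means in the metadata changes the SQL generated for every client, which is the point of defining logic centrally.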

Virtual cubes sound cool, but what about performance? There's a reason why cubes in a traditional DW are pre-calculated, after all. This is where the Adaptive Cache comes in. The second layer in AtScale's architecture is a caching mechanism that applies intelligent strategies not only to keep the most recently and heavily used data on hand for faster subsequent access, but also to predict which data is likely to be used in the future and fetch it preemptively.
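The article does not describe how the Adaptive Cache makes its decisions, so the following Python sketch is only a toy illustration of the two behaviors mentioned: keeping recently used results, and prefetching what is likely to be asked next. The `PrefetchingCache` class, its eviction rule (least recently used only, ignoring how heavily an entry is used) and its predictor (the query that most often followed the current one) are assumptions chosen for simplicity, not AtScale's algorithm.

```python
from collections import Counter, OrderedDict, defaultdict


class PrefetchingCache:
    """Toy cache: LRU eviction plus a naive 'what usually comes next' predictor."""

    def __init__(self, fetch, capacity=2):
        self.fetch = fetch                    # slow back-end lookup (assumed callable)
        self.capacity = capacity              # use at least 2 so prefetch cannot evict the current key
        self.entries = OrderedDict()          # key -> result, least recently used first
        self.follows = defaultdict(Counter)   # key -> counts of keys requested right after it
        self.previous = None
        self.hits = 0
        self.misses = 0

    def _store(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)
            value = self.entries[key]
        else:
            self.misses += 1
            value = self.fetch(key)
            self._store(key, value)

        # Learn from history: `key` was requested right after `previous`.
        if self.previous is not None:
            self.follows[self.previous][key] += 1
        self.previous = key

        # Predict the most common follow-up and fetch it ahead of time.
        if self.follows[key]:
            likely_next, _ = self.follows[key].most_common(1)[0]
            if likely_next not in self.entries:
                self._store(likely_next, self.fetch(likely_next))
        return value


def slow_query(key):
    """Stand-in for an expensive back-end query (hypothetical)."""
    return f"result for {key}"


cache = PrefetchingCache(slow_query, capacity=2)
for name in ["by_region", "by_store", "by_day", "by_region"]:
    cache.get(name)

# "by_store" followed "by_region" earlier, so it was prefetched after the last call.
print("by_store" in cache.entries)
```

A production cache would also weigh how often and how expensively each result is computed; the sketch leaves that out to keep the prediction step visible.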

AtScale argues that even physical cubes start to break down at large cardinalities or numbers of dimensions, and claims that virtual cubes perform just as well or even better. The company cites an example in which a query on a virtual cube with over 500 billion rows returned results in under a couple of seconds.

Last but not least comes the closest AtScale gets to a user-facing interface: the aptly named Hybrid Query Service (HQS), a query layer that supports both SQL and MDX. HQS supports JDBC, which means that effectively any ANSI SQL client can connect via AtScale over JDBC to query data residing in Hadoop. AtScale has partnerships and certifications in place for products like Tableau, Qlik and PowerBI, chosen on the basis of user base and requirements, as well as with all major Hadoop distribution vendors.


As AtScale's CEO Dave Mariani puts it, "If nobody can interact with your cluster, Hadoop is just a white elephant." Now the elephant is out of the box, rolling with the times.

Out of the Hadoop box

What's new is that AtScale now goes beyond Hadoop (in the cloud or on-premises), offering support for Teradata DW, Google Dataproc and BigQuery. According to AtScale's founders, this was part of their vision all along, and customers have been asking for it too. That vision was initially met with scepticism while raising capital for AtScale's Series A, but things are much easier now, as the company was recently able to complete a Series B of about US$11 million.

AtScale's strategy of acting as the middleman seems to be paying off, as it lets the company capitalize on improvements in the SQL engines it relies upon. These engines have been taking off, having been measured to run 2 to 3 times faster than their earlier versions.

AtScale has applied the "decouple everything" paradigm that Hadoop brought to the storage world by adding its own data definition and query optimization layer on top of storage, be it Hadoop or something else. The roadmap includes support for even more storage engines.

Is this the story of TOAD playing out in the brave Hadoop world and beyond? Like TOAD, AtScale started out with a modest vision - to make the lives of people working with data easier, on Oracle and Hadoop respectively. Like TOAD, AtScale has been seeing growing adoption (listing clients like Macy's, Comcast and GlaxoSmithKline) and is expanding beyond its initial niche.

TOAD and AtScale even overlap somewhat now, as TOAD offers support for SQL-on-Hadoop too - albeit without all the extras that AtScale brings to the table. It looks like Hadoop is out of the box, and in a converging database world, that should come as no surprise.
