Video: Hadoop's creator looks at upcoming tech that will unlock big data
Hadoop disrupted the data landscape, and in some ways became synonymous with big data, by offering a framework for cheap storage and scale-out processing. Parallel to Hadoop came a flurry of NoSQL solutions that also addressed the need for massive storage and processing of data that is not necessarily structured.
Over time, Hadoop evolved into an ecosystem built on HDFS and MapReduce, its storage and processing foundations, including pieces such as a key-value store (HBase) and various SQL-on-Hadoop implementations. NoSQL solutions have also been gradually adding SQL to their arsenal, as SQL is a point of convergence and a de facto industry standard.
While Hadoop started out geared towards analytics, NoSQL solutions come in many flavors and often support both operational applications and analytics. A third type of processing that has become part of the equation is streaming.
Ingesting and processing infinite streams of data in real time is becoming part of everyday operations for many organizations, and solutions have emerged in this space as well. Now the evolution is moving towards unifying these hitherto disparate modes -- transactional operations, analytics and stream processing -- into a common framework.
The evolution of Hadoop has brought on Spark, a new framework and API that builds on Hadoop's ecosystem but brings in-memory processing, SQL and streaming support to the table, among other things. And now Spark is becoming the foundation for convergence of transactional (OLTP), analytical (OLAP) and streaming data processing.
Getting Snappy with it
SnappyData is probably not a name you've heard before unless you are a Spark aficionado, but its approach exemplifies this convergence. SnappyData's open source platform, which has just released its generally available version 1.0, is built on Spark and aims to unify transactional, analytical, and streaming data processing.
A discussion with Sudhir Menon, SnappyData's co-founder and COO, as well as a dive into SnappyData's research publications, shed some light on the company's background and approach. Menon and his co-founders went on a journey from independent vendor, to part of a corporation via acqui-hiring, to intrapreneurs, to entrepreneurs.
SnappyData's team traces its origins back to GemFire, an in-memory data grid solution. Originally a proprietary product developed by GemStone, GemFire was acquired by Pivotal and added to its portfolio, then open sourced and rebranded as Apache Geode.
"When we looked at what customers were trying to do with NoSQL systems on top of Hadoop, we knew that there was an opportunity there. Spark came along at the right time;  obviously there were gaps there that we knew we could fill and that is how we came to build out SnappyData and incubated it inside Pivotal," explains Menon.
SnappyData is a combination of Spark and GemFire. What point is there in combining Spark, which already works in-memory, with GemFire, an in-memory data grid? The answer is that GemFire also happens to be a scale-out transactional store. So by bringing these two together, what you get is an OLTP-OLAP combo that also does streaming and is open source.
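To make the idea of one engine serving both workloads concrete, here is a deliberately small analogy using Python's built-in sqlite3 module: a single in-memory store handles a transactional point update (OLTP) and, immediately afterwards, an analytical aggregate (OLAP) over the same fresh data. This is an illustration of the concept only, not SnappyData's implementation, which does this at cluster scale with Spark SQL over GemFire; the table and values are made up.

```python
import sqlite3

# One in-memory store serving both workloads (illustrative analogy only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, symbol TEXT, qty INTEGER)")
conn.executemany("INSERT INTO trades (symbol, qty) VALUES (?, ?)",
                 [("ACME", 100), ("ACME", 50), ("INIT", 200)])

# OLTP-style point update, committed transactionally
with conn:
    conn.execute("UPDATE trades SET qty = qty + 25 WHERE id = 1")

# OLAP-style aggregate sees the freshly updated data, with no ETL step in between
total = conn.execute(
    "SELECT SUM(qty) FROM trades WHERE symbol = 'ACME'").fetchone()[0]
print(total)  # 175
```

The point of the convergence pitch is exactly the absence of the usual ETL hop between the operational store and the analytics store.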
There are many benefits in this, as managing all your data needs in one framework sounds like the unified field theory of big data: less complexity, better performance, TCO goes down, ROI goes up and everyone lives happily ever after.
It sounds too good to be true, and it is: easier said than done, and SnappyData was not the first to try something similar. Menon says it was a combination of hands-on experience in enterprise practice, software, and data, plus exposure to both GemFire and Spark, that enabled them to go for it.
Fusing Spark as a computational engine with GemFire as a transactional store involved overcoming significant challenges. SnappyData identifies these as the different data structures and query processing paradigms, the different expectations of high availability across workloads, and the need to support interactive analytics when joining streams against massive historical data.
So how did SnappyData cope with these challenges? They created a hybrid cluster manager, used a hybrid row/column data model and added mutability to Spark's immutable data structures (RDDs), wrote a query dispatcher that determines what goes where, added the ability to compute approximate results on the fly, and kept full support for the Spark API.
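One of those pieces, the query dispatcher, can be sketched in a few lines: route transactional mutations and point lookups to the row store, and analytical scans to the column store. The keyword heuristic, function names, and engine labels below are entirely hypothetical, intended only to illustrate the kind of decision such a dispatcher makes, not SnappyData's actual routing logic.

```python
import re

# Hypothetical routing heuristic: aggregates suggest an analytical scan.
AGGREGATES = re.compile(r"\b(SUM|AVG|COUNT|MIN|MAX|GROUP BY)\b", re.IGNORECASE)

def dispatch(sql: str) -> str:
    """Return which storage engine a query should run against."""
    verb = sql.strip().split()[0].upper()
    if verb in ("INSERT", "UPDATE", "DELETE"):
        return "row-store"        # transactional mutation
    if AGGREGATES.search(sql):
        return "column-store"     # analytical scan over many rows
    return "row-store"            # point lookup by default

print(dispatch("UPDATE accounts SET balance = 0 WHERE id = 7"))      # row-store
print(dispatch("SELECT AVG(balance) FROM accounts GROUP BY region")) # column-store
```

A real dispatcher would of course work from a parsed query plan rather than keywords, but the division of labor between row and column representations is the essence of the hybrid model.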
Menon emphasizes that tempting Spark users with the ability to leverage their existing codebase and expertise has been part of the strategy all along, and it should be possible to use SnappyData as a drop-in replacement. If only users knew about it, that is.
SnappyData has reached GA rather unceremoniously, which in itself says something. Of the team of 30 that works at SnappyData now, practically everyone is an engineer. That may not help SnappyData get much air time, but it has enabled the company to reach the GA milestone a little over a year after it was officially spun out of Pivotal.
Menon says that for Pivotal "this was about doing the right thing and getting us enabled and going simply on the merit of the idea." Clearly that helped in getting access to a number of big clients. Menon described how they are using SnappyData in production and getting results, as well as actively contributing to the platform's development.
Not the only one with mixed data motion
So now what? Should you just drop everything and go SnappyData? What about core Spark and other options?
Menon says they have been avid Spark users themselves, and the decision to tie their solution to Spark was a strategic one they carefully weighed. He adds that they have been in touch with Databricks, the commercial entity behind Spark, and they also contribute code to core Spark:
"Spark's focus is to democratize and get SQL and AI driven analytics to mainstream usage for batch, interactive and streaming workloads. They are agnostic to the source of the data and would like Spark to work well with every data source out there.
For users however, there are a number of workloads and situations where the ability to colocate data with processing provides huge advantages and boosts in performance and when the compute and data are not collocated, we still offer massive latency, concurrency and performance benefits to end user applications."
That sounds like a co-opetition relationship. On the one hand, SnappyData brings strength to Spark's codebase and community, and although it's too early to tell, parts of its approach may well make it into Spark in the future.
On the other hand, although SnappyData's offering is new and lacks, for example, the option to run as a managed service that Databricks brings to the table, SnappyData may well sway Spark users.
We reached out to Databricks for comment, but did not get a response by the time of writing. It will be interesting to see how Databricks and the Spark community react in the coming period, however, as word has unofficially circulated that a couple of pain points for Spark are being addressed.
As for other options? Hadoop vendors like Cloudera and MapR have operational database offerings in Kudu and MapR-DB. Kafka has recently added SQL and data processing to its capabilities. In-memory databases like GridGain are potential players in this convergence space too.
The one most closely resembling SnappyData's approach, however, is Splice Machine. Splice Machine also builds on Spark, aiming to unify OLTP, OLAP, and streaming, and is open source. But there are significant differences between the two approaches too.
Splice Machine builds on HBase. There are already a number of custom implementations in which Spark is used in conjunction with HBase, Cassandra or MemSQL. Monte Zweben, Splice Machine's CEO, points out that such integrations require moving data back and forth, as opposed to Splice Machine's native HFile interface to Spark.
Zweben says this is an efficient mechanism to create base DataFrames for complex computations, with Snapshot Isolation semantics built in to maintain ACID transactional properties.
He also emphasizes Splice Machine's data ingestion performance, leveraging a fast bulk-ingestion tool; its adherence to ACID properties, so that indexes are atomically updated; and its ability to maintain constraints and triggers. There is also support for insert, update, and delete methods that take Spark DataFrames as input.
SnappyData would surely agree with the moving-data part. In fact, they published benchmarks comparing SnappyData against Spark combined with HBase, Cassandra, and MemSQL. As you would expect, that benchmark shows SnappyData's approach performing better.
There is no direct comparison between SnappyData and Splice Machine, however. Zweben says that SnappyData does not have the same granular MVCC to support true operational OLTP applications. Menon, for his part, maintains that their different approach of natively integrating GemFire as a first-class Spark citizen means better performance.
Perhaps this will remain a not-so-clear point. There are however points that are very clear.
Splice Machine has been around longer, has more mindshare and offers more deployment options; it recently added the option to run as a managed service on AWS, with Azure scheduled to follow soon. SnappyData, by contrast, can run both on-premises and on AWS, but not as a managed service, and will need to build its team and offering further.
SnappyData has something unique at this point: approximate query processing (AQP), without relying on a priori knowledge of data distribution. This is part of the Enterprise version, and means you can get approximate results for streaming data on the fly, while exact results are being calculated. Splice Machine also offers ways to join streaming to other data sources via virtual and external tables, but not AQP.
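The core idea behind AQP is that a query over a small sample, scaled up, can answer in milliseconds what an exact scan answers in minutes. The minimal sketch below estimates a SUM from a uniform 1% sample; real AQP engines, SnappyData's included, use stratified samples and provide error bounds, and the data here is synthetic, so treat this strictly as an illustration of the sampling-and-scaling principle.

```python
import random

random.seed(42)
data = [random.randint(1, 100) for _ in range(100_000)]  # stand-in for a table column

exact = sum(data)  # what a full scan would compute

# Approximate answer: sum a uniform sample, then scale by 1 / sampling rate
sample_rate = 0.01
sample = random.sample(data, int(len(data) * sample_rate))
estimate = sum(sample) / sample_rate

error = abs(estimate - exact) / exact
print(f"relative error: {error:.2%}")  # typically a low single-digit percentage
```

The trade-off is a touch of inaccuracy for a roughly 100x reduction in data scanned, which is exactly what makes approximate answers viable for interactive dashboards while the exact results catch up.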
The key takeaway, however, is the rapid growth and innovation this space is seeing, and the convergence of paradigms. Before Hadoop even turned 10, it moved to the background, superseded by Spark. And now Spark is becoming a platform for innovation, potentially offering the possibility of a unified theory and practice of data.