Spark: The big data tool du jour is getting automation
You probably did not hear it here first. Spark has been making waves in big data for a while now, and 2017 has not disappointed anyone who bet on its meteoric rise. That was a pretty safe bet, actually: market signals, pundits and the data all pointed in the same direction.
Spark adoption is booming. Its community is growing, and all major big data platforms make a point of interoperating with Spark. If you look at its core contributors and project management committee (PMC) you will see Hadoop heavyweights Cloudera and Hortonworks, and all-round powerhouses such as IBM, Facebook and Microsoft.
You will also see a name you may not recognize, but one that dominates Spark's current development and future direction: Databricks. Databricks is a startup founded by members of the Berkeley team that created Spark, including Matei Zaharia, Spark's original author, and Ali Ghodsi. Ghodsi and Zaharia, who started out as fellow researchers and friends in their Berkeley days, are the CEO and CTO of Databricks.
Last week the Spark Summit Europe event attracted more than 1,000 attendees in Dublin. Ghodsi and Zaharia were both there to share news, engage with the community and talk shop. ZDNet was there too, and the topics we discussed ranged from the strategic to the hard-core technical.
Dublin set the stage for the latest addition to Databricks' arsenal: Delta. In a way, Delta perfectly represents the direction and philosophy of Databricks and its founders. It can be summarized as a smart cache layer on top of AWS S3 storage that lets you do all your data processing at cloud scale and throughput, with Azure and Google Cloud soon to follow.
It sounds evolutionary rather than revolutionary, in the sense that this is something that has been going on for a while. Databricks has been moving in that direction too, so when the conversation turned to Delta, the obvious question for Ghodsi was: great, but what exactly is new here?
Databricks pitches Delta as a platform that combines streaming and batch processing, data warehouses, collaboration and machine learning (ML) all in one, while running in the cloud to offer scale and elasticity. Ghodsi explains that product development was customer-driven, not just in the sense of responding to needs but also making customers part of the development loop.
But why try to shape Spark into a data warehouse, and how would that work?
The reason is that data warehouses do have advantages in terms of performance and governance, and hearing from customers how they kept moving data around between their data lakes and data warehouses inspired Databricks to take action. Data lakes complement data warehouses with cheap storage and the separation of compute and storage, so the idea was to get the best of both worlds.
While it is true that Spark in the cloud is nothing new, and Databricks already had its own managed version with added goodies in terms of collaboration and runtime, called Unified Analytics Platform, Delta does bring some new things to the table. "We basically added transactions and metadata," says Ghodsi, and that goes a long way.
Metadata may be a modest term for everything that is going on under Delta's hood. That includes things such as data compaction, schema matching, statistical query optimization and serverless deployment. All of these are, to some degree at least, powered by automation and machine learning.
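To make the statistical query optimization part concrete, here is a toy sketch of statistics-based data skipping in plain Python. Each "file" keeps min/max values for a column, so a query can ignore files whose range cannot possibly match the predicate. All names here are illustrative; this is the general idea, not Delta's actual implementation.

```python
# Toy data skipping: per-file min/max statistics let a point query
# avoid scanning files that cannot contain the target value.
# Hypothetical structures for illustration; not Delta's real format.

files = [
    {"path": "part-0", "min_id": 1,   "max_id": 100, "rows": list(range(1, 101))},
    {"path": "part-1", "min_id": 101, "max_id": 200, "rows": list(range(101, 201))},
    {"path": "part-2", "min_id": 201, "max_id": 300, "rows": list(range(201, 301))},
]

def point_query(target):
    """Scan only the files whose min/max range can contain target."""
    scanned, result = [], []
    for f in files:
        if f["min_id"] <= target <= f["max_id"]:  # stats check: can this file match?
            scanned.append(f["path"])
            result.extend(r for r in f["rows"] if r == target)
    return result, scanned

rows, scanned = point_query(150)
# Only "part-1" is touched; the other two files are skipped entirely.
```

The same min/max bookkeeping generalizes to range predicates, which is what makes point and range SQL queries on a data lake cheap.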
In terms of data ingestion, schemas can be defined that allow for data validation. This ensures data entering Delta is clean, which is a given in the data warehouse world but not in data lakes. This is standard SQL DDL, and although Delta's transaction layer is proprietary, Ghodsi implied it will eventually be open sourced too.
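The schema-on-write idea can be sketched in a few lines of plain Python: records that do not match the declared schema are rejected before they land in the table. The schema and helper below are hypothetical illustrations of the concept, not Delta's actual API (which, as noted, uses standard SQL DDL).

```python
# Toy schema-on-write validation: only records that match the declared
# columns and types are admitted. Illustrative only; not Delta's API.

SCHEMA = {"user_id": int, "event": str, "amount": float}

def validate(record, schema=SCHEMA):
    """True if the record has exactly the declared columns with the right types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[col], typ) for col, typ in schema.items())

incoming = [
    {"user_id": 1, "event": "purchase", "amount": 9.99},      # clean
    {"user_id": "oops", "event": "purchase", "amount": 1.0},  # wrong type
    {"user_id": 2, "event": "refund"},                        # missing column
]

clean = [r for r in incoming if validate(r)]
rejected = [r for r in incoming if not validate(r)]
# One clean record gets in; two dirty ones are turned away at ingest.
```

This is the guarantee data warehouses have always offered and data lakes traditionally have not: bad records are refused at the door rather than discovered at query time.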
What is already open source is the storage format for the data itself. Delta uses Parquet for this, which, as Ghodsi emphasizes, means that unlike with traditional data warehouses, you can at any point take your data in a portable format and do what you want with it.
So although it sounds simple, a smart cache layer can bring an array of benefits. This is great if you work in the cloud and are willing to pay for Delta, but what about on premise and the core, open source Spark platform?
"It would have been easy to get Delta to work on top of HDFS, for example, but on premise is not part of our strategy," says Ghodsi. "We are inundated by requests for our cloud products and we are growing as fast as they are," he adds.
But if Databricks is mainly concerned with growing its proprietary cloud platform, and other vendors are also doing the same, who is going to contribute to core Spark, and where does that leave Spark users? Ghodsi is fast to address those concerns.
"Innovation in Spark is accelerated because Databricks is a cloud company" says Ghodsi. "We battle test the codebase in the cloud and then it finds its way to Spark. Databricks is by far the biggest contributor, and features like Spark SQL were developed and tested by us in the cloud and then contributed to Spark."
"We still make sure Spark works where it needs to, for example with YARN or Cassandra, but we are not going to contribute resources for things that are on premise. Other vendors can pick this up though, so it's OK."
Ghodsi says that this has been their intent with Databricks all along, citing shorter iterations as a major reason:
"With on premise software, you have to wait around two years from the moment you implement something until you roll it out and get feedback - it's like flying blind. It has to be included in the next version, sales has to sell it, professional services has to upgrade, and only then may you hear whether people are happy using the software or not. We now have two-week sprints, and upgrades are done in no time."
So it's all cloud for Databricks, which also happens to be in the driver's seat of the most popular big data platform of the moment. Spark comprises many elements brought together to help teams working with data extract insights. Add Databricks' proprietary extensions, and what you get is an Insight Platform as a Service (IPaaS).
It's not the only one of course. To begin with, there is competition from cloud providers, each of which offers core Spark in addition to its own tools for working with data in the cloud. While tempting for organizations that already have a footprint in these clouds, such offerings also entail lock-in and are not necessarily best of breed.
Choosing an insight platform will be the next great platform decision, and other vendors are in this game as well. How do Databricks' founders compare themselves against other options?
Of Qubole, for example, Ghodsi says: "they focus on Hadoop, things like Hive and MapReduce optimization. This is not something we do, not interested in that, and we don't see them in deals that much." Maybe so, but Qubole claims to offer managed versions of Spark and to do so across clouds, which Databricks does not at the moment, in addition to basing its value proposition on automating workloads.
And what about Hadoop and its key vendors? They were not included in Forrester's evaluation or our conversation, but arguably they may be part of the insight platform decision. It's not so much about Hadoop itself anymore, but about what you can do with it. Hadoop vendors are apparently aware of this and are going the IPaaS way, each at its own pace and in its own fashion.
Then there are some less obvious options, also not included in analyst reports on that space. Names that come to mind include Kafka / Confluent, Flink / data Artisans, SnappyData and Splice Machine, in the sense that these platforms are alternatives to some aspects of Spark.
Confluent has recently added its own managed cloud version of Kafka, and is expanding its reach and ambitions. Kafka is the most popular entry point for streaming architectures, and Confluent wants to grow Kafka into a platform to build microservices on. It is also adding features such as transformations and SQL on streaming data.
Ghodsi and Zaharia point out that while there is some overlap, Spark is all about integration: "We work with Kafka and see it as a complementary solution. But Kafka's transformation and SQL can only take you so far. If you want to do batch and streaming data, and some machine learning on top, as most of our clients do, our integrated tools and APIs is what you need".
What if streaming is your main concern though? Should you go for Spark Structured Streaming, or the portability and flexibility that Apache Beam promises? Beam is a project started by Google and donated to Apache, with the goal of acting as an interoperability layer for streaming engines such as Google's Dataflow, Flink or Spark.
"There's nothing stopping you from running the Beam model on Spark, it's just that people in the Beam project have to implement this," says Zaharia. Google's Tyler Akidau, who is also Apache Beam's mastermind, mentioned in a talk a few months back that this was left to the Spark community. Confluent's CEO also said they are not interested in Beam unless it adds support for tables.
There seems to be a bit of a stalemate there, and Spark's PMC does not seem to see much value in pursuing Beam compatibility at this point. Zaharia says they tried to come up with an easier-to-use version of Beam by keeping its key idea, the separation of query and trigger, while making the two independent:
"You just give Spark structured streaming your query and it will make it incremental, then it can run when triggered or at will, you also have control over the output and it's a simpler combination of operators."
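What "making a query incremental" means can be illustrated with a toy sketch in plain Python: instead of recomputing an aggregate over all data on every trigger, the engine keeps running state and folds each new micro-batch into it. The class below is purely illustrative; Spark Structured Streaming does this internally, and its user-facing API is just the query itself.

```python
# Toy incremental aggregation: running state is updated per micro-batch,
# so the current result is available at any trigger without a full
# recomputation. Illustration only; not Spark's implementation.

class IncrementalCount:
    def __init__(self):
        self.state = {}  # key -> running count

    def on_batch(self, batch):
        """Fold one micro-batch of keys into the running state."""
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
        return dict(self.state)  # current answer, cheap to produce

agg = IncrementalCount()
agg.on_batch(["a", "b", "a"])          # first micro-batch
result = agg.on_batch(["b", "c"])      # second micro-batch
# result == {"a": 2, "b": 2, "c": 1}
```

The point of Zaharia's design is that the user never writes this state-handling code: you declare the query, and the engine derives the incremental version and runs it on whatever trigger you choose.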
Spark structured streaming uses Spark's SQL engine under the hood, and for Zaharia this is what makes the difference:
"We spent a lot of time developing it, but now you can use it for all your queries. You can use SQL standard functions and data types as well as user defined ones. It can't handle 100% of SQL yet, but none of the other streaming engines can either."
Zaharia also referred to some benchmarks comparing Spark Structured Streaming against Kafka and Flink. He says they used the same configuration Kafka and Flink had used, but with four times fewer nodes, and the difference in performance was staggering: "If you can run 10 nodes instead of 100, you save not just money, but also complexity. This is what we tried to optimize."
Speaking of SQL, does Spark intend to go the SnappyData / SpliceMachine way and try to address both transactional and analytical workloads?
Ghodsi says that using Delta's data skipping capabilities, you could do point SQL queries. Using Spark as the back end for your web store, for example, is not something Databricks would normally recommend, yet some people are doing just that. What Databricks would recommend is using Spark Structured Streaming combined with a transactional database, to do things such as keeping calculated statistics up to date.
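The pattern Ghodsi describes can be sketched with Python's built-in sqlite3 standing in for the transactional database: a streaming job computes per-key aggregates for each micro-batch and upserts them into the store inside a transaction, which then serves fast point lookups. The streaming side is simulated here, and the table and function names are hypothetical.

```python
# Sketch of "streaming job keeps calculated statistics up to date in a
# transactional store". sqlite3 stands in for the database; batches are
# simulated lists of events. Illustrative names throughout.

import sqlite3
from collections import Counter

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE stats (key TEXT PRIMARY KEY, cnt INTEGER)")

def write_batch(batch):
    """Upsert per-key counts for one micro-batch, atomically."""
    with db:  # one transaction per batch: commits on success, rolls back on error
        for key, n in Counter(batch).items():
            row = db.execute("SELECT cnt FROM stats WHERE key = ?", (key,)).fetchone()
            if row is None:
                db.execute("INSERT INTO stats VALUES (?, ?)", (key, n))
            else:
                db.execute("UPDATE stats SET cnt = ? WHERE key = ?", (row[0] + n, key))

write_batch(["page_a", "page_b", "page_a"])
write_batch(["page_b"])

# The store now serves cheap point lookups on always-fresh statistics.
count = db.execute("SELECT cnt FROM stats WHERE key = 'page_b'").fetchone()[0]
```

In a real deployment the `write_batch` role would be played by a Structured Streaming sink writing to the transactional database of your choice, one micro-batch per transaction.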
Zaharia commends SnappyData's approach, noting that while there are other approaches that try to bypass the immutability of Spark's core data structure (the RDD), SnappyData took the extra effort to integrate Spark with a transactional store at a low level.
Zaharia also confirms that some of SnappyData's solutions, on indexing for example, have been adopted by core Spark too. But walking down the road towards OLAP/OLTP convergence themselves does not seem to be in Databricks' immediate plans.
So where are Spark, and Databricks, headed next? Zaharia says the biggest things they are currently working on are streaming and deep learning (DL). These are the two fastest-growing areas, and making them work together is a major goal for Databricks.
"Nobody will tell you it's simple, but we think it does not have to be complicated," says Zaharia. He says the main reason things are complicated is that the tools are new and not that well integrated - much like the situation in Hadoop a few years ago. Zaharia recounts:
"MapReduce was awesome, and it let you do things that were not possible before. But to build an application you needed to combine four or five different systems. The key thing you need in programming is composition. It's OK to implement one MapReduce job, but most algorithms need more than ten. In Spark you just use functions; you don't even need to know what the operators inside them are.
Now, if your execution engine knows how to do both loading data and training the algorithms, then you don't need separate systems. This is what we are doing. We have one API through which you can do batch, streaming and joins, and this simplifies things.
We are now adding DL in our ML pipeline. It turns out if you add DL operators there, many common use cases are easy to implement. If you look at DL frameworks like TensorFlow, they are designed for people developing new models.
For our pipelines, we are focused on people who want to take existing models and train them on their data to apply to their problem. In our demo we built a state-of-the-art visual search in seven lines of code.
People want to build end-to-end applications quickly, and what matters most is being able to compose applications out of little pieces easily and efficiently. The point of Spark is to have a standard library of algorithms you can mix and match, and to know you can just use them and get good performance."
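Zaharia's composition point can be illustrated with a toy sketch in plain Python: small reusable functions are chained into one pipeline instead of wiring separate systems together. Plain functions stand in for Spark operators here; the helper and stage names are made up for illustration.

```python
# Toy pipeline composition: small single-purpose stages chained into one
# callable, the way Spark lets you chain operators without knowing
# their internals. Illustration only; not Spark's API.

from functools import reduce

def compose(*stages):
    """Chain single-argument stages left to right into one pipeline."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Three small "operators" a user can mix and match.
parse   = lambda lines: [line.split(",") for line in lines]
cleanse = lambda rows: [r for r in rows if len(r) == 2]
total   = lambda rows: sum(int(v) for _, v in rows)

pipeline = compose(parse, cleanse, total)
result = pipeline(["a,1", "b,2", "broken", "c,3"])
# result == 6: the malformed row is dropped and the values are summed
```

The design point is that each stage is useful on its own and the pipeline is just their composition, which is the property Zaharia contrasts with stitching together four or five separate systems.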
Again, Databricks is not alone in its push for DL here. Since Hadoop is the backbone of so many massive data lakes, it makes sense to use the data stored there for DL. MapR was the first to introduce a dedicated solution, called QSS; Cloudera touts its Data Science Workbench as its solution for DL; and Hortonworks acknowledges the connection but is not really active in this space yet.
It's not hard to see why Spark is as successful as it is: an emphasis on usability and performance, an integrated framework, a strong open source foundation and community, and founders with solid technical backgrounds. Some may argue that's not necessarily what it takes to scale a company with Databricks' ambition and influence, but so far it seems to be working.