Full SQL on Hadoop? Splice Machine opens up its database for trials

Summary: Splice Machine has brought together technology from two Apache projects in its quest to create a SQL-on-Hadoop database.

Monte Zweben: Scale linearly on the Hadoop platform. Image: Splice Machine

Splice Machine says the launch this week of its public beta opens up the Hadoop stack for the first time to a full-featured SQL database, capable of running transactions and analytics simultaneously.

The company, which has been working with 15 charter customers and raised $15m in funding in February, has made its eponymous product available as a free download to encourage testing and development.

"It's radically different to anything else that's out there because it's the first true ANSI SQL transactional database on the Hadoop stack. It can power concurrent applications. People can read and write from the database at the same time," Splice Machine CEO and co-founder Monte Zweben said.

"This is not just data science anymore for Hadoop, where a batch of data is loaded up into the Hadoop file system, you run some analytics on it using MapReduce or even a SQL layer, and then dump the results back into a report.

"This is about real-time, concurrent applications, and this has not been feasible until now."
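
The workload Zweben describes — concurrent reads and writes with transactional guarantees, expressed in standard SQL — is illustrated below with a minimal sketch using Python's built-in sqlite3 module as a stand-in for any ANSI SQL transactional database. This is not Splice Machine's API, just a demonstration of what "transactional" means here: a multi-statement update that commits or rolls back as a unit.

```python
import sqlite3

# Stand-in for an ANSI SQL transactional database: an atomic transfer
# between two accounts, committed or rolled back as a single unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if an exception is raised
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # transaction rolled back; balances unchanged

print(dict(conn.execute("SELECT id, balance FROM accounts")))
# → {1: 70, 2: 80}
```

Readers and writers can work against such a database concurrently, with the isolation rules of the engine deciding what each transaction sees — the property Zweben argues was missing from batch-oriented Hadoop workflows.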

Splice Machine has taken the Apache Derby Java relational database and removed its storage layer, replacing it with the Apache HBase NoSQL database. Then the company modified the planner, optimiser and executor inside Derby to take advantage of HBase's distributed architecture.

"Now what happens is that Derby, like its original version, compiles a SQL plan out into byte code — a very efficient representation of a SQL execution — and we can distribute that out to the HBase nodes, so the computation can take place in parallel and close to where the data is stored for the maximal efficiency," Zweben said.

"Then we splice the results back together again — hence the name of the company."
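
The compile-once, execute-in-parallel, merge-at-the-end flow Zweben outlines is a classic scatter-gather pattern. A conceptual sketch in Python — hypothetical names, not Splice Machine's actual code — where a plan fragment runs against each node's local shard of the data and the partial results are then spliced together:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each "node" holds one region of the table.
shards = [
    [("alice", 3), ("bob", 5)],    # node 1's region
    [("alice", 2), ("carol", 7)],  # node 2's region
]

def run_fragment(shard):
    """The distributed plan fragment: a partial SUM(...) GROUP BY,
    computed close to where that node's data is stored."""
    partial = {}
    for key, value in shard:
        partial[key] = partial.get(key, 0) + value
    return partial

def splice(partials):
    """Merge ('splice') the per-node partial aggregates into the final answer."""
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = result.get(key, 0) + value
    return result

with ThreadPoolExecutor() as pool:
    print(splice(pool.map(run_fragment, shards)))
# → {'alice': 5, 'bob': 5, 'carol': 7}
```

The design choice being illustrated: shipping a small compiled plan to the nodes is far cheaper than shipping the data to a central executor, which is what makes the parallel, data-local execution Zweben describes efficient.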

According to Zweben, the skills required to use the Splice Machine software are a knowledge of SQL and some familiarity with the Hadoop file system, to be able to configure and install it.

"But they do not need to be Java programmers and they do not need to know MapReduce," he said.

"The significance is that heretofore being able to power real-time applications on the proven Hadoop stack was limited only to very low-level, key-value storage systems like HBase.

"The vast community of application developers and IT people really could not take advantage of it because they would have to be able to program in Java."

Zweben said Splice Machine is not offering all its code back to the Hadoop and HBase projects but it is a contributor.

"We fix bugs and contribute back into HBase and Hadoop where it makes sense for the whole community, as good participants in that community. But we also have some proprietary software as well — kind of like what everyone does," he said.

"They keep some IP proprietary for the shareholders of the company but they contribute to the open-source community openly and we are just like that."

Zweben said the database market is currently confusing because of the variety of options available but for those whose databases are under pressure the first choice is whether to scale up, with expensive proprietary hardware, or scale out on cheaper commodity clusters.

Even if they choose scale-out, people then face three options.

"The first is in my opinion a very poor choice and that's NoSQL. NoSQL definitely has the scale-out feature in its architectures but unfortunately you throw the baby out with the bath water," Zweben said.

"All the services that SQL provides need to be written now at the application level by the developers, which is costly and error prone, whether that is aggregation or joins or transactions. All these things, plus all the tools around SQL, have to be rewritten and that's a mistake.

"NoSQL may be great for simple web pages but if you're doing a real app you need an SQL database."
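
Zweben's point about rewriting SQL's services by hand can be made concrete: a join that is one declarative statement in SQL becomes hand-written application logic against a key-value store. A rough illustration (illustrative names and data only), again using Python's sqlite3 for the SQL side:

```python
import sqlite3

users = {1: "alice", 2: "bob"}
orders = [(1, "book"), (1, "pen"), (2, "mug")]  # (user_id, item)

# Key-value style: the developer writes the join logic in application code,
# and must also handle missing keys, ordering, aggregation, etc. themselves.
manual = [(users[uid], item) for uid, item in orders]

# SQL style: the database plans and executes the join from one statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", users.items())
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql = conn.execute(
    "SELECT u.name, o.item FROM users u JOIN orders o ON u.id = o.user_id"
).fetchall()

print(sorted(manual) == sorted(sql))
# → True
```

Multiply this by every join, aggregation, and transaction in a real application — plus the reporting and BI tools that speak SQL — and the cost Zweben is pointing at becomes clear.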

So if the choice is SQL, the options become whether to scale out on a proprietary architecture or on Hadoop.

"People around the world developing on Hadoop and HBase — two scale-out and open-source technologies — are contributing so much IP and technology in terms of both the architectures of Hadoop and HBase, plus all the systems around Hadoop and its ecosystem," Zweben said.

"Our competitors in the NewSQL space, who are essentially scale-out vendors, have to write all that code themselves. They have to write a distributed file system. They have to write a fast key-value store. They have to build all the other tools."

Zweben said new versions and new functionality for Splice Machine will be announced this year, including work on architectural extensions such as in-memory computation.

The ACID-compliant product is available on a freemium basis for test developments and experimentation. For production environments, Splice Machine carries a list-price annual licence fee of $5,000 per node.

Topics: Big Data, Enterprise Software, Open Source

Talkback

2 comments
  • not an open source project

    don't bother...
    mikeonaft
  • Not the first true ANSI SQL database on the Hadoop stack.

    Unfortunately the statement is not entirely accurate. EMC developed a product called "HAWQ" which was the Greenplum database over HDFS - this was over a year ago. Greenplum was an MPP implementation of PostgreSQL 8.2 and was ANSI compliant as PostgreSQL was, as is HAWQ. These products are now owned and sold by EMC subsidiary Pivotal. The article perhaps should also point out that transactions or "write" in an HDFS context is always append only as that's all that's supported by the filesystem. There are no updates or deletes (afaik) in HDFS so any implementation of SQL over HDFS is restricted to non-updating operations (some implementations support truncate table, but that's as far as you can go.)
    markbur