HarperDB: An underdog SQL / NoSQL database

HarperDB flies in the face of conventional wisdom in a number of ways. But does the world need another database?


Start coding a new database in your garage with a buddy of yours. Use JavaScript. Name it after your dog. Patent your own data model. Do not go open source. Take over the world.

That is HarperDB's recipe for success. It seems so unlikely, it makes you wonder whether it's genius or just crazy.

The exploded data model

Stephen Goldberg and Kyle Bernhardy do not seem like crazy people. They have long-standing experience in enterprise consulting, and this is precisely what got them started on HarperDB.

Goldberg and Bernhardy liked the scale and ease of use of NoSQL, but still wanted ANSI SQL for actionable analytics. They wanted the ability to perform multi-table joins and multi-condition statements.

Them, and pretty much everyone else in the database world. The convergence of SQL and NoSQL solutions is something that has been going on for a while. A typical way to deal with this requirement is multi-model databases. But Goldberg and Bernhardy decided to take a different approach.

They felt multi-model was inherently flawed as a design pattern, were frustrated by the performance of data lakes and MapReduce solutions, and wanted something that would be ACID-compliant.

They thought a single model was needed to accommodate all of the above, so they went ahead and created what they call the exploded data model, which is also the basis of their patent.


The exploded data model is the patented approach the creators of HarperDB came up with to accommodate both SQL and NoSQL. Image: HarperDB

In the exploded data model, each attribute from a JSON object, or column from a SQL insert/update statement, becomes an index upon write. These attributes and their values are stored discretely on disk.

Goldberg and Bernhardy say this avoids the need to configure foreign keys and indexes, and allows every attribute/column to be indexed without increasing the disk footprint, since HarperDB does not store the entire record whole or keep separate index tables.

Upon search, parallelization is used to coalesce the data back into an object, based on which columns are requested. This, Goldberg and Bernhardy note, has the added benefit of making joins as performant as a single-table search:

"Our data model allows for both read and write concurrently at high throughput. Each attribute transaction is discrete, and we don't experience row locking or need in-memory transformation, which often plague database solutions and cause them to fail under HTAP scenarios."

No schema, no maintenance

It sounds less crazy now, although at first glance the data-coalescing approach does not seem completely different from the multi-model one. Evaluating this would mean either having access to their implementation and patent, or benchmarking against competing solutions, and these are options we do not have.

What still sounds a little crazy, though, is taking on established solutions in a market as crowded as the database market. Goldberg and Bernhardy say they are not trying to compete against entrenched solutions, but rather to work alongside them and augment them.

That is part of the reason why they are launching today with a focus on IoT: they note there are a lot of greenfield projects there, which need new architectural patterns to succeed and scale.

They also aim to work alongside traditional SQL data warehouses as a sidecar, providing real-time SQL capability for unstructured data via their JDBC driver, or exposing column/row data from SQL databases to applications that were designed to interact with JSON.


HarperDB specifically targets IoT use cases, due to its small footprint and the fact that IoT is a relatively new field. Image: HarperDB

HarperDB advertises itself as schema-less and configuration-free. Goldberg and Bernhardy clarify that it is more accurate to say HarperDB has a dynamic schema, and that "no configuration" refers to the fact that no configuration for columns, foreign keys, data types, or indexes is needed.

HarperDB has the concept of schemas, tables, and attributes. Schemas and tables only provide namespaces for finding attributes and creating logical collections. Attributes are reflexively created on insert/update and do not have data types, although the ODBC and JDBC drivers sample data to suggest data types in BI tools.
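What a dynamic schema like this implies can be sketched in a few lines. Again, this is a toy model with hypothetical names, not HarperDB's code: the point is only that attributes appear reflexively as records are inserted, with no types declared up front.

```javascript
// A toy table: attribute metadata is just a set of names, with no types.
class Table {
  constructor() {
    this.attributes = new Set(['id']);
    this.records = [];
  }
  insert(record) {
    // Any previously unseen attribute is added to the table on the fly,
    // so no column configuration is ever needed.
    Object.keys(record).forEach(a => this.attributes.add(a));
    this.records.push(record);
  }
}

// Schemas are just namespaces grouping tables into logical collections.
const dev = { dog: new Table() };
dev.dog.insert({ id: 1, name: 'Harper' });
dev.dog.insert({ id: 2, name: 'Penny', breed: 'pit bull' }); // new attribute
console.log([...dev.dog.attributes]); // → [ 'id', 'name', 'breed' ]
```

The second insert silently widens the table, which is the behavior the founders describe as a dynamic schema rather than no schema at all.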

From the garage to the world

Goldberg and Bernhardy also say the design thinking behind HarperDB was to make it so easy that a developer of any skill level could use it. They wanted to internalize the majority of the complexity of developing a database rather than offloading that complexity onto the developer. They say the install process requires answering five questions and takes about one or two minutes.

They mention running a hackathon with 78 teams in which only one developer asked a question regarding implementation. They add that since releasing the beta in August 2017 they have received fewer than five support requests, despite nearly 800 downloads from 670+ developers.

This touches on an important point: what kind of support can you expect from HarperDB, what does the team behind it look like, and what is the outlook for growth?

Although the core HarperDB team is experienced and tightly knit, having worked together for a long time, the entire organization employs eight people at this point. HarperDB has raised roughly $1.3 million in funding to date and is in the process of raising another $750K to $1 million, targeting a potential Series A round in 12 to 18 months.

Goldberg and Bernhardy say this will enable them to grow the team with engineering and sales talent, and they are working with embedded-device and system-integration partners to sell the product and provide support.

Node.js for the win, in IoT and beyond?

Goldberg and Bernhardy make a point of leveraging Node.js for talent recruitment. They say they chose Node.js partly because it is easy to learn and most developers already know JavaScript, which makes bringing on developers significantly easier.

The decision to build a complex system like this on JavaScript would probably have been frowned upon a few years back. But for Goldberg and Bernhardy, Node.js is a competitive advantage.


HarperDB makes a point of being lightweight and easy to configure, owing in part to using Node.js for development. Image: HarperDB

They cite Stack Overflow's developer surveys for the last couple of years, in which Node.js ranked #1 and #2 in popularity and was the technology most commonly used in IoT, which is part of the reason they target IoT:

"HarperDB on a micro-computing device is not a slimmed down version, a gateway solution, or a caching mechanism, but a full HTAP database running directly on the device with clustering. It is the same code base as the server edition.

It is stateless which allows for very little resource usage when not in use like CPU, RAM and most importantly battery life.

People expressed concerns about the need for HarperDB to get closer to the operating system and Node.js allows for you to utilize C/C++ libraries natively. We have not yet found the need to do this however.

This gives us further room to innovate and opportunity for performance and feature gains. We have also been able to deliver the product incredibly quickly due to the amazing community around NPM, the wealth of supported libraries, and the ease of use of NPM.

Because Node.js is written as a web first language we've seen amazing benefits from using it for things like clustering with Socket.io and Express for our API. And we've had good experiences interacting with the Node.js community."