Zen and the art of data structures: From self-tuning to self-designing data systems

Designing data systems is something few people understand, and it's very hard and costly. But that, too, could be automated, says new research from Harvard, and we're about to start seeing it in real life.

What if the huge design space for data-driven software could be efficiently mapped and explored in order to have tailor-made, optimized solutions? Researchers from Harvard combine analytical models, benchmarks, and machine learning to make this possible.

special feature

IoT: The Security Challenge

The Internet of Things is creating serious new security risks. We examine the possibilities and the dangers.

Read More

Stratos Idreos is an Assistant Professor at Harvard, and leads the Data Systems Laboratory (DASlab). Idreos also happens to be an old friend and colleague in research from the mid-zeroes, and it's always a pleasure to catch up.

Idreos' work in Harvard is featured somewhat atypically for an academic researcher. That is, in a visual, fairly easy to navigate, and compelling way. There are many things we could spend hours talking about, but the one that immediately gets attention is self-designing data systems.

Data structures

Data structures are how we store and access data. A data structure design, as defined by Idreos and his team in a recent publication, consists of 1) the data organization, 2) an optional index, and 3) the algorithms that support basic operations (such as put, get, update).

All algorithms deal with data and most often the design of an algorithm starts by defining a data structure that will enable faster problem solving by minimizing computation and data movement.

Possible ways of designing algorithms and data systems are restricted by choices in data structures. For example, we can only utilize an optimized sorted search algorithm if the data is sorted and if we can maintain the data efficiently in such a state.

This makes data structures one of the most fundamental components of computer science, and this is why they are at the core of all subfields, including data systems (relational, key-value, graph, etc).


Each data structure design is a compromise between the read, update and memory amplification trade-offs. Image: DASlab / Harvard

Data structures are not something everyone understands and can readily use. This type of knowledge is commonplace among database administrators (DBAs) for example, since they have to work with data structures to fine tune their systems.

But to go beyond that, to synthesize and combine, or invent, data structures, and to understand the best design choices when building data systems, is something that a select few people in the world can master.

Idreos, being himself one of those few, is working towards a framework to take some of the burden off the shoulders of both DBAs and system designers.

Data Calculator: Mapping the design space

Expert data system designers are in short supply. And even for them, the process of designing and implementing a data system is complicated and costly. There is a vast design space of a combination of data structures, workloads and hardware configurations to explore. Making the right choices is complex, and it takes time.

People have to give lots of thought to their design, and then implement it and test it to see how it behaves. Idreos' goal is to automate this process as much as possible. The first step is what he calls a periodic table of data structures.

The periodic table of elements in chemistry organized elements into a universal structure, a design space based on their properties and components. This helped classify and explain connections among existing elements, and provided hints about discoveries not yet made.


A periodic table for data systems. Image: DASlab / Harvard

This is what Idreos wants to achieve with the periodic table of data structures, or Data Calculator. The approach is to map data structures and their properties in this table, and then use it to explore the vast search space of different combinations.

Data system design, explains Idreos, is a function of three factors: data structures, workload, and hardware. For example, a certain index may perform very well for reads in-memory, but not so well for writes on-disk.

Idreos and his team have created analytical models supplemented by benchmarks and machine learning that can estimate how well-suited a design might be for certain workloads and hardware configurations. This has some very interesting applications.

Get your RocksDB 1.000 times off, optimize in the cloud

Would you like to get 50 percent better performance out of your existing system, or to maintain the current level of performance, but with half the hardware? This is the kind of effect efficient system tuning can have.

Idreos' work is hugely ambitious, as it touches upon what is the state of the art today, and promises to significantly improve on this. To draw on some concrete examples, think of systems such as Oracle's self-tuning database, or similar offerings such as the ones from ScyllaDB or MemSQL.

Historically, notes Idreos, IBM and Microsoft were among the pioneers in exploring adaptive data systems. Traditionally, the focus has been on tuning indexing. Now this is expanding to other knobs, and this, Idreos speculates, is what Oracle does too.

Idreos' work is different in that it does not just tweak existing data structures such as indexes for example, but it can also reconfigure them to create new ones dynamically. Or at least, this is the vision.

For the time being, DASlab's first implementation may work differently from the Oracles of the world, but appears to have a similar, albeit magnified, effect. DASlab have worked with RocksDB, which Idreos says they have managed to reconfigure to achieve performance that may be from 1.000 to 10.000 times better for the same workloads.

Also: Big data architecture: Navigating the complexity TechRepublic

Another promising application may be tiering for cloud providers. As more and more applications and data move to the cloud, the problem of what to keep in fast access media and what to move to tape becomes increasingly important.

Tiering is the effort to answer the question of what hardware to use for storing data, or in a specific machine, in which memory area to store data. Idreos says that Data Calculator can capture such aspects.

It's an optimization problem: for a specific workload, and budget, find the optimal system, hardware, and memory allocation.

DASlab's research will be applied to using open source data formats in the cloud for such scenarios. Idreos says they will need a minimum of one year of research for this, and the hard part is how to search efficiently in an exponential search space.

From self-tuning to self-designing data systems

Impressive as the RocksDB results may be, the fact remains that the underlying data structures are not altered: RocksDB remains a key-value store.

DASlab's implementation works as an add-on layer on top of RocksDB. This layer inspects workloads and hardware configuration dynamically at run time, and uses Data Calculator to find and apply the optimal configuration.

This was a pragmatic choice made on the basis of restricting the search space, as limiting oneself to key-value data structures makes things more manageable. The vision goes well beyond that however.


It sounds science-fiction like, but may be closer than you think: Harvard's DASlab is working on self-designing data systems. Image: DASlab / Harvard

How about choosing the type of system that is best suited to handle workloads on a per-application basis? And not just among existing systems, too.

Fully-fledged, this research could lead to personalized system design, tailored to the needs of specific applications. And those systems would also be able to self-adapt at runtime, if the workloads change.

This may sound like science fiction, and Idreos estimates it's at least 10 years away. But it's well under way. Beyond presenting this research at the world's most prestigious conferences, Idreos is also working on commercializing it, although we are not at liberty to disclose details.

Machine Learning, and knowing what you're doing

Data Calculator uses a hybrid approach -- part analytical, part benchmarking, part machine learning (ML). Idreos has been relying on analytical approaches for a while, and half-jokingly remarks that "ML is for when you don't really know what you're doing."

But seriously, when discussing the Data Calculator approach compared to Oracle, for example, one obvious question is what kind of datasets Idreos' team could possibly use.

As opposed to the Oracles of the world, DASlab does not have access to tons of real-life database deployment operational data. Idreos notes however that for a good number of things, they know exactly what to do and how, and their analytical models are sufficient:

When you rely on ML exclusively, what you get is an approximate answer. But there are some good reasons why we use it.

First, as a research method that can point towards a good solution. Then with our analytical model it all comes down to an equation that takes 1 micro-second to run, and we have the optimal solution.

In our work for Key-value stores, the design space is huge, but we comprehend it quite well. We have built analytical models that work, so we don't really need ML. (Generalized) Data Calculator is different.

Idreos explains that they can't build an analytical model for every possible data structure -- at least not at this point. The design space is dynamic, it's growing, and it's hard to pin down. What they do is they incorporate domain knowledge, such as how access method primitives behave, and then synthesize an analytical model equivalent.

For example, they model the behavior of random access, scan, or binary trees. Then they use this to synthesize more complex structures such as indexes.

Also: What is machine learning? Everything you need to know

Those analytical models will be somewhat off, however, as they do not represent the world with 100 percent accuracy. In data structures, says Idreos, what you usually miss will be some data or hardware properties.

DASlab uses ML to train algorithms based on analytical and benchmark results. They have an analytical model and run some benchmarks on specific data and configurations, and results are then fed to ML algorithms as training data.

This enables them to answer questions such as "I want to run a scan on 5GB of data with those features on this hardware, how long will it take?", even without having built a precise analytical model for this.

Zen and the art of data structures

Similar to all ML approaches, choosing the right parameters to incorporate in those ML models is extremely important. In this case, it comes down to choosing the parameters that influence hardware and software configuration, even when not being certain exactly how.

For the next stages of this research, Idreos envisions a layered ML approach, based on reinforcement learning:

What we have built works like this: we input a design, and get a cost as an output. This is similar to ML labeling, in terms of labeling inputs. So we can use our hybrid algorithm to label training data for another ML algorithm layer to get more approximate answers.

Although the initial fruits of this research are already about to be commercialized, it may take a while before we see it unfold fully.

Idreos, however, approaches this with a Zen sort of attitude that makes one believe that self-designing data systems are more or less inevitable. Seems like one more area of creativity previously reserved for humans is en route to automation.