TileDB introduces canonical database storage format

A startup is addressing specific professional domains and vertical industry sectors not otherwise served by common relational, document, or key-value data models by taking a page out of the cloud database handbook.


Ever since the revelation that not all data can be neatly stored in rows and columns, it seems that barely a day goes by without the emergence of yet another new database with its own query engine and unique table or file format. TileDB has emerged to, in effect, scream, "Stop the insanity," with its quest to establish arrays as a universal storage format.

Unlike most database CEOs, TileDB founder Stavros Papadopoulos comes from the scientific, not the technology, community. What eventually became TileDB originated out of yet another Michael Stonebraker MIT project, SciDB, which offered a database engine suitable for use by research scientists because of its array structure and is now commercially available from Paradigm4. Because the data is not force-fit into columns and rows, an array can represent almost any kind of data structure -- and commercially it has been used to build multi-dimensional arrays that have some resemblance to the early generation of denormalized MOLAP databases.

But Papadopoulos identified one key drawback to SciDB -- it could not handle data sparsity very well. Sparsity is where many cells are empty or null, a scenario that is quite common for genomic data sets focusing on how species or individuals are differentiated from one another; for people, the typical deviation across the human genome is barely 0.1%. Theoretically, you could store all the redundant data, but that would be a huge waste of resources; as a result, most genomic data sets are highly sparse.
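To make the cost of sparsity concrete, here is a minimal sketch in plain Python. It is not TileDB's on-disk format; the helper names (`to_sparse`, `read_cell`) and the 1,000 x 1,000 grid are assumptions chosen only to mirror the roughly 0.1% density mentioned above. It compares storing every cell with storing only the coordinates of the non-empty ones.

```python
def to_sparse(dense, fill=0):
    """Keep only cells that differ from the fill value, keyed by (row, col).

    Hypothetical helper for illustration; real array engines use far more
    compact, tiled layouts than a Python dict.
    """
    return {
        (r, c): value
        for r, row in enumerate(dense)
        for c, value in enumerate(row)
        if value != fill
    }


def read_cell(sparse, r, c, fill=0):
    """Read one cell; anything not stored is implicitly the fill value."""
    return sparse.get((r, c), fill)


# Assumed toy data: a 1,000 x 1,000 grid with exactly one non-empty cell per
# row, i.e. 1,000 of 1,000,000 cells (0.1%) carry information.
ROWS = COLS = 1000
dense = [[0] * COLS for _ in range(ROWS)]
for i in range(ROWS):
    dense[i][(i * 7) % COLS] = 1

sparse = to_sparse(dense)
density = len(sparse) / (ROWS * COLS)

print(f"dense cells stored:  {ROWS * COLS}")
print(f"sparse cells stored: {len(sparse)}")
print(f"density: {density:.1%}")
```

The dense layout spends a slot on every one of the million cells, while the coordinate layout stores a thousand entries and treats everything else as empty. An engine that optimizes for sparsity is, broadly, exploiting that gap.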

So founder Papadopoulos left the ivory tower at MIT and, initially backed with seed funding from Intel Capital, started TileDB. It picks up where SciDB leaves off by building sparsity into its optimizations and, unlike most databases, concentrates entirely on data storage and management while leaving the compute/query engine pluggable. That's the reverse of what databases like MySQL and MariaDB do, where they feature a common compute tier but make the storage engine pluggable. On the storage side, for instance, TileDB versions data, supports "time-traveling" (we presume, through snapshots), and handles housekeeping tasks such as access control, logging, and managing metadata.
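The division of labor described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TileDB's API or internals: the class `VersionedArrayStore`, its `write` and `read` methods, and the `as_of` parameter are all hypothetical names, and the timestamped-batch approach is just one plausible way to get "time travel." The storage layer owns versioning; the compute "engines" are ordinary functions plugged in on top.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int]


class VersionedArrayStore:
    """Hypothetical storage layer that keeps every write as an immutable,
    timestamped batch of cells, so older states can be read back later."""

    def __init__(self) -> None:
        self._batches: List[Tuple[int, Dict[Coord, float]]] = []

    def write(self, timestamp: int, cells: Dict[Coord, float]) -> None:
        # Never overwrite in place: append a new batch and keep time order.
        self._batches.append((timestamp, dict(cells)))
        self._batches.sort(key=lambda batch: batch[0])

    def read(self, as_of: Optional[int] = None) -> Dict[Coord, float]:
        """Merge batches oldest-first; later writes win. With as_of set,
        batches newer than that timestamp are ignored ("time travel")."""
        view: Dict[Coord, float] = {}
        for timestamp, cells in self._batches:
            if as_of is not None and timestamp > as_of:
                break
            view.update(cells)
        return view


# Pluggable "compute engines": any function that consumes a view works,
# because the store knows nothing about how the data will be queried.
def total(view: Dict[Coord, float]) -> float:
    return sum(view.values())


def non_empty_count(view: Dict[Coord, float]) -> int:
    return len(view)


store = VersionedArrayStore()
store.write(1, {(0, 0): 1.0, (5, 9): 2.0})
store.write(2, {(5, 9): 4.0, (7, 7): 3.0})

print(total(store.read()))            # latest state
print(total(store.read(as_of=1)))     # state as of timestamp 1
print(non_empty_count(store.read()))  # a different engine, same storage
```

Swapping `total` for `non_empty_count` changes nothing in the store, which is the point of keeping the compute tier pluggable; the MySQL-style design inverts this by fixing the query layer and swapping the storage underneath.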

Yet in some ways, TileDB follows a design pattern that is already familiar in the cloud database world, where the storage engine is common but exposed through different APIs. Microsoft Cosmos DB is the best known public example of this approach, having a core storage tier with APIs for SQL, JSON, graph, and wide column. Additionally, Amazon Aurora and Keyspaces, along with Google Cloud Spanner and Cloud Datastore, all expose underlying storage engines through APIs.

TileDB offers two products: TileDB Embedded, an open-source, cloud-native storage library for multi-dimensional arrays, and TileDB Cloud, a serverless SaaS offering for sharing data and code and enabling efficient computations that currently runs on AWS and uses S3 for physical storage.

By leveraging cloud storage, abstracting the compute and query engine, and designing its cloud offering to be serverless, TileDB is promoting its ability to scale. Having recently announced $15 million in Series A funding, the company is initially targeting use cases in genomics and geospatial.