As we noted when we discussed the general release of Amazon Timestream last fall, time series platforms are an old-but-suddenly-new category in the database landscape. Although IoT is often cited (or blamed) for the uptick in time series database activity, there are numerous scenarios (e.g., in capital markets, transportation, and logistics) where time is the defining parameter.
But let's get this confession off our chests right now: TimescaleDB is a brand name that is easily confused with Amazon Timestream (OK, Timescale came to market first). As a result, we often find ourselves tripping over this nearly identical branding and entering global replace mode to make sure we put the right names in the right sentences.
Timescale is one of those quasi-open source companies that has applied its own licensing flavor of the month to encourage customers to play with and improve the code but prevent the AWSes of the world from launching database cloud services on its community edition. The 2.0 version released a few months back slightly liberalized the licensing to encourage customers to tweak the code.
Unlike the better-known InfluxDB, Timescale is very much in the SQL mold, like Amazon Timestream. But unlike Timestream, TimescaleDB is a PostgreSQL variant. Thus, TimescaleDB joins what is literally a crowd in the PostgreSQL community, but it stands out as one of the few, if not the only, PostgreSQL variants specifically designed for time series data.
Timescale released version 2.0 back in February and is now on a monthly cadence with several dot releases since then. If there is a common theme to the current releases, it is about scaling the platform out, supporting distributed deployment, and on the horizon, extending the platform to support analytics.
While analytics should be very useful for time series data use cases, most time series databases are not built for deep or complex analytics. Ironically, that's attributable to the magnitudes of raw data that pour in; most time series databases downsample (e.g., compress or archive) old data to keep storage costs in check. TimescaleDB's recently introduced features include the ability to analyze incoming real-time (uncompressed) data together with compressed historical data. More about that in a moment.
The highlight of the new features is support for distributed, multi-node deployment. To explain it, we need to dive under the covers into TimescaleDB's unique architecture. Many operational databases rely on sharding, where they distribute different parts of the same table across multiple nodes. That is how MongoDB, although not a relational database, scales out. But TimescaleDB relies on a slightly different construct, which it calls a "chunk."
A chunk is like an append-only database because, with time series data, the vast majority of activity consists of writes or inserts rather than updates. And the writes tend to arrive in consecutive time intervals, which contrasts with the more random distribution common in most transaction databases. For Timescale, a chunk is essentially a shard that also has multiple time slice partitions. When it's time to add a new time partition to the chunk, the system just adds it; there is no need to rebalance or reload the system because the new partition will be contiguous. And all these chunks are read as one unified logical table, although under the covers, it is heavily partitioned and sharded. A group of linked chunks in TimescaleDB is managed as a hypertable, which makes the whole assemblage look like a single table.
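To make the chunk idea concrete, here is a minimal Python sketch of time-based routing. It is a toy model and not TimescaleDB's implementation: the `ToyHypertable` class, its method names, and the one-day chunk interval are all hypothetical, chosen only to show how rows can land in per-interval chunks while reads still treat them as one table.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


class ToyHypertable:
    """Illustrative only: routes rows into time-interval chunks."""

    def __init__(self, chunk_interval: timedelta):
        self.chunk_interval = chunk_interval
        # Maps each chunk's start time to the rows stored in that chunk.
        self.chunks = defaultdict(list)

    def _chunk_start(self, ts: datetime) -> datetime:
        # Align the timestamp down to the start of its interval.
        n = (ts - EPOCH) // self.chunk_interval
        return EPOCH + n * self.chunk_interval

    def insert(self, ts: datetime, value: float) -> None:
        # A new chunk appears automatically when a new interval begins;
        # existing chunks are never rebalanced.
        self.chunks[self._chunk_start(ts)].append((ts, value))

    def select(self, start: datetime, end: datetime):
        """Return rows with start <= ts < end, reading across chunks."""
        rows = []
        for chunk_start in sorted(self.chunks):
            chunk_end = chunk_start + self.chunk_interval
            if chunk_start >= end or chunk_end <= start:
                continue  # this chunk cannot contain matching rows
            rows.extend(r for r in self.chunks[chunk_start]
                        if start <= r[0] < end)
        return sorted(rows, key=lambda r: r[0])


# Three days of hourly readings with a one-day chunk interval.
table = ToyHypertable(timedelta(days=1))
base = datetime(2021, 5, 1, tzinfo=timezone.utc)
for hour in range(72):
    table.insert(base + timedelta(hours=hour), float(hour))

# A query that straddles the boundary between the first two chunks.
window = table.select(base + timedelta(hours=20), base + timedelta(hours=28))
```

The caller never names a chunk: inserts pick the chunk from the timestamp, and the range query skips chunks whose interval cannot overlap the requested window.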
Until now, hypertables could only run on a single node. However, in the 2.x generation, hypertables can spread across multiple nodes with all their resident chunks. The result is that TimescaleDB, which until now could accommodate terabytes of data, can now scale into the petabyte range.
Now that Timescale can scale out, the next logical step will be adding the ability to replicate an entire cluster of tables for high availability and providing the means for rebalancing older chunks across the cluster. The rebalancing idea rests on the assumption that with time series data, only the most recent time slice partitions are active. Both features are on the roadmap.
Another recent enhancement is also related to scale. While time series databases are write-heavy, there is still the need to run queries. Until now, finding unique values has required a full scan of the index. Other relational databases, such as Oracle, MySQL, IBM Db2, and CockroachDB, already have features that allow scans to skip irrelevant values in composite indexes (i.e., indexes that sort on multiple columns). However, PostgreSQL has been missing that, so for now, TimescaleDB is adding its own skip scan feature. When and if the PostgreSQL community fills this gap, we'd expect that Timescale will probably adopt it.
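The idea behind a skip scan can be sketched in a few lines of Python. This is a toy illustration and not TimescaleDB's code: a sorted list of `(device, timestamp)` tuples stands in for a composite index, and the function names are hypothetical. Instead of visiting every entry, the skip scan uses a binary search to jump straight past all entries that share the current leading value.

```python
from bisect import bisect_right


def distinct_full_scan(index):
    """Find distinct leading values by visiting every index entry."""
    distinct, visited = [], 0
    for device, _ts in index:
        visited += 1
        if not distinct or distinct[-1] != device:
            distinct.append(device)
    return distinct, visited


def distinct_skip_scan(index):
    """Find distinct leading values by jumping over repeated ones."""
    distinct, probes, pos = [], 0, 0
    while pos < len(index):
        device = index[pos][0]
        distinct.append(device)
        # Binary-search for the first entry after this device's run.
        # (device, +inf) sorts after every (device, ts) in the index.
        pos = bisect_right(index, (device, float("inf")), lo=pos)
        probes += 1
    return distinct, probes


# A composite "index" on (device, ts): 3 devices, 1,000 readings each.
index = sorted((device, ts)
               for device in ("dev-a", "dev-b", "dev-c")
               for ts in range(1000))

full_result, full_visited = distinct_full_scan(index)
skip_result, skip_probes = distinct_skip_scan(index)
```

Both functions return the same distinct values, but the full scan touches all 3,000 entries while the skip scan performs one binary-search jump per distinct device. The gap widens as the number of rows per device grows, which is the typical shape of time series data.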
The latest crop of releases has also streamlined compression so that you can write directly to chunks that are already compressed. Like other time series databases, Timescale applies compression to older values. It does so through a columnar view that it introduced a couple of years ago.
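To see why a columnar layout helps compression, consider the rough Python sketch below. It is an assumption-laden illustration, not a description of Timescale's actual algorithms: the helper names are hypothetical, and delta and run-length encoding were picked here only because they are simple to show. Once rows are pivoted into columns, regularly spaced timestamps collapse into small repeated deltas, and a repeated device name collapses into a single run.

```python
def to_columnar(rows):
    """Pivot (ts, device, value) row tuples into per-column lists."""
    columns = {"ts": [], "device": [], "value": []}
    for ts, device, value in rows:
        columns["ts"].append(ts)
        columns["device"].append(device)
        columns["value"].append(value)
    return columns


def delta_encode(values):
    """Keep the first value, then store only successive differences."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running total."""
    out, total = [], 0
    for i, d in enumerate(deltas):
        total = d if i == 0 else total + d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Six readings from one sensor, taken every 10 seconds.
rows = [(1_600_000_000 + 10 * i, "sensor-1", 20 + i % 2) for i in range(6)]
cols = to_columnar(rows)

ts_deltas = delta_encode(cols["ts"])              # one large value, then 10s
device_runs = run_length_encode(cols["device"])   # a single run
restored_ts = delta_decode(ts_deltas)             # round-trips losslessly
```

In row form, each record repeats a full timestamp and device name. In column form, the same information reduces to one starting timestamp, a run of identical small deltas, and one device run, all of which are much easier to compress further.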
OK, let's pick up the analytics thread again. Timescale is announcing a new project to add an analytic engine. It will be managed separately from the existing TimescaleDB operational engine for an obvious reason: analytic queries consume resources differently from operational transactions. But at this point, analytics is still an aspiration; Timescale has reached out to the community for reaction and guidance. We hope that, unlike the approach taken by rivals such as InfluxData, the new engine will be based on the same underlying technology base as the existing one.