Processing time series data: What are the options?
Get your data from everywhere you can, anytime you can, they said, so you did. Now, you have a series of data points through time (a time series) in your hands, and you don't know what to do with it? Worry not, because there's a bunch of options.
A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
That's how Wikipedia defines time series, and by that definition, most data starts looking like time series. That's why time series data processing is important, and will become even more important going forward: If you keep recording values for the same thing, time after time, what you have is a time series.
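To make the definition concrete, here is a minimal sketch in Python of a time series as a list of (timestamp, value) pairs, with a check for the "equally spaced" property. The helper name `is_equally_spaced` and the sample values are made up for illustration.

```python
from datetime import datetime, timedelta

def is_equally_spaced(timestamps):
    """Return True if consecutive timestamps are all the same interval apart."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1

# Hypothetical daily closing values, one data point per day.
start = datetime(2018, 1, 1)
series = [(start + timedelta(days=i), value)
          for i, value in enumerate([100.0, 101.5, 99.8, 102.3])]

timestamps = [ts for ts, _ in series]
regular = is_equally_spaced(timestamps)

# Skipping a day breaks the equal spacing.
irregular = is_equally_spaced([start, start + timedelta(days=1), start + timedelta(days=3)])
```

Recording one more value for the same thing simply appends another pair to the list, which is why so much data ends up looking like a time series.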
Streaming frameworks, the cloud, and time series databases
The cloud is not the only option, however. Time series databases are another one, and they can also be used in the cloud. This is a class of database solutions designed to handle the storage and processing of time series data.
Navdeep Sidhu, InfluxData head of product marketing, is very encouraged by what he has seen from Google's offering:
"We are as excited as they are in seeing the platform get adopted and how it evolves as real usage patterns emerge. Google's market presence and technical acumen will ensure that this platform will be widely used.
We think that having a strong data storage and analytics layer that is designed for IoT sensor data ingestion, real-time analytics, and insight is a key component of any IoT platform."
James Corcoran, SVP of products, solutions and innovation at Kx, the vendor behind kdb+, thinks it's too early to comment on Google's announcement, but will be following it with great interest.
Ajay Kulkarni, CEO and co-founder at TimescaleDB, said he loves innovation in data analytics, and is glad that Google is taking time-series data seriously:
"We'd agree that building a system that can scale is challenging, and that data analysis stacks have gotten so complex that simplifying them is a good thing.
That said, the offering still feels very early. I believe their only quote is from an engineer who says it looks 'promising?' Aside from maturity, something else the project seems to lack is a real query language. What no one wants is yet-another-query-language to learn. Which is why the data analysis industry is starting to re-standardize back on SQL."
Corcoran stressed performance and scalability:
"Time-series data tends to be big, so performance and scalability are crucial. The key requirements for working with time-series data are the abilities to analyze and aggregate the data very, very quickly.
kdb+, with a built in high performance programming language called q, is uniquely positioned to work effectively with time-series data. kdb+, and our Kx product suite built on kdb+, have been technologies of choice for the financial services industry for large-scale, critical trading applications and research applications for over 20 years."
Kulkarni emphasized scale, performance, reliability, ease-of-use, and SQL:
"TimescaleDB scales to 100TB with performant queries (i.e., queries that can power a real-time dashboard). It inherits the reliability and ease-of-use of PostgreSQL. And is still the only open source time-series database to support full SQL, which is important not just [for] the end user, but also for that user to share data across the organization."
Sidhu believes that there are three main requirements for the data processing platform for IoT:
"First, it should be designed for real time. IoT and sensor data is mercilessly real-time and high volume. The platform needs to provide functionality to identify patterns, predict the future, control systems, and get the insights on this streaming data to provide business value in real time.
Data must be available and queryable as soon as it is written, allowing for the building of self-healing and dynamic lights-off automation.
Second, it should be biased for action. Basic monitoring is too passive for IoT, which requires the right kind of data to give you proper observability into your systems. You can't manage what you don't understand, and the combination of the right time series data and the advances in machine learning and analytics make automation and self-regulating actions a reality.
An IoT system must be able to trigger actions, perform automated control functions, be self-regulating, and provide the basis for performing actions based on predictive trends.
Third, it should be scalable. The world demands systems that are available 24x7x365 and can automatically scale up and down depending on demand. They must be able to be deployed across different infrastructures without undue complexity.
They need to make optimal use of resources, for instance keeping only what is needed in memory, compressing data on disk when necessary, and moving less relevant data to cold storage for later analysis. They need to deal with millions of data points per second."
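The resource-management idea in Sidhu's third point (keep recent data at hand, store older data more cheaply) can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how InfluxDB or any other product mentioned here implements it. The function names `downsample` and `tier`, the window sizes, and the sensor readings are all hypothetical.

```python
from collections import defaultdict

def downsample(points, window_seconds):
    """Average (timestamp, value) points into fixed-size time windows."""
    buckets = defaultdict(list)
    for ts, value in points:
        window_start = ts - (ts % window_seconds)
        buckets[window_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

def tier(points, now, hot_seconds, window_seconds):
    """Keep recent points raw ("hot") and downsample older ones ("cold")."""
    cutoff = now - hot_seconds
    hot = [(ts, v) for ts, v in points if ts >= cutoff]
    cold = downsample([(ts, v) for ts, v in points if ts < cutoff], window_seconds)
    return hot, cold

# Hypothetical sensor readings: (unix_seconds, value), one every 10 seconds.
readings = [(t, float(t % 7)) for t in range(0, 120, 10)]
hot, cold = tier(readings, now=120, hot_seconds=60, window_seconds=30)
```

Here the last 60 seconds stay at full resolution, while everything older is reduced to one average per 30-second window, trading detail for storage.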
Time series databases and the world: integration and features
What about other options for time series processing? Corcoran said they have seen a lot of technologies come and go in recent years including NoSQL and Hadoop-based applications, but most of these solutions perform poorly with time-series data at scale.
Kulkarni also conceded there are many options today to store time-series data. Some of them, like data warehouses and lakes, he said, are built for scale but at the cost of performance. Others, he added, build for scale but sacrifice reliability or ease-of-use to get there.
Sidhu noted they have seen implementations on SQL and NoSQL data stores, such as Cassandra, MongoDB, and HDFS. But he went on to add they are all too general-purpose to handle the unique requirements of today's new type of high-volume, streaming data emitted from sensors.
Where opinions part ways is on query language, an important feature for any database. While Corcoran noted how the kdb+ programming language allows users to perform powerful analysis without having to write a lot of code, Kulkarni emphasized support for geo-spatial data and SQL. InfluxDB has its own query language, InfluxQL.
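To give a sense of what the query-language discussion is about, here is a sketch of a typical time series query, averaging readings per time bucket, written in plain SQL. It runs against SQLite through Python's standard `sqlite3` module purely as a stand-in. It is not TimescaleDB, q, or InfluxQL syntax, and the `readings` table and its contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, sensor TEXT, value REAL)")

# Hypothetical data: ts is seconds since the start of the recording.
rows = [(0, "a", 1.0), (30, "a", 3.0), (60, "a", 5.0), (90, "a", 7.0), (45, "b", 10.0)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# Group readings into 60-second buckets and average them per sensor.
# Integer division truncates, so (ts / 60) * 60 is the start of the bucket.
query = """
    SELECT (ts / 60) * 60 AS bucket, sensor, AVG(value) AS avg_value
    FROM readings
    GROUP BY bucket, sensor
    ORDER BY bucket, sensor
"""
result = conn.execute(query).fetchall()
conn.close()
```

Anyone who already knows SQL can read and adapt a query like this, which is the core of the argument for SQL support. The argument for a dedicated language is that such operations can be expressed more tersely.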
Another important point is integration, along with out-of-the-box support for features that help build applications, such as anomaly detection. Kulkarni noted that TimescaleDB looks like PostgreSQL on the outside, but is architected for time-series on the inside:
"This means that anything that works with PostgreSQL will work with TimescaleDB out of the box. This includes connectors for Apache Kafka, Apache Spark, Tableau, and many more. Because using and operating TimescaleDB is just like PostgreSQL, it's easy to build a variety of applications on top."
Corcoran noted kdb+ has open-source interfaces and plugins for most commonly used messaging solutions, including Kafka and Spark, and also offers drivers for popular statistics and modeling products such as R, Matlab, and Python:
"Kdb+ is known for its capability to capture, analyze, and store high frequency time series data, for example from thousands of IoT sensors, running algorithms in real-time in order to compare streaming data with historical snapshots for anomaly detection."
Sidhu mentioned Telegraf, InfluxDB's open-source plugin technology which can source metrics and events from more than 200 types of endpoints: "DBs, logs, network stats, system stats, etc. It easily plugs into Kafka- and Spark-based sources, as well as streams data into InfluxDB for ingestion and further analytics and alerting," Sidhu said.
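The anomaly detection use case Corcoran describes, comparing streaming data with historical snapshots, can be illustrated with a very simple technique: flag any incoming value that is too many standard deviations away from the historical mean. The Python sketch below is an assumption-laden illustration, not the algorithm used by kdb+ or any other product. The function name `find_anomalies`, the threshold of 3, and the temperature readings are hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(history, stream, threshold=3.0):
    """Flag streamed values more than `threshold` standard deviations from the historical mean."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value counts as anomalous.
        return [(i, value) for i, value in enumerate(stream) if value != baseline]
    return [(i, value) for i, value in enumerate(stream)
            if abs(value - baseline) / spread > threshold]

# Hypothetical temperature readings from a single sensor.
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
stream = [20.0, 20.4, 27.5, 19.9]
anomalies = find_anomalies(history, stream)
```

With this data only the third streamed reading, 27.5, is flagged. Real systems typically use rolling baselines and more robust statistics, but the principle of comparing new data against history is the same.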
The future of time series databases
That's all fine and well, but if time-series storage and processing is so important, this also raises the question: Do time-series processing systems have a future of their own, or will they eventually become part of the offering of all databases and data processing systems, as we move toward real-time applications?
"As we move toward more real-time systems, time-series processing will become more mainstream, and more central to applications. Having the ability to combine time-series data with other types of data will be vital," Corcoran said, when asked.
Sidhu pointed to the uptick in interest on DB-Engines to suggest that time series databases are here to stay and will gain in popularity:
"This is driven by the move to instrumentation in the physical and virtual world. History is ripe with examples of new technologies and platforms being created due to changing workloads.
Traditional databases have yet to be adapted to properly support time series data at the core. Adding time-stamped data support to existing platforms will never provide the scalability and ease-of-use required for these new applications."
Kulkarni believes that all data is fundamentally time-series data, and that the database and data processing market will eventually get absorbed by time-series analysis tools:
"This may seem crazy at first, but if you think about it, every datapoint has a timestamp and analyzing data across those timestamps lets you see how your data is changing. In other words, time-series is the highest fidelity of data one can capture. So, if you're not storing your data in its raw time-series format, you're throwing valuable information away."
This is a bold statement indeed. On our part, let us note that only a few entries in the list of time series databases have commercial vendors and support behind them. Many of them are open source projects.
While oftentimes these projects are the result of years of development, the fact that the majority does not seem to have commercial entities behind them may be an indicator as to the margins this market has for independent growth. In any case, time series processing is here to stay. How exactly will it unfold? Only time will tell.