Processing time series data: What are the options?

Get your data from everywhere you can, anytime you can, they said, so you did. Now you have a series of data points through time (a time series) in your hands, and you don't know what to do with it? Worry not, because there are plenty of options.

Google does not always get things right, or get to things first. But when Google sets its sights on something, you know that something is about to attract interest. With Google having just announced its Cloud Inference API to uncover insights from time series data, it's a good time to check the options for processing time series data.

A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data. Examples of time series are heights of ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.
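To make the definition concrete, here is a minimal Python sketch that represents a time series as a list of (timestamp, value) pairs and checks whether the points are equally spaced. The function name and the tide readings are invented for illustration and do not come from any of the products discussed in this article.

```python
from datetime import datetime, timedelta


def is_equally_spaced(points):
    """Return True if consecutive timestamps are all the same interval apart."""
    times = [timestamp for timestamp, _ in points]
    # Collect the distinct gaps between neighbouring timestamps.
    gaps = {later - earlier for earlier, later in zip(times, times[1:])}
    return len(gaps) <= 1


# Hypothetical hourly tide heights in meters, made up for this example.
start = datetime(2018, 9, 27, 0, 0)
heights = [1.2, 1.6, 1.9, 1.7]
series = [(start + timedelta(hours=i), h) for i, h in enumerate(heights)]

print(is_equally_spaced(series))
```

Dropping one reading from the middle of the list leaves gaps of different lengths, so the same check would then return False.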

That's how Wikipedia defines time series, and by that definition, most data starts looking like time series. That's why time series data processing is important, and will become even more important going forward: If you keep recording values for the same thing, time after time, what you have is a time series.

Streaming frameworks, the cloud, and time series databases

If that sounds familiar, it's because real-time applications are the premise behind something we have been covering a lot: frameworks for streaming, real-time data processing. If you want to ingest data in real time, and apply transformations and rules to process it on the fly, streaming frameworks can help.

And with ACID capabilities for streaming having just been added, this becomes a viable alternative to traditional databases. But even though streaming is gaining adoption, not everyone has stream processing in place, or is ready to adopt it just yet. As even the leaders in streaming point out, this requires a change of mindset and software infrastructure.

So, if you have your time series data in place somehow, and you are looking to analyze it to gain insights a posteriori, how can you do this besides streaming frameworks?

With the cloud becoming the de facto storage for a big part of newly produced time series data, having a way to process that data in the cloud where it lives would come in handy. This explains Google's latest announcement, as well as the fact that both AWS and Microsoft Azure have their own offerings there.

The cloud is not the only option, however. Time series databases are another, and they can also be used in the cloud. This is a class of database solutions designed to handle storage and processing of time series data.

There are many alternatives to choose from, though not all of those are custom built to handle time series. A couple of the top ones responded to ZDNet's request for comment on the state of the union on time series processing.

Like all data, time series data live in the cloud these days. Image: maxsattana, Getty Images/iStockphoto

Navdeep Sidhu, InfluxData head of product marketing, is very encouraged by what we have seen from Google's offering:

"We are as excited as they are in seeing the platform get adopted and how it evolves as real usage patterns emerge. Google's market presence and technical acumen will ensure that this platform will be widely used.

We think that having a strong data storage and analytics layer that is designed for IoT sensor data ingestion, real-time analytics, and insight is a key component of any IoT platform."

James Corcoran, SVP of products, solutions and innovation at Kx, the vendor behind kdb+, thinks it's too early to comment on Google's announcement, but will be following it with great interest.

Ajay Kulkarni, CEO and co-founder at TimescaleDB, said he loves innovation in data analytics, and is glad that Google is taking time-series data seriously:

"We'd agree that building a system that can scale is challenging, and that data analysis stacks have gotten so complex that simplifying them is a good thing.

That said, the offering still feels very early. I believe their only quote is from an engineer who says it looks 'promising?' Aside from maturity, something else the project seems to lack is a real query language. What no one wants is yet-another-query-language to learn. Which is why the data analysis industry is starting to re-standardize back on SQL."

Key requirements for time series processing

But what are some key requirements for time series data processing? By its nature, time-series data is always being appended to, so it is really important that a technical solution is able to handle a combination of streaming, real-time and historical data, said Corcoran:

"Time-series data tends to be big, so performance and scalability are crucial. The key requirements for working with time-series data are the abilities to analyze and aggregate the data very, very quickly.

kdb+, with a built in high performance programming language called q, is uniquely positioned to work effectively with time-series data. kdb+, and our Kx product suite built on kdb+, have been technologies of choice for the financial services industry for large-scale, critical trading applications and research applications for over 20 years."

Kulkarni emphasized scale, performance, reliability, ease-of-use, and SQL:

"TimescaleDB scales to 100TB with performant queries (i.e., queries that can power a real-time dashboard). It inherits the reliability and ease-of-use of PostgreSQL. And is still the only open source time-series database to support full SQL, which is important not just [for] the end user, but also for that user to share data across the organization."

Integration and out-of-the-box support for features to build applications on are some key requirements for time series processing. Image: Getty Images/iStockphoto

Sidhu believes that there are three main requirements for the data processing platform for IoT:

"First, it should be designed for real time. IoT and sensor data is mercilessly real-time and high volume. The platform needs to provide functionality to identify patterns, predict the future, control systems, and get the insights on this streaming data to provide business value in real time.

Data must be available and queryable as soon as it is written, allowing for the building of self-healing and dynamic lights-off automation.

Second, it should be biased for action. Basic monitoring is too passive for IoT, which requires the right kind of data to give you proper observability into your systems. You can't manage what you don't understand, and the combination of the right time series data and the advances in machine learning and analytics make automation and self-regulating actions a reality.

An IoT system must be able to trigger actions, perform automated control functions, be self-regulating, and provide the basis for performing actions based on predictive trends.

Third, it should be scalable. The world demands systems that are available 24x7x365 and can automatically scale up and down depending on demand. They must be able to be deployed across different infrastructures without undue complexity.

They need to make optimal use of resources, for instance keeping only what is needed in memory, compressing data on disk when necessary, and moving less relevant data to cold storage for later analysis. They need to deal with millions of data points per second."
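One common way to keep less detail for older data is to downsample it into fixed time windows before it is archived. The following Python sketch illustrates that general idea only; it is not how InfluxDB or any other product mentioned here implements it, and the function name, the window size, and the sample readings are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_seconds):
    """Average (epoch_seconds, value) points into fixed-width time windows."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of the window it falls in.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical sensor readings taken every 20 seconds (invented values).
readings = [(0, 20.0), (20, 22.0), (40, 21.0), (60, 30.0), (80, 32.0)]
per_minute = downsample(readings, 60)

print(per_minute)
```

Here five raw readings collapse into two one-minute averages. The trade-off is that the individual readings are lost, which is why raw data is often moved to cold storage rather than discarded.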

Time series database and the world: integration and features

What about other options for time series processing? Corcoran said they have seen a lot of technologies come and go in recent years including NoSQL and Hadoop-based applications, but most of these solutions perform poorly with time-series data at scale.

Kulkarni also conceded there are many options today to store time-series data. Some of them, like data warehouses and lakes, he said, are built for scale but at the cost of performance. Others, he added, build for scale but sacrifice reliability or ease-of-use to get there.

Sidhu noted they have seen implementations on SQL and NoSQL data stores, such as Cassandra, MongoDB, and HDFS. But he went on to add they are all too general-purpose to handle the unique requirements of today's new type of high-volume, streaming data emitted from sensors.

Where opinions part ways is on query language, an important feature for any database. While Corcoran noted how q, the programming language built into kdb+, allows users to perform powerful analysis without having to write a lot of code, Kulkarni emphasized support for geo-spatial data and SQL. InfluxDB has its own query language, InfluxQL.

Like in any other database, query language is an important aspect of time series databases.

Another important point is integration, and out-of-the-box support for features that help build applications, such as anomaly detection. Kulkarni noted that TimescaleDB looks like PostgreSQL on the outside, but is architected for time-series on the inside:

"This means that anything that works with PostgreSQL will work with TimescaleDB out of the box. This includes connectors for Apache Kafka, Apache Spark, Tableau, and many more. Because using and operating TimescaleDB is just like PostgreSQL, it's easy to build a variety of applications on top."

Corcoran noted kdb+ has open-source interfaces and plugins for most commonly used messaging solutions, including Kafka and Spark, and also offers drivers for popular statistics and modeling products such as R, Matlab, and Python:

"Kdb+ is known for its capability to capture, analyze, and store high frequency time series data, for example from thousands of IoT sensors, running algorithms in real-time in order to compare streaming data with historical snapshots for anomaly detection."
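As a rough illustration of comparing incoming data with a historical baseline, the Python sketch below flags values that sit more than a chosen number of standard deviations away from the historical mean. This is a simple z-score check chosen for brevity, not the algorithm used by kdb+ or any other vendor quoted here. The function name, the threshold, and the sample values are all assumptions.

```python
from statistics import mean, stdev


def find_anomalies(history, incoming, threshold=3.0):
    """Return incoming values that deviate strongly from the historical mean.

    `history` needs at least two values so a standard deviation exists.
    """
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: anything different from it is unusual.
        return [value for value in incoming if value != baseline]
    return [
        value for value in incoming
        if abs(value - baseline) / spread > threshold
    ]


# Hypothetical historical snapshot and newly streamed readings (invented).
history = [10.0, 10.5, 9.5, 10.2, 9.8]
incoming = [10.1, 9.9, 14.0]

print(find_anomalies(history, incoming))
```

A production system would typically use a rolling or seasonal baseline rather than a single fixed snapshot, but the comparison step follows the same pattern.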

Sidhu mentioned Telegraf, InfluxDB's open-source plugin technology, which can source metrics and events from more than 200 types of endpoints: "DBs, logs, network stats, system stats, etc. It easily plugs into Kafka- and Spark-based sources, as well as streams data into InfluxDB for ingestion and further analytics and alerting," Sidhu said.

The future of time series databases

That's all well and good, but if time-series storage and processing is so important, this also raises the question: Do time-series processing systems have a future of their own, or will they eventually become part of the offering of all databases and data processing systems, as we move toward real-time applications?

In other words, will time-series databases eventually be absorbed by other vendors, as our ZDNet co-contributor Tony Baer has predicted will happen with GPU databases?

"As we move toward more real-time systems, time-series processing will become more mainstream, and more central to applications. Having the ability to combine time-series data with other types of data will be vital," Corcoran said, when asked.

Time series databases are gaining momentum. But how many of them can have a future of their own? Image: DB-Engines

Sidhu pointed to the uptick in interest on DB-Engines to suggest that time series databases are here to stay and will gain in popularity:

"This is driven by the move to instrumentation in the physical and virtual world. History is rife with examples of new technologies and platforms being created due to changing workloads.

Traditional databases have yet to be adapted to properly support time series data at the core. Adding time-stamped data support to existing platforms will never provide the scalability and ease-of-use required for these new applications."

Kulkarni believes that all data is fundamentally time-series data, and that the database and data processing market will eventually get absorbed by time-series analysis tools:

"This may seem crazy at first, but if you think about it, every datapoint has a timestamp and analyzing data across those timestamps lets you see how your data is changing. In other words, time-series is the highest fidelity of data one can capture. So, if you're not storing your data in its raw time-series format, you're throwing valuable information away."

This is a bold statement indeed. For our part, let us note that only a few entries in the list of time series databases have commercial vendors and support behind them. Many of them are open source projects.

While these projects are often the result of years of development, the fact that the majority do not seem to have commercial entities behind them may indicate how much room this market has for independent growth. In any case, time series processing is here to stay. How exactly will it unfold? Only time will tell.
