Yahoo's Bullet looks ahead in querying streaming data

Yahoo is open sourcing a new query engine for streaming data that eliminates the need to cache data, and therefore can "look ahead" in querying data as it flows through.
Written by Tony Baer (dbInsight), Contributor

A few months back, we posed the question of whether the world needs another streaming engine. Now we'll extend that question to querying. Virtually every streaming engine has a way to submit queries - otherwise, why would you need a streaming engine? Although streaming engines offer the promise of fresh real-time data, the ugly truth is that they must cache data first. That means that most streaming query engines must look back at data that has already been collected.

As it tracked user engagement across its Internet properties, Yahoo was seeking a more lightweight means for performing the fairly rudimentary queries that are typically thrown at data in motion, such as counts, averages, ranks, and distributions. And given the traffic levels, it sought a means for validating the sensors and instrumentation that pick up these counts. In fact, validation became the primary use case for this new project.

The result is the Bullet project, which Yahoo just open sourced on GitHub. Bullet is a highly distributed framework designed for cloud multi-tenant data centers that lets you run "forward-looking" queries. Bullet queries act on data flowing through the system after you submit the query. In other words, you query data that will arrive, rather than data that has already arrived. Unusual for an open source project, Bullet also includes a visual user interface, so you're not necessarily restricted to command line or third-party tools. And it also has a REST API for programmatic access.
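To make the "forward-looking" idea concrete, here is a minimal sketch in Python. It is not Bullet's code or API: `ForwardQuery` and `ToyStream` are hypothetical names invented for illustration, and the only assumption carried over from the description above is that a query sees records that arrive after it is submitted and nothing that came before.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class ForwardQuery:
    """Hypothetical query that only sees records arriving after submission."""
    predicate: Callable[[Record], bool]
    max_records: int
    results: List[Record] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.results) >= self.max_records


class ToyStream:
    """Keeps no history: each record is offered to live queries, then dropped."""

    def __init__(self) -> None:
        self.queries: List[ForwardQuery] = []

    def submit(self, query: ForwardQuery) -> ForwardQuery:
        self.queries.append(query)
        return query

    def push(self, record: Record) -> None:
        for query in self.queries:
            if not query.done and query.predicate(record):
                query.results.append(record)
        # Retire queries that have collected everything they asked for.
        self.queries = [q for q in self.queries if not q.done]


stream = ToyStream()
stream.push({"page": "home", "ms": 120})  # arrives before any query: never seen
q = stream.submit(ForwardQuery(lambda r: r["page"] == "home", max_records=2))
stream.push({"page": "home", "ms": 95})
stream.push({"page": "news", "ms": 40})   # does not match the filter
stream.push({"page": "home", "ms": 130})  # second match: query completes
stream.push({"page": "home", "ms": 77})   # query already retired
print([r["ms"] for r in q.results])
```

The first record is lost to the query because nothing is cached; only the two matching records that arrive after submission are returned.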

As a query engine, Bullet was designed to be lightweight, adding minimal overhead as you process streams. But there is some heavy lifting involved in that the raw data, formatted as Avro files, must be parsed into columns that can then be hit with SQL queries that are placed over sliding time windows.

Bullet can fetch individual records and perform aggregations such as group-bys, sums, counts, rankings, and averages. And, for use cases where you want to calibrate the software instrumentation, it can generate histograms showing the distributions of actual data values. And if the torrent of data is too large for available memory, you can sample using the DataSketches library that Yahoo developed. DataSketches provide the closest thing to persistence in Bullet, in that they cache results (but not the raw data).
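The kinds of aggregations described here can be computed in a single pass that keeps only running results and never the raw records. The sketch below is a plain-Python illustration under that assumption; it uses exact counters rather than the DataSketches library, and the `aggregate` function, its parameters, and the sample field names are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Record = Dict[str, object]


def aggregate(
    records: Iterable[Record],
    group_key: str,
    value_key: str,
    bucket_width: int,
) -> Tuple[Dict[object, int], Dict[object, float], Dict[int, int]]:
    """One pass over a stream: per-group counts and averages, plus a histogram."""
    counts: Counter = Counter()
    sums: defaultdict = defaultdict(float)
    histogram: Counter = Counter()
    for rec in records:
        group = rec[group_key]
        value = float(rec[value_key])
        counts[group] += 1
        sums[group] += value
        # Bucket label is the lower edge of the fixed-width bucket.
        histogram[int(value // bucket_width) * bucket_width] += 1
    averages = {g: sums[g] / counts[g] for g in counts}
    return dict(counts), averages, dict(sorted(histogram.items()))


events = [
    {"page": "home", "ms": 95},
    {"page": "news", "ms": 40},
    {"page": "home", "ms": 130},
    {"page": "home", "ms": 75},
]
# iter() stands in for a live stream that can be read only once.
counts, averages, histogram = aggregate(iter(events), "page", "ms", bucket_width=50)
print(counts, averages, histogram)
```

Exact counters like these grow with the number of distinct groups; an approximate sketch trades a small, bounded error for fixed memory, which is the role the article attributes to DataSketches.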

As Yahoo uses Storm, it's not surprising that Bullet has been optimized for that engine. But it can also read from Kafka or Flume. It probably wouldn't add much value to engines like Spark Streaming that are limited to microbatching.

While Bullet today is limited to a handful of SQL queries on live, time-windowed data, one of the next features that will be added is the ability to stream incremental results to a client application via the REST API.

For now, Bullet is early-stage technology, available as open source through GitHub. There's no vendor support and it's not part of any tool, so you're on your own with regard to managing and integrating it. Bullet competes in a very crowded landscape of log monitoring engines such as Splunk, Logstash/Elasticsearch, and others that provide near real-time capabilities. The challenge in gaining mindshare is proving the case that forward-looking queries provide the edge in knowing your customers through the digital log file footprints they leave.
