Streamlio, an open-core streaming data fabric for the cloud era

Apache Kafka replacement and beyond. This is open-core Streamlio's claim to fame, and today's announcement of a managed cloud service brings it one step closer to reality.


Brand new, you're retro. 

This aphorism from a Tricky song came to mind once more a couple of years back, when Streamlio came out of stealth. Streamlio is an offering for real-time data processing based on a number of Apache open source projects, and it competes directly with Confluent, whose offering is built around Apache Kafka. What's the point in doing that? 

Also: Processing time series data: What are the options?

In 2017, Apache Kafka was generally considered an early adopter thing: Present in many whiteboard architecture diagrams, but not necessarily widely adopted in production in enterprises. Since then, Kafka has laid a claim to enterprise adoption, and Confluent has acquired open-core unicorn status after its latest funding. This does not make things easier for the competition, obviously.

The question remains, then: Why would anybody do this, and how could it work? Streamlio's answer to the why seems to be that, despite being new to some, Kafka is retro. As to the how: Any offering seeking to position itself as a Kafka alternative would have to be substantially faster and more reliable, while also being Kafka-compatible and matching the options Kafka offers. 

Now, Streamlio is announcing a managed cloud service, bringing it closer to that vision. ZDNet discussed the vision and its execution with Karthik Ramasamy and Jon Bock, Streamlio's co-founder and CEO, and its VP of marketing, respectively.

Real time analytics

Ramasamy's bio includes over two decades of experience in real-time data processing, parallel databases, big data infrastructure, and networking. He was engineering manager and technical lead for real-time analytics at Twitter, where he co-created the Apache Heron real-time engine. 

Also: The past, present, and future of streaming

Ramasamy's co-founders are Matteo Merli, ex-Yahoo, architect and lead developer of Apache Pulsar and a PMC member of Apache BookKeeper, and Sanjeev Kulkarni, also a former Twitter technical lead for real-time analytics and a Heron co-creator.

The team certainly does not lack enterprise experience, and this is part of Streamlio's message. It also explains how Streamlio managed to secure Series A funding of $7.5 million from Lightspeed, which, as Ramasamy noted, has been involved in other open-core companies as well.

Ramasamy noted that Streamlio's headcount is below 100 people at this point. He also pointed out, however, that Apache Pulsar, which is at the core of Streamlio, has over 100 contributors and 3,000 stars on GitHub. The other two Apache projects on which Streamlio is based are Heron and BookKeeper.

Pulsar is the upper layer for Streamlio, and offers an API which is Kafka-compatible -- although there are nuances to this. There are architectural differences with Kafka, which, per the Streamlio team, boil down to the fact that Streamlio has a decoupled layer architecture. What we see as the core of this, especially when talking about running Streamlio in the cloud, is BookKeeper.

Book keeping and multi-temperature storage in the cloud

BookKeeper is the storage layer for Streamlio. It was designed with the capability to implement a form of what goes by the name of multi-temperature storage management. Hot data, or data that is recent/frequently used, is kept in faster storage media. Cold data, or data that is less recent/frequently used, is offloaded to slower secondary storage. 
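To make the hot/cold split concrete, here is a toy sketch of a multi-temperature offload policy. This is purely illustrative -- the class, function, and thresholds are invented and do not reflect BookKeeper's actual implementation -- but it captures the idea: recent data stays on fast storage within a budget, while aged-out segments become candidates for offload to cheaper storage.

```python
# Toy model of multi-temperature storage. All names and thresholds are
# hypothetical; BookKeeper's real segment/ledger management is more involved.
from dataclasses import dataclass

@dataclass
class Segment:
    segment_id: int
    size_bytes: int
    age_seconds: int
    tier: str = "hot"  # new segments start on fast local storage

def offload_cold_segments(segments, max_age_seconds=3600, keep_hot_bytes=10**9):
    """Mark segments for offload once they age out or exceed the hot-storage
    budget, keeping the most recent data on the fast tier."""
    hot_budget = keep_hot_bytes
    # Walk segments newest-first: they consume the hot budget while it lasts.
    for seg in sorted(segments, key=lambda s: s.age_seconds):
        if seg.age_seconds <= max_age_seconds and hot_budget >= seg.size_bytes:
            hot_budget -= seg.size_bytes
        else:
            seg.tier = "cold"  # candidate for offload to S3-like storage
    return [s for s in segments if s.tier == "cold"]
```

The design point is that the policy runs over immutable, closed segments, so data can move tiers without disturbing writers appending to the current segment.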

Also: Data, crystal balls, looking glasses, and boiling frogs

What makes this particularly relevant for Streamlio's managed cloud version on AWS is the fact that BookKeeper supports S3, AWS's object storage service. Streamlio's executives emphasized that other streaming platforms, such as Kafka, Flink, or Spark, do not have this capability built in.

pulsar-topic-segment-offload-s3.png

Apache Pulsar tiered storage, with offloading capabilities.

Kafka storage is centered around an append-only log abstraction, similar to BookKeeper. Flink uses RocksDB as a persistence layer, and Spark uses Parquet. While all of these can be configured to work with S3 in one way or another, Streamlio claims BookKeeper is faster and easier to use, without requiring special configuration and tuning.
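For a sense of what "without special configuration" means here: in Apache Pulsar, on which Streamlio builds, tiered storage to S3 is switched on with a handful of broker settings. A minimal sketch using Pulsar's documented configuration keys -- the bucket name is invented for illustration:

```properties
# broker.conf -- offload cold ledger segments from BookKeeper to S3
managedLedgerOffloadDriver=aws-s3
s3ManagedLedgerOffloadBucket=my-cold-data-bucket
s3ManagedLedgerOffloadRegion=us-west-2
```

Once configured, offloading can be triggered per topic, or happen automatically when a namespace crosses a size threshold set via `pulsar-admin namespaces set-offload-threshold`.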

BookKeeper is also used by Pravega, and since it seems to be a differentiation point for Streamlio, we wondered how feasible it would be for others to adopt and integrate BookKeeper as well. Ramasamy pointed out that this would require extensive redesign, and the fact that Streamlio offers an integrated stack on top of BookKeeper is part of its value-add proposition.

As is often the case with upstarts claiming superior performance, Streamlio published a benchmark, according to which Streamlio shows up to 150 percent improvement over Kafka in terms of throughput, while maintaining up to 60 percent lower latency. Streamlio's pricing for its AWS managed version is based on throughput, although it was noted that AWS pricing based on instance capabilities also applies.

Zookeeper and SQL in the cloud

Streamlio also uses Apache Zookeeper, which is often considered legacy and a potential single point of failure, and which is typically used to manage Hadoop clusters on-premises. Using Zookeeper in AWS did not seem to make much sense to us, so we wondered what the rationale was. Ramasamy said that Zookeeper is not used to manage Streamlio, only to serve metadata. He went on to add that Zookeeper is "invisible," and that Streamlio's cloud service is container-based.

Also: Real-time data processing just got more options

Streamlio also features a number of other interesting architectural choices, including its support for serverless functions, and SQL. The latter is implemented using Presto, the SQL engine open-sourced by Facebook. This, in turn, has some interesting implications.

On the one hand, it means Streamlio benefits from the fact that Presto was designed to support standard ANSI SQL semantics, and that it can integrate other sources as well. So, via Presto, Streamlio users can do things such as joining data in Streamlio with external tables, or using BI tools on top of Presto. On the other hand, this design means that queries are not really run on the incoming streaming data in real time.
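As an illustration of the kind of federated query this enables, Pulsar's Presto integration exposes topics as tables under a `pulsar` catalog. A hedged sketch -- the topic, external catalog, and column names are all invented:

```sql
-- Join messages from a (hypothetical) Pulsar topic with an external table.
SELECT o.order_id, o.amount, c.region
FROM pulsar."public/default"."orders" AS o
JOIN postgresql.public.customers AS c
  ON o.customer_id = c.customer_id
WHERE o.__publish_time__ > timestamp '2019-02-01 00:00:00';
```

The `__publish_time__` column is one of the metadata columns Pulsar SQL attaches to each message; the Postgres side would come from a separate Presto connector.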

streamlioarchitecture.jpg

Streamlio's architecture.

When discussing this, Ramasamy said it was a conscious choice, and that it has to do with the overall vision for Streamlio. For Ramasamy, streaming platforms are not meant to replace databases. What he sees as the end goal, however, goes beyond being able to ingest data and dispatch it to the right recipients. Be it via pub-sub messaging or queuing, Streamlio wants to enable its users to run quick analytics over incoming data.

For more in-depth analysis, however, Ramasamy would rather defer to offerings specifically designed for this. What he sees as the role of Streamlio is to act as the data fabric to facilitate data movement, wherever that data may originate from, or be directed to: The edge, the cloud, or the datacenter.

Streamlio's positioning and strategy

That seems like a well-directed vision for Streamlio. The cloud is here to stay, but on-premise data centers are not going away either, and applications on the edge also need to communicate their data. The million-dollar question is: Why pick Streamlio over a number of alternatives? All data streaming platforms want to play this role, and each of them has some things going for it. 

Also: Apache Arrow: The little data accelerator that could 

Streamlio, as opposed to Kafka, Spark or Flink, does look like an early adopter thing at this point. Although there really seem to be technical benefits to Streamlio's architecture, the reality is the competition is ahead in terms of maturity, adoption, funding, and mindshare. But that's not to say Streamlio is a lost cause, or that nobody is using it -- far from it.

Besides being used in production at Yahoo and Twitter, Streamlio has adopters such as Zhaopin (a Monster.com company in China) and STICorp to show for it. STICorp actually used Streamlio to replace Kafka, although it's worth noting here that Ramasamy pointed out Streamlio is not a drop-in replacement for Kafka. 

fancycrave-224908-unsplash.jpg

A data fabric is a metaphor used to denote a layer weaving data from disparate sources together.

(Image: Fancycrave on Unsplash)

There is API compatibility, but the way it works is by passing code utilizing Kafka API calls through a tool which replaces them with corresponding Streamlio API calls. Ramasamy noted that this guarantees functional equivalence, but it does not mean there is 100 percent correspondence between Kafka and Streamlio APIs, as they reflect different underlying models. Streamlio also noted that there is a prototype integration with Apache Beam, which they will develop further if there is sufficient customer interest.
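For a sense of what API compatibility looks like in practice: Apache Pulsar itself ships a Kafka compatibility wrapper, `pulsar-client-kafka`, which replaces the Kafka client jar so that code written against the Kafka producer/consumer API can run against Pulsar largely unchanged (Streamlio's translation tool is a separate mechanism). A sketch of the dependency swap -- the version number here is hypothetical:

```xml
<!-- Swap the kafka-clients dependency for Pulsar's Kafka-compatible wrapper -->
<dependency>
  <groupId>org.apache.pulsar</groupId>
  <artifactId>pulsar-client-kafka</artifactId>
  <version>2.3.0</version>
</dependency>
```

The application still instantiates `org.apache.kafka.clients.producer.KafkaProducer`, but points `bootstrap.servers` at a `pulsar://` service URL.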

A broader point to make here, drawing on the comparison between Confluent and Streamlio, is that of doing business with open source, especially in light of AWS's fork of Elasticsearch, the latest episode in an ongoing escalation between open source enterprise vendors and AWS. If Streamlio is as successful as others in the market, would it not be yet another target for AWS appropriation? How would it respond to that?

Ramasamy thinks 2019 will mark the decline of open source support as a business model, and the rapid rise of open-source SaaS as a growth market and key business model for open source overall. He predicts we'll see vendors seeking to compete and differentiate on their ability to provide the best possible software-as-a-service -- but leveraging open source technology instead of a proprietary offering: 

"We'll see [vendors] work to provide value-added flexibility, elasticity and performance specific to cloud and SaaS environments in order to deliver what customers increasingly see as the most important value-add: Ensuring that customers can focus on building their applications, and spend less time on care and feeding of the underlying technology that those applications use."

That seems to be reflected in Streamlio's strategy, too: Take open-source components, integrate them, extend them, and build a commercial offering on top of them. Whether that is the end-all of open source is a different discussion. But it is what Streamlio is betting on.
