Apache Flink takes ACID

With some of its financial services clients demanding real-time risk management capabilities, Data Artisans has brought ACID transactions to Flink. While it certainly won't replace relational databases, adding ACID at the distributed streaming tier will shift some of the transaction burdens for applications processing real-time data from multiple sources.


With streaming engines popping up right and left, we've been wondering whether the world really needs more streaming engines. When we popped the question last year about Apache Flink, the answer was, it pioneered stateful processing in the open source streaming world. That means that you can checkpoint a Flink stream without having to resort to the underlying database. Flink is no longer alone here; Spark's addition of continuous processing in the 2.3 version to some extent levelled the playing field, as Structured Streaming can also run streams (in addition to microbatches) and operate in stateful mode.

So Data Artisans, the company whose founders created Flink, have taken the next step by taking ACID. The trigger was demand by Data Artisans' capital markets customers who are trying to nail down real-time reporting on their risk positions throughout the trading day.

With the existing capabilities of Flink (and now Spark Streaming), you could insert checkpoints and watermarks on a single stream, shard, or key to get exact point-in-time snapshots on risk exposures, and as long as you don't have IOPS bottlenecks, couple it with a back-end database.

Given that no global investment bank works just with a single event feed or shard, the best advice we'd give in that situation is good luck with that. Just taking the example of transferring funds from one account to another breaks that world of simplicity as you are dealing with separate keys identifying each account. That's a trivial issue that ATM networks solved some time ago, but in the trading world, the latencies are much shorter. Communications between the shards (where each key is stored) must be virtually instantaneous.

Surprisingly, estimating real-time risk positions is more art than science, with the only certainly coming after the end of a trading period or arbitrary block of time when trading positions and risk can be pinned down for the record.

At the Flink Forward conference in Berlin this week, Data Artisans is introducing the Streaming Ledger, which is implemented as a library on Flink. It is available as an API that can be downloaded from GitHub for a single stream, with the "runner" for multiple parallel streams licensed as the commercial product. It extends Apache Flink with the ability to perform serializable transactions from multiple streams across shared tables, and multiple rows of each table. Data Artisans likens it as the streaming equivalent of performing multi-row transactions on one or more key/value databases.

Unlike distributed databases, the Streaming Ledger does not use normal database locks or approaches like multi-version concurrency control (MVCC). As Flink is not a database, transaction logic is contained in the application code (the transaction function), with data persisted in memory or in RocksDB. The event log and timestamps are used for commits and optionally can output them in result streams. Streaming Ledger can operate in Flink's exactly-once or at-least-once operating modes, where exactly once provides stronger durability but at the cost of higher latency.

Data Artisans is targeting streaming ledger for funds transfer, data consolidation, and for funneling features in real-time to feed a machine learning model. The technology is a proprietary add-on to Flink for which Data Artisans has a patent pending.