The rise of Kubernetes epitomizes the transition from big data to flexible data
Can a platform conceived to support running ephemeral applications become the operating system of choice for running data workloads in the multi-cloud and hybrid cloud era? It looks like it, but we're not quite there yet.
Kubernetes is not exactly an under-the-radar technology. KubeCon, the main Kubernetes event in the US, has been sold out for a while. People such as Sarah Wells, technical director for Operations and Reliability at the Financial Times, point to its phenomenal growth as a sign the technology is "crossing the chasm" from the early adopters to the early majority.
The key driver behind Kubernetes' popularity is its ability to help the people whose job is to make sure applications are seamlessly deployed and run, both on premises and in the cloud. Kubernetes is evolving from supporting simple, stateless applications to sophisticated, data-driven ones, and data platform providers are taking note.
ZDNet spoke with two of the trailblazers on the transition from big data to flexible data, DataStax and Hortonworks. Their insights help us map where we are on this journey.
From Big Data to Flexible Data
It's no secret: Big data as we know it is dead. Not that data volume, variety, velocity, and veracity are showing any signs of breaking down -- on the contrary. It's just that the realities of the underlying technology have changed, and with them, the architectures and the economics are changing, too.
Hadoop, for example, which has been the poster child of the big data era, was built in a world with different fundamental assumptions than the world we live in today. A world in which network latency was a major bottleneck, and cloud storage was not a competitive option. In that world, most data was on premises, and making sure data was co-located with compute to avoid having to move either around made a lot of sense.
Today, network latency is less of an issue for cloud providers, and there are more of them to choose from, so we are talking about multi-cloud. Furthermore, for an array of reasons, many organizations are also deploying their own private clouds on premises, so we are talking about hybrid cloud. We are facing a situation in which data is still big, but it also needs to be flexible.
Clouds were largely built on the abstraction of the virtual machine. A virtual machine is a layer on top of the hardware that emulates a physical machine with an operating system -- on which applications can be deployed. The problem with this approach, however, is that it's not fine grained, and it introduces dependencies.
If application A needs version 1 of library X to operate, and application B needs version 2, that's not so easy to manage. And if application A crashes, there's a chance it will bring down the entire virtual machine, affecting application B too. So the idea was to include everything an application needs to run in one container, without any external dependencies.
Initially, to keep things simple, containers were designed for ephemeral applications only: Relatively short-lived applications that do not need to store state. But as containers became more and more popular as a way to homogenize application deployment across multi-cloud and hybrid cloud, we reached an inflection point. Kubernetes has risen as the de facto standard operating system for the cloud era.
Kubernetes helps orchestrate containers: It provides the resources they need and manages their lifecycle. Since supporting only stateless applications would cancel out the promise of an operating system for the cloud, Kubernetes has started adding mechanisms to support stateful applications. Data platform providers are going with the flow, and porting their platforms to run on it.
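The main mechanism Kubernetes offers here is the StatefulSet, which gives each replica a stable network identity and its own persistent volume that outlives the container. A minimal sketch -- the names, image, and sizes below are illustrative, not any vendor's actual deployment:

```yaml
# Illustrative StatefulSet: each replica gets a stable hostname
# (db-0, db-1, ...) and its own PersistentVolumeClaim, so data
# survives container restarts and rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db            # headless Service providing stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
      - name: db
        image: postgres:10   # any stateful image would do here
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # one PersistentVolumeClaim is created per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
```

Unlike a stateless Deployment, replicas here are created and removed in a strict order, and their volumes are not deleted when a pod is rescheduled -- which is exactly the property databases need.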
Saumitra Buragohain of Hortonworks explained why that is changing:

"Eighty percent of applications on Kubernetes are stateless, and as persistent storage technology for containers matures, more PostgreSQL and MySQL databases will start to be deployed on Kubernetes containers.
When that happens, all parts of a microservices architecture can be hosted on Kubernetes, which is an evolutionary path. There is also existing security practice, which was designed for bare metal; as applications are deployed in containers, existing tools need to evolve. So, the Kubernetes ecosystem will keep evolving, and that's the reason we threw our hat in and joined the Cloud Native Computing Foundation (CNCF)."
But the PostgreSQLs and MySQLs of the world are not all there is to deploy on Kubernetes. DataStax Enterprise (DSE), a proprietary database built on open source Cassandra, is a NoSQL database, but that may not make much of a difference in the end -- at least in terms of the end goal of addressing multi-cloud and hybrid cloud deployment environments.
Kathryn Erickson, senior director of Strategic Partnerships at DataStax, noted that Kubernetes is a great example of open source in action:
"The community and vendors backing Kubernetes provided a revolutionary solution to scaling stateless apps. The next logical step is to further simplify infrastructure management by integrating the database backends supporting those stateless applications into the same orchestration layer.
We see the community responding to this demand and evolving the project to support more stateful services. This has involved multiple approaches and sometimes false starts, but that's the nature of community-driven development and Kubernetes is converging on a successful story here."
Porting something like Hadoop or DSE to Kubernetes won't be painless. Buragohain conceded there have been challenges along the way. Some are being addressed, he said, and some are yet to be addressed -- and that's where an opportunity lies. Buragohain singled out persistent storage, scheduling, security, and networking as the key challenges to work on:
"This is why heavily investing in next-generation storage is key, as it will overhaul the HDFS architecture and solve for scale and multiple protocols (iSCSI, NFS, S3) via the Container Storage Interface. There are many other considerations, such as compute/storage locality or strong consistency. For example, HBase's low-latency design requires compute and storage to be co-located."
Erickson also noted that orchestrating stateful applications is possible now that StatefulSets have matured, but some tasks are still easier to automate than others:
"Adding a node is simple, but removing a stateful node requires a deeper integration to ensure that Kubernetes gracefully handles redistributing that node's data.
Another pain point we have is that point upgrades of DSE can be as simple as a rolling container replacement, while major upgrades require additional orchestration, which varies depending on which portions of the database a customer has enabled. Essentially, any automation of operational database tasks requires not only a deep understanding of DSE, but also of Kubernetes."
It may help to note here that Kubernetes differs from a "regular" operating system in that it is based on different metaphors -- events, streams, queues, and blocks. Kubernetes is fundamentally asynchronous, so an issue such as job scheduling, for example, has to be resolved in a different setting.
In the big data world, said Buragohain, customers have business analysts running interactive sub-second queries for reporting, data engineers running batch ETL jobs, and data scientists running GPU-intensive deep learning model training, and they all have different needs:
"The elastic paradigm also needs to be supported, as thousands of big data jobs are run in a shared multi-tenant cluster. Hortonworks has invested in the world's most hardened scheduler, Apache YARN, to provide queues to submit jobs. Over the years, we have invested in various techniques such as queue priority, min/max capacity, affinity/anti-affinity etc.
These have been hardened with years of production deployments in some of the largest installed bases. Kubernetes does not have a capacity scheduler like YARN and we see an opportunity for ourselves there."
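Buragohain's contrast with YARN can be made concrete. Kubernetes expresses capacity per container, through resource requests and limits, rather than through shared, prioritized queues. A hedged sketch, with all names and numbers purely illustrative:

```yaml
# Illustrative pod spec: the Kubernetes scheduler places pods based
# on per-container requests/limits. There is no built-in YARN-style
# capacity queue with priorities and elastic min/max shares.
apiVersion: v1
kind: Pod
metadata:
  name: etl-batch-job          # hypothetical workload name
spec:
  containers:
  - name: etl
    image: example/etl:latest  # hypothetical image
    resources:
      requests:                # what the scheduler reserves for the pod
        cpu: "2"
        memory: 4Gi
      limits:                  # hard ceiling enforced at runtime
        cpu: "4"
        memory: 8Gi
```

Namespace-level ResourceQuota objects do provide coarse multi-tenant limits, but not the queue priorities and elastic min/max capacities Buragohain describes -- which is the gap Hortonworks sees as its opportunity.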
Buragohain went on to add that as customers move from a bare-metal server world to a container world, existing firewall policies might not work, so a security paradigm for the new container world is definitely needed.
In terms of networking, he noted, containers cannot talk to containers on different physical machines unless an overlay network is created -- unlike a physical or virtual machine, which has a full networking stack:
"Kubernetes has done a good job in creating a standardized container networking interface, so that containers can leverage third-party software-defined networking frameworks such as Calico. This now requires those working with servers, containers, networks, and applications to work together more closely.
When supporting big data workloads on Kubernetes, businesses need to invest in the container networking interface and make sure all workloads work end to end across networking, security, storage, and containers."
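Once a policy-capable CNI plugin such as Calico is in place, pod-to-pod traffic can be governed declaratively. A minimal sketch -- the labels and policy name are illustrative assumptions:

```yaml
# Illustrative NetworkPolicy: only pods labeled role=worker may reach
# the database pods, and only on port 9042 (Cassandra's CQL port).
# Enforcement requires a CNI plugin that supports NetworkPolicy,
# such as Calico.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-workers-to-db
spec:
  podSelector:
    matchLabels:
      app: db                  # the pods being protected
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: worker         # only these pods get through
    ports:
    - protocol: TCP
      port: 9042
```

This is the container-world replacement for the bare-metal firewall rules Buragohain mentioned: the policy travels with the workload rather than being tied to a physical machine.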
Both DataStax and Hortonworks seem to converge on convergence: They note that in order for something like a cloud-native operating system to emerge, allowing big data to become flexible data, many aspects need to be resolved. For this to work, in turn, consensus and coordination will be required. Could this also involve compromises?
Erickson said DataStax is not making compromises to run on Kubernetes, but evolving to meet customer demand in this space: "Providing Docker images was a start, but we've also just released a metrics collector which aggregates DSE metrics and integrates with existing monitoring solutions like Prometheus and Grafana."
Erickson added: "We're also helping customers take steps toward Kubernetes by advising them on using brokers which make DSE a discoverable service in Kubernetes without a deep integration."
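For the monitoring integration Erickson mentions, pointing Prometheus at a metrics endpoint is a matter of a scrape configuration. A minimal sketch -- the job name and target address are illustrative assumptions, not DataStax's actual collector endpoint:

```yaml
# Illustrative prometheus.yml fragment: scrape a database metrics
# collector's HTTP endpoint every 15 seconds. Grafana then reads
# from Prometheus to build dashboards.
scrape_configs:
  - job_name: dse-metrics              # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets:
          - dse-metrics-collector:9103 # hypothetical host:port
```

The appeal of this model is that the database only has to expose metrics in the Prometheus format; the rest of the monitoring stack is the customer's existing tooling.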
Buragohain noted that Kubernetes is largely a community effort based on collaboration across a wide variety of technologies (networking/storage/security), so there is a distinct advantage here for vendors with an open source model.
"This is the reason why we want to participate in CNCF and help the broader community with a big data centric architecture based on Kubernetes," Buragohain said.
Erickson believes Kubernetes will become a significant enabler for serverless application architectures, which will eventually dominate how users interact with databases:
"For orchestration of the database itself, we will lead with open source Kubernetes tooling and guidance which can be adapted to work within our Enterprise Technology Partner implementations as customer demand justifies.
For cloud database offerings, the underlying orchestration matters less as it is abstracted from end users. As Kubernetes matures, database companies will replace antiquated automation backends with Kubernetes.
In particular, there is a lot of demand for hybrid database technology that can work the same way on premise and in the cloud, and Kubernetes can play a big role in enabling operations automation in both worlds."