Down an invisible road: Service mesh’s unpaved path to practicality
A solution for maintaining connectivity among thousands of microservices in a network may have come just in time, or even a moment too late for some. Yet just behind it may be another solution to the traffic jam it’s expected to cause.
"Applications are going through a reimagining process," declared Vijoy Pandey, Cisco's vice president and CTO of cloud computing. "You're taking an application, and breaking it down into smaller pieces: microservices, containers. We're going thinner and thinner."
Breaking down applications on a server into discrete, compartmentalized, containerized functions offers a wealth of benefits. It lets data centers distribute microservices to the processors and systems best suited for them. It enables massive services to evolve continually and incrementally. It improves security by enforcing policies that govern how these functions communicate, and by separating databases and data streams into virtual appliances that are guarded by gateways.
There is a tradeoff, and it's at least as large: It decomposes systems into colossal Petri dishes of seemingly indistinguishable micro-organisms.
Truth in visibility
"We've fully embraced the idea of microservices," wrote Calvin French-Owen, co-founder of data and analytics infrastructure provider Segment, in 2015. The reason, French-Owen explained, was visibility: "When we're getting paged at 3 a.m. on a Tuesday, it's a million times easier to see that a given worker is backing up compared to adding tracing through every single function call of a monolithic app."
Three years later, Segment engineer Alexandra Noonan reported a very different experience. "Instead of enabling us to move faster," she wrote, "the small team found themselves mired in exploding complexity. The essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded. Eventually, the team found themselves unable to make headway, with three full-time engineers spending most of their time just keeping the system alive. Something had to change."
It was a whiplash effect that software engineers immediately noticed, including at Microsoft.
"Some companies certainly give up," remarked Microsoft product manager Brian Redmond, during a session at KubeCon 2019 in San Diego. "If you go down a path of microservices, you're probably going to need better procedures, and things to do automation around them."
In addition to consolidated practices around testing and deployment, Redmond suggested a service mesh. Its principal benefit, as he described it, citing a lengthy blog post from William Morgan, co-creator of a service mesh called Linkerd (pronounced "linker · dee"), is that it decouples the functions that make connectivity possible in a microservices application from the actual jobs those microservices are depended upon to perform. The term "mesh" refers to a network topology that has no hierarchical or symmetrical structure whatsoever, where any node is theoretically capable of connecting with any other node.
Understanding why this bizarre-sounding tool has suddenly become so vitally necessary requires us to dial back more than half a century.
The subroutine resurfaces
In the 1960s, a computer program was an enumerated sequence of instructions. The first reusable code in such a program was an instruction cluster that was called by the number of its first line, like GOSUB 900, and that ended with the dead-end statement RETURN. Reusable code is a necessity in every application. It was one way of making algorithms -- functions that seek their solutions using repetition -- feasible. Another was a repetitive loop clause, marked at the top with FOR and the bottom with NEXT, but even here, subroutines were often called by the code in-between.
As programming languages evolved, large blocks of reusable code were stacked into libraries. At first, these libraries were compiled directly into the object code, making executable files bulkier and bulkier. Later, in the first "distributed programming" systems (the phrase means something much different today), separate library files could be dynamically linked with each other. This made multitasking feasible in Microsoft Windows. A common method for invoking functions across process or machine boundaries was the remote procedure call (RPC).
But even those earliest models relied on binding platforms, such as a common operating system, or some shared stretch of infrastructure. In today's networked environments, distributed functions no longer share anything but a protocol to contact one another -- a common interface. So the methodologies for threading distributed functions and microservices together, to borrow a phrase that may have special meaning to regular readers of ZDNet Scale, have been all over the map.
"As we go thinner and thinner, the connectivity between these pieces of a singular application, is getting worse and worse," remarked Cisco's Pandey, speaking with ZDNet. "Things that were library calls within a monolithic app are now RPC calls across the globe, potentially. And the developer doesn't give a damn."
It's not that the software developer does not care about the quality of service -- that's not what Pandey is implying here. He's explaining that the glue that binds functions together in a singular application has been replaced over the years in evolutionary surges: starting with the collective binding of the source code, then with multiple sets of dynamically linked functions, then with Web services contacted using REST API calls, and now with something else entirely: the network overlay. In this extremely distributed model, where every component is a virtual appliance with a network address, the gaps between networked functions may be immeasurably small, or they may be the size of the planet.
None of the activity that brings these functions together as a cohesive unit is the concern of any of the software developers responsible for these unlinked functions. In most cases, they invoke functions with little more concern than they had when they wrote GOSUB 900.
"For the developer. . . it's just doing a single functional call and returning a value, or returning something. So it's really thin," Pandey went on, "and it's trying to connect to another function, which could be anywhere -- it could be on-prem, it could be in a different [cloud] region, it could be in a GCP [Google Cloud Platform] BigQuery -- it could be anything! And rightfully so, the developer says, 'Why should I care? I'm trying to build this application as quickly as possible, because your business, my business depends on this. Why should I care about the infrastructure?'"
Like "Web services" in 1999, "service mesh" has yet to be generally understood, even by people who may be using it right this moment. Enterprise networks are software-defined. They use a second set of addresses (an overlay) that is mapped not to appliances, switches, or routers, but to the gateways of applications and services. Right now, most of these overlays require Internet-style lookups of the Domain Name System (DNS) for any Web-routed request to locate the correct destination.
A service mesh would enable addresses to be mapped to functions more directly, thus simplifying the addressing scheme. Service discovery (finding the right location for a requested function) could be accomplished by any number of more logical, more programmable methods, which may involve DNS or which could replace it entirely. These methods may be governed by policy, making them far more secure. What's more, in a network using microservices where functions are scattered about and redistributed according to the unpredictable dicta of the orchestrator (in this case, very likely Kubernetes), service mesh may be necessary to resolve requests to mercurially changing addresses.
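The idea of resolving a requested function to whatever address it holds at the moment, under the control of a policy, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Linkerd, Istio, or Kubernetes implements discovery: the ServiceRegistry class, the service names, the addresses, and the allow-list policy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory registry mapping service names to live addresses."""
    # Illustrative policy: caller name -> services that caller may reach.
    allowed: Dict[str, Set[str]] = field(default_factory=dict)
    endpoints: Dict[str, List[str]] = field(default_factory=dict)
    cursor: Dict[str, int] = field(default_factory=dict)

    def register(self, service: str, address: str) -> None:
        """Record a new instance, e.g. when the orchestrator starts one."""
        self.endpoints.setdefault(service, []).append(address)

    def deregister(self, service: str, address: str) -> None:
        """Forget an instance that was stopped or rescheduled elsewhere."""
        live = self.endpoints.get(service, [])
        if address in live:
            live.remove(address)

    def resolve(self, caller: str, service: str) -> str:
        """Return one current address for `service`, rotating round-robin."""
        if service not in self.allowed.get(caller, set()):
            raise PermissionError(f"{caller} may not call {service}")
        live = self.endpoints.get(service, [])
        if not live:
            raise LookupError(f"no live instance of {service}")
        index = self.cursor.get(service, 0) % len(live)
        self.cursor[service] = index + 1
        return live[index]


# Usage: the caller asks for a name; the address behind it is free to change.
registry = ServiceRegistry(allowed={"checkout": {"inventory"}})
registry.register("inventory", "10.0.1.17:8080")
registry.register("inventory", "10.0.2.44:8080")
before = registry.resolve("checkout", "inventory")
registry.deregister("inventory", "10.0.1.17:8080")  # instance rescheduled
after = registry.resolve("checkout", "inventory")
```

The caller never stores an address. It names the function it wants, and both the policy check and the choice of instance happen at the moment of the request, which is what lets the orchestrator move instances around freely.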
At one level, it's a networking technique. Yet service mesh may become the switchboard for the next version of the cloud. If Kubernetes continues blazing trails for distributed computing systems and reforming the way cloud infrastructure is constructed, then service mesh will follow in its wake.
What has yet to be determined are the following: Which service mesh will it be? Who will ultimately be responsible for it? And how soon will the wide adoption of this technology in data center infrastructure change the way functions and services are utilized in every network, everywhere?
"We learned through the evolution of OpenStack and Kubernetes that there is a good mechanism of handling the design of a distributed system," explained Brian "Redbeard" Harrington, Red Hat's very accurately nicknamed principal product manager for his company's Service Mesh platform, based around Istio. "Which is to say, there is a control plane and there is a data plane."
Software-defined networking (SDN) brought forth the architectural principle of separating traffic related to the user application from traffic involving control and management of the system, into two planes as Redbeard described. Originally, this was to protect the control functions of the network from being disrupted by erroneous or even malicious functionality in the userspace. But in a distributed network where applications are containerized (using the highly portable virtualization system first made popular by Docker), and where connectivity is enhanced with a service mesh, this plane separation makes feasible a new concept: a containerized function that does not need a map of the network it inhabits, or even to query something else for information from such a map, to behave as part of the network.
"Historically with service mesh, part of the purpose was to glue the intercommunications between services together," continued Red Hat's Redbeard, "handling the discovery and rerouting of traffic. With Istio, we are able to achieve that same goal."
Each instance of a service in this distributed network may be equipped with a kind of proxy called a sidecar, which serves as the connectivity agent on the data plane, handling incoming and outgoing requests. For Linkerd, this sidecar is given the undramatic name linkerd-proxy, and is depicted residing in the data plane in the diagram above; for Istio, whose architecture actually differs very little, the sidecar proxy is provided by a CNCF project named Envoy. In this diagram, each blue rectangle represents a Kubernetes pod; each grey box within it is a container. Within a pod, its containers all share resources, so the application in the data plane has access to linkerd-proxy, but does not need to know or care about anything in the controller pod -- to borrow Vijoy Pandey's phrase, the developer doesn't give a damn. All the network and service configuration functions, along with monitoring and visibility, are handled by containers stationed on the control plane, while the sidecar travels with the application through the system.
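That division of labor between a sidecar and the control plane can be modeled roughly as follows. This is a toy sketch, not the actual interface of linkerd-proxy or Envoy: the ControlPlane and Sidecar classes, the route table, and the service names are hypothetical, and a plain Python function call stands in for the network.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], str]  # stand-in for "a destination on the network"


class Sidecar:
    """Hypothetical proxy that travels with one application instance."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.routes: Dict[str, Handler] = {}      # filled in by the control plane
        self.stats: List[Tuple[str, bool]] = []   # (destination, succeeded)

    def call(self, service: str, payload: str) -> str:
        """Forward an outgoing request and note whether it succeeded."""
        handler = self.routes.get(service)
        if handler is None:
            self.stats.append((service, False))
            raise LookupError(f"{self.owner}: no route to {service}")
        result = handler(payload)
        self.stats.append((service, True))
        return result


class ControlPlane:
    """Hypothetical control plane: owns configuration, collects statistics."""

    def __init__(self) -> None:
        self.routes: Dict[str, Handler] = {}
        self.sidecars: List[Sidecar] = []

    def attach(self, sidecar: Sidecar) -> None:
        """Start managing a sidecar and hand it the current routes."""
        self.sidecars.append(sidecar)
        sidecar.routes.update(self.routes)

    def push_route(self, service: str, handler: Handler) -> None:
        """Change where a service name leads, for every attached sidecar."""
        self.routes[service] = handler
        for sidecar in self.sidecars:
            sidecar.routes[service] = handler

    def scrape(self) -> Dict[str, List[Tuple[str, bool]]]:
        """Gather per-sidecar request records for monitoring."""
        return {s.owner: list(s.stats) for s in self.sidecars}


def checkout_app(proxy: Sidecar, item: str) -> str:
    """Application code: it knows a service name and nothing about routing."""
    return proxy.call("inventory", item)


# Usage: traffic is rerouted without touching checkout_app.
control = ControlPlane()
proxy = Sidecar("checkout")
control.attach(proxy)
control.push_route("inventory", lambda item: f"v1 has {item}")
first = checkout_app(proxy, "widget")
control.push_route("inventory", lambda item: f"v2 has {item}")
second = checkout_app(proxy, "widget")
```

The application function is identical before and after the reroute. Everything that changed lives in the proxy's table, which is the point of keeping the sidecar next to the application and the decisions on the control plane.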
Like most everything in the "container ecosystem," a complete service mesh is an assembly of multiple components, some of which even hail from other open-source projects.
"For observability with Istio, and specifically in the context of OpenShift Service Mesh," continued Redbeard, "we mean two things: the idea of tracing our applications, generally through the use of Jaeger, but also collecting and scraping of metrics out of the actual proxy components with Prometheus. . . The distributed tracing piece actually refers to making a request across a call chain, graphing the latency between all of those, and even capturing header information from the requests in a sampled manner."
Sampling has been the key to monitoring highly complex systems, especially when events compound one another to render an effective logging system impossible. Rather than recording every transaction in a ledger whose size would soon span galaxies, a sampling system gathers enough information about the behavior of each component in the network for a monitoring system to ascertain whether any part of that behavior is worth monitoring more closely.
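A rough sketch of the sampling idea follows, assuming a simple probabilistic scheme in which the decision to keep or drop a request is made once, at the start of the call chain, and honored by every hop after it. The Tracer class, the hop names, and the 1-in-100 rate are illustrative assumptions, not the client interface of Jaeger or any other tracing product.

```python
import random
from typing import Dict, List, Optional, Tuple


class Tracer:
    """Hypothetical sampler: keeps full detail for a fraction of requests."""

    def __init__(self, sample_rate: float = 0.01,
                 seed: Optional[int] = None) -> None:
        self.sample_rate = sample_rate
        self.rng = random.Random(seed)
        # trace id -> list of (hop description, latency in milliseconds)
        self.spans: Dict[int, List[Tuple[str, float]]] = {}
        self.total_requests = 0

    def start(self, trace_id: int) -> bool:
        """Decide once, at the first hop, whether this request is traced."""
        self.total_requests += 1
        sampled = self.rng.random() < self.sample_rate
        if sampled:
            self.spans[trace_id] = []
        return sampled

    def record(self, trace_id: int, hop: str, latency_ms: float) -> None:
        """Store a hop only if its trace was sampled; otherwise do nothing."""
        if trace_id in self.spans:
            self.spans[trace_id].append((hop, latency_ms))


# Usage: every request is counted, but only a small share is recorded in full.
tracer = Tracer(sample_rate=0.01, seed=42)
for request_id in range(10_000):
    tracer.start(request_id)
    tracer.record(request_id, "checkout -> inventory", 3.2)
    tracer.record(request_id, "inventory -> database", 11.7)
kept_fraction = len(tracer.spans) / tracer.total_requests
```

Because the decision is made per request rather than per hop, each trace that is kept is complete from end to end, so latency can still be graphed across the whole call chain for the sampled requests.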
Red Hat has led the development of OpenShift's own component for this purpose, called Kiali [pictured above]. This is perhaps the functionality that Segment could have used three years earlier. "We have glued a bunch of these components together," Redbeard continued, "that we think any operator or administrator is going to need, to have a well-running service mesh, and coordinated them through the [Red Hat] Operator Framework."
Cisco offers an alternate, or perhaps additional, approach that it calls Network Service Mesh (NSM, not to be confused with Cisco Network Services Manager, Cisco Network Segmentation Manager, or Cisco Network Security Monitoring [PDF]). As Vijoy Pandey reminded us, network services in the classic OSI model have been compartmentalized into seven layers, with user application traffic inhabiting the highest, Layer 7. The viability of service meshes such as Linkerd and Istio, he told ZDNet, presumes that the underlying network layers -- particularly Layer 2 (data link), Layer 3 (network), and Layer 4 (transport) -- are already fully resolved. He made the case that a service mesh can resolve connecting a single Kubernetes pod to the underlying network, but not so much one pod to another pod. Cisco's NSM effort, he said, has been joined by Red Hat and VMware.
"NSM basically solves that Layer 2, Layer 3, Layer 4 connectivity problem," said Pandey, "across a multi-cloud domain, in a manner that is friendly to the developer.
"Let's break it down into two solid verticals," he continued. "One vertical has everything to do with containers and Kubernetes, but that purest type of environment is only in our dreams. The other vertical is where there are containers and Kubernetes, along with bare metal, mainframes, VMs, and everything else under the sun -- the brownfield mess that all of our customers deal with. In the Kubernetes/containers/serverless environment, we are assuring that NSM is going to be open-source 'til the cows come home. We are making sure that connectivity problem, for a Kubernetes container deployment, is always going to be open source and free. In that perspective, NSM is going to be a first-class citizen with any Kubernetes deployment."
At the time of our interview, he said, NSM was within two production deployments of meeting the minimum criteria (there should be three) that Kubernetes maintainer CNCF mandates to qualify for "incubation" -- to be fully maintained by CNCF, like Kubernetes itself was in its first years. By next year, Cisco hopes to have five independent users for NSM. At that point, he hopes, NSM can become a part of every CNCF-certified Kubernetes deployment. In other words, in this hopeful and bright future, if you're deploying Kubernetes, you're deploying NSM.
What this would mean, if everything turns out as Cisco has planned, is that the class of Layer 2 and Layer 3 infrastructure that telecommunications providers say they require to make virtualized network functions (VNF) run in secure isolation will be achievable using containerized network functions (CNF) instead. NSM would weave a dedicated network of addresses supporting the orchestrator and other systems that maintain workloads. Then a Layer 7 service mesh, such as Istio or Linkerd, would still weave a network overlay for connecting those workloads, but this time with the assurance of underlying connectivity -- even if the nodes in which those workloads are maintained reside on different cloud platforms.
It would be an altogether different Internet than the one we rely upon today, one which may be less dependent upon DNS for resolving fixed addresses -- some have speculated it could be independent of DNS. The definition of domains could change, and thus domain names as we have come to know them. It could make applications and services available to users on a per-use basis, perhaps independently of the cloud or telco that hosts them.
But this is not a prediction. We've been at this crossroads before, only to make U-turns, or to get stuck spinning in traffic circles. Reimagining the way things should be in our data centers has already become an industry unto itself. Yet this time, there is one underlying certainty: There are many organizations that, like Segment, have no more time to waste. Reality demands one permanent solution.