Cilium, Service Meshes, and the Future of Enterprise Networking

Thomas Graf is the co-founder and CTO of Isovalent, and creator of a popular open source (and cloud native) networking technology called Cilium. Cilium is built atop a kernel-level Linux technology called eBPF.

In this interview, Graf discusses the roles that Cilium and eBPF play in the growing cloud-native networking ecosystem, as well as some broader trends around Kubernetes adoption and evolution. He explains who’s using and buying Kubernetes within large enterprises, where cloud native infrastructure still needs to improve, and how the desire for standardization is driving innovation.

FUTURE: How should we think about eBPF and Cilium in the context of computing and networking, in general, and then specifically in the context of the cloud native ecosystem?

THOMAS GRAF: Overall, eBPF is the tech, and it’s extremely low-level. It was designed for kernel developers, and my background is in kernel development. eBPF is to the kernel, to the operating system, what JavaScript is to a browser. It makes the operating system programmable just like JavaScript makes the browser programmable. In the past, we had to upgrade our browser versions to actually use certain websites. And then JavaScript came, and all of a sudden application teams and developers could build massive applications — to the point where the most popular word-processing application got replaced by an in-browser application. It led to a huge wave of innovation.

The same is happening with eBPF, although on the operating system level, because all of a sudden we can do things at the kernel or operating system level where we see everything and control everything — which is very important for security — without having to change kernel source code. We can essentially load programs into the kernel to extend its functionality and bring new capabilities with it. This also has unlocked a massive wave of innovation. Hyperscalers like Facebook, Google, and Netflix are using this on their own, directly, with their own kernel teams.
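To make the "load programs into the kernel" idea concrete, here is a toy model in Python. It is only an analogy, not real eBPF: actual eBPF programs are typically written in restricted C, checked by the kernel's verifier, and attached to kernel hook points. Every name below (`ToyKernel`, `attach`, `block_telnet`, the `packet_in` hook) is hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical verdicts a hook program may return (illustrative names only).
PASS, DROP = "pass", "drop"

Program = Callable[[dict], str]


class ToyKernel:
    """Stand-in for a kernel with fixed hook points; not a real eBPF API."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Program]] = {"packet_in": []}

    def attach(self, hook: str, program: Program) -> None:
        # Extend behavior at runtime; the ToyKernel source stays untouched.
        self._hooks[hook].append(program)

    def handle_packet(self, packet: dict) -> str:
        # Run every attached program; the first DROP verdict wins.
        for program in self._hooks["packet_in"]:
            if program(packet) == DROP:
                return DROP
        return PASS


def block_telnet(packet: dict) -> str:
    """Example 'loaded program': drop traffic aimed at port 23."""
    return DROP if packet.get("dst_port") == 23 else PASS


kernel = ToyKernel()
before = kernel.handle_packet({"dst_port": 23})   # no program attached yet
kernel.attach("packet_in", block_telnet)
after = kernel.handle_packet({"dst_port": 23})    # same packet, new behavior
```

The point of the sketch is the shape of the mechanism: the core stays fixed, and new behavior arrives as small programs attached to well-defined hooks where all traffic is visible.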

What Cilium brings to the table is it uses that low-level eBPF technology to essentially provide a new wave of software infrastructure, particularly for the cloud native wave. Think of this like software-defined networking and what Nicira, which became VMware NSX, did for the virtualization industry. We are doing the same for cloud native, where it’s a mix of cloud provider or public cloud infrastructure, as well as on-premises infrastructure. And we’re solving networking, security, and observability use cases with that at the infrastructure layer.

And the Cilium Service Mesh, which was just released, is an evolution of these capabilities?

What’s currently happening, since about a year ago, is that the two spaces are colliding. What Cilium has been doing so far is focused on networking, virtualized networking, and then cloud native networking — but still networking. But then, coming at it from the top down, were application teams at Twitter and Google doing service mesh stuff — in the application first, and then the sidecar-based model, the proxy-based model, which is what projects like Istio deliver. And now these two layers are coming closer because traditional enterprises are coming into the cloud native world, and they have enterprise networking requirements, but their app teams also want a service mesh.

Gartner is calling this new layer “service connectivity” — we’ll see if that term catches on — but it’s essentially a layer that includes the enterprise networking piece and the service mesh piece that is coming from the application teams. And because that’s what customers are demanding, we have added the capabilities into Cilium itself. So, essentially, Cilium is going upward from the enterprise networking side and the service meshes are going downward into more of the networking side.

Service mesh

Per Wikipedia: A service mesh is a dedicated infrastructure layer for facilitating service-to-service communications between services or microservices, using a proxy. A dedicated communication layer can provide a number of benefits, such as providing observability into communications, providing secure connections, or automating retries and backoff for failed requests.
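As a rough illustration of the "retries and backoff" a mesh proxy automates on behalf of applications, the following Python sketch retries a failing call with exponential backoff. It is a minimal example under stated assumptions, not how Cilium, Istio, or any particular mesh implements it; the function name `call_with_retries`, its parameters, and the simulated `flaky_service` are all hypothetical.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Double the wait after each failure: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")


# Simulated upstream service (an assumption) that fails twice, then recovers.
calls = {"count": 0}
delays: List[float] = []


def flaky_service() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"


# Record the delays instead of actually sleeping, to keep the example fast.
result = call_with_retries(flaky_service, sleep=delays.append)
```

In a service mesh, this logic lives in the infrastructure layer rather than in each application, so every service gets the same behavior without carrying its own retry code.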

Why is there so much focus on the networking and service mesh level of the Kubernetes stack?

Because with the desire to run in multiple clouds and to split applications apart into containers, the connectivity layer has become central. What used to be maybe inter-process communication and middleware is now the network, so the network is becoming absolutely essential for applications to talk to each other and for data to flow.

And in cloud native, in particular, multi-cloud is becoming absolutely essential. All the cloud providers have their own networking layers, but, of course, tailored to their own clouds. They do have on-prem offerings, but they’re not truly multi-cloud. Cilium and eBPF bring to the table that multi-cloud, agnostic layer. It behaves exactly the same on-premises as it does in the public cloud. Several of the public cloud providers are using Cilium under the hood for their managed Kubernetes offerings, and telcos are using it for on-prem 5G infrastructure. It’s about speaking both languages and connecting these worlds together.

That’s why there is so much focus on this: because one of the easiest ways for cloud providers to lock customers in is to own that connectivity layer. I think from a strategic infrastructure perspective, just like the virtualization layer was key, now the connectivity and network layer is absolutely key.

Kubernetes is pretty widely accepted and adopted at this point, but there’s still talk in some circles of it being overkill. Who do you think Kubernetes, and the cloud native ecosystem overall, is for?

It’s for modern application teams. I think the realization has kicked in that if you want to attract modern application teams, and be able to have quick go-to-market times, you need to provide them cloud native infrastructure. We often see prototyping — initial, pre-MVP, even proving out the concept or selling internally — on serverless, something like Lambda. And then on Kubernetes, because the app teams can own the infrastructure directly. And then, as it moves into production, they go to enterprise, on-prem Kubernetes distributions. But that’s actually a relatively small portion of the entire infrastructure, maybe a single or low double-digit percentage.

It clearly will be the new standard, though. Virtualization adoption was very slow initially, and people said it was overkill, but over time it came to replace the majority of things. We’ll see the same here. It’s also like what happened with modern languages. People said Java was overkill, and it probably still is for a lot of applications, but there was a time when it became very hard to do any application development outside of Java because that’s what the majority of application developers could write in. The same will be true for modern application teams: they will expect to have Kubernetes around in order to develop with more agility and bring products to market quickly.

On the infrastructure side, it might be a bit of overkill, but if the alternative is to rewrite an application from serverless into on-prem, that’s a massive task. So Kubernetes is the middle ground there, which is very attractive.

What about the idea that Kubernetes still needs a better developer experience?

If we look at the original OpenShift, before it rebased onto Kubernetes, it offered exactly that. It was even closer to the application team and was an even better application developer experience. You could push to Git and it would automatically deploy. Heroku also tried this, but as a SaaS offering.

Kubernetes took a step backward and said, “We need to keep some operational aspects in it and make it a bit closer to what a sysadmin would expect, as well. We cannot be only tailored to applications.” It’s the middle ground: It needs to have enough attractiveness for application teams, but it still needs to be possible to run that app outside of a specific environment, and to have it managed by people other than application developers.

How’d we get to this point? Was this the natural evolution from platform-as-a-service (PaaS) and application containers?

It was Docker images and the packaging aspect of Docker. The old school was how to deploy into virtual machines, and there was all sorts of automation around that. And then there was what Facebook was doing with Tupperware — very custom-built and for really large scale. And then Docker came around and essentially provided this container image and everybody could treat it like a miniature VM. I can now distribute my app and instead of a 600MB virtual image, it’s now a 10MB container. But you can treat it the same, it has everything it needs.

That unlocked the ability to bring in an orchestrator like Kubernetes that still allows you to treat applications like mini VMs, but then also take one step further and actually treat them as microservices. It allows you to do both.

I would say the biggest step between Docker and Kubernetes was that Docker was all about developer experience. It solved that part, but did not solve the public-cloud ecosystem part. It did not have, or necessarily want, close integration with the cloud providers. Kubernetes solved that.

Who do you see running Kubernetes inside companies? Is it individual application teams?

There’s an interesting shift that happened with cloud native, which is that we have the rise of the “platform team,” I’ll call it. They’re not application engineers. They have a bit of network ops knowledge and they have quite a bit of security knowledge. They have SRE knowledge and they know how to do cloud automation. They are providing the platform for application teams, and treating those application teams as their customers.

Platform teams are the ones buying Kubernetes and related technologies, which they use because they are tasked with providing that next-generation infrastructure to make modern app teams happy.

Is that a net-new buyer or a net-new team? Or are platform teams something that already exists inside places like Google or Facebook and is now going mainstream?

They’re mostly a new team. I think they are, to some extent, like the SRE teams at Google and Facebook. However, the application teams probably own more of the app deployment in enterprises, because enterprises don’t have this very clear distinction between software engineers and SREs like Google and Facebook do. I would say this evolution is very similar to how you had virtualization teams, and then lots of network ops migrated from — or evolved or advanced from — being about network hardware to being about network virtualization. And these teams, for example, started to operate VMware NSX. The same is happening here.

Although, it’s not necessarily new budget. We see budgets shifting from security and networking to this platform team, for example, as cloud spending increases and less is spent on network hardware. They often work with the security team and with the network ops team to get buy-in, but they actually own a pretty substantial share of the budget.

How do you see the Cloud Native Computing Foundation evolving, and will Kubernetes always be at the center of it — or of the cloud native movement overall?

Kubernetes is what sparked the CNCF, and in the first couple of years it was all about Kubernetes and public cloud. What we’ve seen since about a year ago is that it’s now no longer just about Kubernetes, it’s actually more about cloud native principles. This actually means it’s not necessarily cloud anymore either, not even private cloud. It’s often even traditional enterprise networking, boring on-prem infrastructure, bare-metal servers, and all of that, but with the cloud native principles built in.

The new norm is now hybrid and includes multiple public cloud providers, as well as on-premises infrastructure. Companies want to provide the same application developer agility, or provide observability with modern cloud native tools, or do security with modern cloud native tools — for example, authentication, instead of just segmentation or identity-based enforcement — all those new cloud native concepts on existing infrastructure.

About a decade into it, the cloud native space doesn’t seem to be a fad. How much room is there for it to continue evolving?

There was definitely a time when it was like, “Oh, Kubernetes is probably short-lived, and serverless is going to be the next layer.” Or, “Kubernetes is similar to OpenStack.” Or, “It will disappear and it’s going to be an implementation detail.” And that has not happened.

I think there’s definitely a space for serverless, in particular for very rapid application development. But in enterprises, we are seeing cloud native as the new layer on top of virtualization, and we believe it will have a shelf life similar to that of virtualization. Which means we’re at the very beginning of the cloud native migration.

What big problems still need to be solved at the infrastructure level?

We’re seeing enterprises in a situation where, all of a sudden, whether they want it or not, they need a multi-cloud strategy. Because they also have on-premises infrastructure, they now need a hybrid cloud strategy on top of that. And they need to figure out how to do security and other functions universally across this infrastructure without locking themselves into a particular public cloud.

So this is the next big challenge: Who’s going to be that agnostic layer for multi-cloud and cloud native, like what VMware became? Who’s going to be the VMware for cloud native?

And although cloud native adoption might have been relatively easy for the modern web companies who were early adopters, the challenge from your perspective is building new technologies that bridge the gap between this modern world and existing enterprise tools and systems?

The hard part is that modern app teams are used to having the infrastructure layer evolve as quickly as they do. And this forced the infrastructure layer to be even more programmable, more adjustable. That’s why we actually see a networking layer and a security layer on top of the cloud networking layer. But now we have enterprises coming in, and we’re seeing a very strong demand to still connect to the old world and talk MPLS, VLAN, sFlow, and NetFlow: the whole existing set of enterprise requirements. None of them have gone away, and all the compliance rules are still the same. And even some of the modern SaaS companies now face these challenges as they grow bigger and start to care about compliance and so on.

From a technology perspective, it’s about how to connect that new cloud native world to the existing enterprise requirements. Because a lot of these problems were hidden by the public cloud providers. Public cloud providers solved the compliance problems, but they did not open source or publish any of that; they solved that on their own. It’s part of the cloud value. Enterprises now need to rebuild and buy that if they don’t want to lock themselves into the public cloud offerings.

Where do you see the next wave of cloud native innovation coming from? Does it still come from a company like Google, or is there a new type of company leading the charge?

It’s very interesting. I would say it’s probably not coming from the Googles and the Facebooks. The source of innovation will be open source, and the customers and users driving the demand will be companies one level down from the hyperscalers: already sizable companies that are still highly disruptive, like Adobe, Shopify, or GitHub. But also companies at risk of being disrupted by technology, like financial services, insurance providers, and telcos. These companies all have a shared interest in standardizing infrastructure with repeatable development and infrastructure models.

Posted July 26, 2022

