Kubernetes at 10: The long road to mastery of persistent storage

Kubernetes is 10! Mid-2024 sees the 10th birthday of the market-leading container orchestration platform.

Jan Safranek, principal software engineer at Red Hat, was there during the flurry of activity that saw Docker and then Kubernetes explode onto the scene and businesses and developers grapple with how to provide it with persistent storage.

Safranek was there when PetSets and StatefulSets came along and helped tackle this issue by managing the deployment and scaling of container pods. Then Safranek was involved in the development of CSI (Container Storage Interface) drivers and Operators that provided software extensions to manage applications and their components.

We mark the first decade of Kubernetes with a series of interviews with engineers who helped develop Kubernetes and tackle challenges in storage and data protection – including the use of Kubernetes Operators – as we look forward to a future characterised by artificial intelligence (AI) workloads.

What was the market like when Kubernetes first launched?

Jan Safranek: The world was dominated by bare metal or heavy virtual machines. Docker containers got popular very quickly, but nobody knew how to manage them because there were no tools around. Kubernetes brought a whole new concept of running applications in lightweight and isolated chunks.

How did you get involved in work on the data infrastructure around Kubernetes?

Safranek: It was easy for me. I was working on Linux storage management tools here at Red Hat and I saw a new team being established to help with Kubernetes. I did not know anything about it, but it looked cool so I applied there.

How did you realise Kubernetes was in the leading position in the market?

Safranek: It was at the first KubeCon that I attended in Seattle in 2016. Before that, I met fellow engineers that were doing interesting things. But there I met real businesses that ran important parts of their infrastructure on Kubernetes.

When you looked at Kubernetes, how did you approach data and storage?

Safranek: When I joined Kubernetes, people already realised that even though containers were ephemeral in nature, there had to be something persistent in the bigger picture. The basic APIs [application programming interfaces] were already there, even with the first volume plugins, but nobody really knew how to use them. PetSets arrived, and were later renamed StatefulSets, but even with that it can still be challenging to run heavy data-driven apps on Kubernetes.

What issues first came up around data and storage with Kubernetes for you?

Safranek: I started poking around what already existed in Kubernetes and how to use it. I added a few examples and end-to-end tests to get familiar with the code and the overall process. It was so easy at that time. I remember it was moving forward so quickly that it was very hard to keep my pull requests up to date.

The first real issues we needed to solve were how to run a stateful application, which was solved by PetSets/StatefulSets, and how to consume data from storage systems outside of Kubernetes. We first started with in-tree code for cloud-based storage and a few generic plugins for traditional storage such as NFS and iSCSI.

What had to change?

Safranek: With greater adoption, we quickly realised we needed more robust code. Our initial controllers were very fragile, so we had to rewrite everything to provide stable and consistent behaviour. There is still room for improvement there, although all of them have survived for years with only minor bugfixes.

In addition, as more and more storage vendors wanted to integrate their storage back-ends with Kubernetes, we learned that we needed a generic extension interface for volume plugins. First, we came up with FlexVolumes, which were very cumbersome to use, but we learned and created CSI, which is the main storage interface of Kubernetes today. It has become immensely successful, and there are at least 130 drivers that have voluntarily listed themselves and who knows how many more that we don’t know about.

And, as Kubernetes got adopted and started running critical infrastructure, we needed to make sure that we didn’t break it. Now it’s much harder to introduce a new feature or change an existing one, as there are countless reviews and approvals needed to ensure stability of our releases.

How did you get involved around Kubernetes Operators?

Safranek: Red Hat was one of the first adopters of Operators. The beginnings were quite lively. I personally started with a few bad ones, but quickly I learned how to write good Operators. It all comes with practice, as with anything in software development. Also, there is more documentation today than ever before.

What happened around Operators that made them a success for data and storage?

Safranek: I have mixed experience here. Many Operators are no better than Helm charts [which describe a related set of Kubernetes resources]. The good ones, however, helped businesses ease their pain with applications that need persistent data. It’s still hard to run a stateful app correctly, with all the corner cases covered.

How did this support more cloud-native approaches? What were the consequences?

Safranek: As I mentioned, it’s easier for devops to rely on a third-party Operator to run their database or other stateful workload while they focus on putting these pieces together into a great application that runs their business.

Kubernetes is now 10. How do you think about it today?

Safranek: Well, it has been a ride. From the beginnings, where anyone could rewrite anything, to rock-stable software that keeps this world running, or at least some very important parts of it, to the point where it turned into a boring piece of infrastructure.

What problems still exist around Kubernetes when it comes to data and storage?

Safranek: All containers are ephemeral in nature and could be short-lived. Kubernetes can try to run them for a long time, but when it needs to delete them, it will. Kubernetes offers some APIs, such as PodDisruptionBudget, to keep the absolutely necessary number of containers running, but all stateful applications must count on some disruption. This is a new concept and it’s still hard to handle it correctly.
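
To illustrate the PodDisruptionBudget idea Safranek mentions, here is a minimal Python sketch. It is not from the interview and it is not Kubernetes code. It assumes a budget expressed with `minAvailable`, and the names `example-db-pdb` and `app: example-db` are hypothetical. The helper functions are a simplified model of how such a budget limits voluntary evictions; the real controller handles more cases.

```python
import math

# A PodDisruptionBudget manifest expressed as a Python dict.
# The metadata name and label values are hypothetical placeholders.
pdb_manifest = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "example-db-pdb"},
    "spec": {
        "minAvailable": 2,
        "selector": {"matchLabels": {"app": "example-db"}},
    },
}


def desired_healthy(min_available, expected_pods):
    """Resolve minAvailable (an int or a string such as "50%") to a pod count.

    Percentages are rounded up, so 50% of 7 pods means 4 must stay healthy.
    """
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        return math.ceil(expected_pods * percent / 100)
    return int(min_available)


def disruptions_allowed(min_available, expected_pods, healthy_pods):
    """Simplified model: how many pods may be evicted voluntarily right now."""
    needed = desired_healthy(min_available, expected_pods)
    return max(0, healthy_pods - needed)


if __name__ == "__main__":
    budget = pdb_manifest["spec"]["minAvailable"]
    # Three replicas, all healthy: one may be evicted at a time.
    print(disruptions_allowed(budget, expected_pods=3, healthy_pods=3))
    # Only one replica healthy: no voluntary eviction is permitted.
    print(disruptions_allowed(budget, expected_pods=3, healthy_pods=1))
```

A budget like this only limits voluntary evictions, such as draining a node. It cannot prevent involuntary failures, which is why stateful applications still have to count on some disruption.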

Any other anecdotes or information to share?

Safranek: The best thing about working on Kubernetes is that I’ve met very smart people, learned a lot, and I am still learning.
