In-Kernel or Not: Hyper-Converged Storage Services Hokey Pokey

To Kernel or Not To Kernel…are we putting services in Kernel, or leaving them out…This has been a hot topic in the Hyper-Converged storage market.

I’d like to take a few minutes to relay some great thoughts that Nutanix CEO Dheeraj Pandey posted on another blog’s comment section a while back. Reading through it now, I firmly believe it needs it’s own proper post. This content is from Dheeraj, with small edits for blog readability.

The whole management argument of integration is being broken apart. Had that been true, Oracle apps would have continued to rule, and people would never have given Salesforce, Workday, ServiceNow, and others a chance. And this has been true for decades. Oracle won the DB war against IBM, even though IBM was a tightly integrated stack, top-to-bottom. After a certain point, even consumers started telling Facebook that their kitchen-sink app is not working, which is why FB started breaking apart that experience into something cleaner, usable, and user-experience-driven.

These are the biggest advantages of running above the kernel:
Fault isolation: If storage has a bug, it won’t take compute down with it. If you want to quickly upgrade storage, you don’t have to move VMs around. Converging compute and storage should not create a toxic blob of infrastructure; isolation is critical, even when sharing hardware. That is what made virtualization and ESX such a beautiful paradigm.

Pace of Innovation: User-level code for storage has ruled for the last 2 decades for exactly this reason. It’s more maintainable, its more debuggable, and its faster-paced. Bugs don’t bring entire machines down. Exact reason why GFS, HDFS, OneFS, Oracle RDBMS, MySQL, and so on are built in user space. Moore’s Law has made user-kernel transitions cheap. Zero-copy buffers, epoll, and O_DIRECT IO, etc. makes user-kernel transitions seamless. Similarly, virtual switching and VT-x technologies in hypervisors make hypervisor-VM transitions seamless.

Extensibility and Ecosystem Integration: User-space code makes it more extensible and lends itself to a pluggable architecture. Imagine connecting to AWS S3, Azure, compression library, security key management code, etc. from the kernel. The ecosystem in user-space thrives, and storage should not lag behind.

Rolling Upgrades: Compute doesn’t blink when storage is undergoing a planned downtime.

Migration complexity (backward compatibility): It is extremely difficult to build next-generation distributed systems without using protobufs and HTTP for self-describing data format and RPC services. Imagine migrating 1PB of data if your extents are not self-describing. Imagine upgrading a 64-node cluster if your RPC services are not self-describing. Porting protobufs and HTTP in kernel is a nightmare, given the glibc and other user library dependencies.

Performance Isolation: Converging compute and storage doesn’t mean storage should run amuk with resources. Administrators must be able to bound the CPU, memory, and network resources given to storage. Without a sandbox abstraction, in-kernel code is a toxic blob. Users should be able to grow and shrink storage resources, keeping the rest of application and datacenter needs in mind. Performance profiles of storage could be very different even in a hyperconverged architecture because of application nuances, flash-heavy nodes, storage-heavy nodes, GPU-heavy, and so on.

Security Isolation: The trusted computing base of the hypervisor must be kept lean and mean. Heartbleed and ShellShock are the veritable tips of the iceberg. Kernels have to be trusted, not bloated. See T. Garfinkel, B. Pfaff, J. Chow, M., Rosenblum, and D. Boneh, “Terra: A virtual machine-based platform for trusted computing,” in Proceedings of the 19th ACM Symposium on Operating Systems Principles, pp. 193–206, 2003. Also see P. England, B. Lampson, J. Manferdelli, M. Peinado, B. Willman, “A Trusted Open Platform,” IEEE Computer, pp. 55–62, July 2003.

Storage is just a freakin’ app on the server. If we can run databases and ERP systems in a VM, there’s no reason why storage shouldn’t. And if we’re arguing for running storage inside the kernel, let’s port Oracle and SAP to run inside the hypervisor!

In the end, we’ve to make storage an intelligent service in the datacenter. For too long, it has been a byte-shuttler between the network and the disk. If it needs to be an active system, it needs {fault|performance|security} isolation, speed of innovation, and ecosystem integration.

One more thing: If it can run in a Linux VSA, it will run as a container in Docker as well. It’s future-proof.