kubelet: document seamless upgrade support and guidance

This tries to capture the current state of affairs and a potential plan for
supporting seamless upgrades better.
Patrick Ohly 2025-01-27 11:20:01 +01:00
parent 5a6ace2aa0
commit 0490b9f0b7

@@ -16,6 +16,86 @@ there.
This socket filename should not start with a '.' as it will be ignored.
To avoid conflicts between different plugins, the recommendation is to use
`<plugin name>[-<some optional string>].sock` as the filename. `<plugin name>`
should end with a DNS domain that is unique for the plugin. Each time a plugin
starts, it has to delete old sockets if they exist and listen anew under the
same filename.

## Seamless Upgrade

To avoid downtime of a plugin on a node, it would be nice to support running
the old instance of a plugin in parallel to the new one. When the plugin is
deployed with a DaemonSet, setting `maxSurge` to a value larger than zero
enables such a seamless upgrade.

**Warning**: Such a seamless upgrade **is not** supported at the moment. This
section merely describes what would have to be changed to make it work.
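
Once the pieces described below are in place, enabling such a surge in a
DaemonSet would look roughly like this (a minimal sketch; all names and the
image are placeholders):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-plugin            # placeholder name
spec:
  selector:
    matchLabels:
      app: example-plugin
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0           # must be zero when maxSurge is non-zero
      maxSurge: 1                 # keep the old instance running while the new one starts
  template:
    metadata:
      labels:
        app: example-plugin
    spec:
      containers:
      - name: plugin
        image: example.com/plugin:v2   # placeholder image
```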

### In a plugin

To support seamless upgrades, each plugin instance must use a unique socket
filename. Otherwise the following could happen:

- The old instance is registered with `plugin.example.com-reg.sock`.
- The new instance starts, unlinks that file, and starts listening on it again.
- In parallel, the kubelet notices the removal and unregisters the plugin
  before probing the new instance, thus breaking the seamless upgrade.

Even if the timing is more favorable and unregistration is avoided, using the
same socket is problematic: if the new instance fails, the kubelet cannot fall
back to the old instance because that old instance is no longer listening on
the socket that is now available under `plugin.example.com-reg.sock`.

Such a unique filename can be achieved in a DaemonSet by passing the UID of the
pod into the pod through the downward API and embedding it in the socket name.
New instances may try to clean up stale sockets of older instances, but they
have to be absolutely sure that those sockets really aren't in use anymore.
Each instance should catch termination signals and clean up after itself; then
sockets only leak during abnormal events (power loss, killing with SIGKILL).
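A pod template fragment along the following lines could provide such a
per-instance name (a sketch; the `--registration-socket` flag is a hypothetical
plugin option used only for illustration, and the mount path inside the
container is arbitrary):

```yaml
spec:
  template:
    spec:
      containers:
      - name: plugin
        image: example.com/plugin:v2   # placeholder image
        env:
        - name: POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        args:
        # Kubernetes expands $(POD_UID) because POD_UID is defined in env above.
        - --registration-socket=/registration/plugin.example.com-$(POD_UID)-reg.sock
        volumeMounts:
        - name: registration-dir
          mountPath: /registration
      volumes:
      - name: registration-dir
        hostPath:
          path: /var/lib/kubelet/plugins_registry   # default kubelet registration directory
          type: Directory
```
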
Last but not least, both plugin instances must be usable in parallel. It is not
predictable which instance the kubelet will use for which request.

### In the kubelet

For such a seamless upgrade with different sockets per plugin instance to work
reliably, the handler for the plugin type must track all registered instances.
Then if one of them fails and gets unregistered, it can fall back to another
one. Picking the most recently registered instance is a good heuristic. This
isn't perfect because after a kubelet restart, plugin instances get registered
in a random order, so the registration order no longer reflects which instance
is newer, but restarting the kubelet in the middle of an upgrade should be
rare.

At the moment, none of the existing handlers support such seamless upgrades:

- The device plugin handler suffers from temporarily removing the extended
  resources during an upgrade. A proposed fix is pending in
  https://github.com/kubernetes/kubernetes/pull/127821.
- The CSI handler [tries to determine which instance is newer](https://github.com/kubernetes/kubernetes/blob/7140b4910c6c1179c9778a7f3bb8037356febd58/pkg/volume/csi/csi_plugin.go#L115-L125)
  based on the supported version(s) and only remembers that one. If that newest
  instance fails, there is no fallback. In practice, most CSI drivers probably
  pass [the hard-coded "1.0.0"](https://github.com/kubernetes-csi/node-driver-registrar/blob/27700e2962cd35b9f2336a156146181e5c75399e/cmd/csi-node-driver-registrar/main.go#L72)
  from the csi-node-driver-registrar as the supported version, so this version
  selection mechanism isn't used at all.
- The DRA handler only remembers the most recently registered instance.

### Deployment

Deploying a plugin with support for seamless upgrades and per-instance socket
filenames is *not* compatible with a kubelet version that does not support
seamless upgrades yet. It breaks like this:

- The new instance starts, gets registered, and replaces the old one.
- The old instance stops, removing its socket.
- The kubelet notices that and unregisters the plugin.
- The plugin handler removes *the new* instance because it ignores the socket
  path -> no instance is left.

Plugin authors either have to assume that the cluster runs a recent enough
kubelet or rely on nodes being labeled to indicate support. Then the plugin can
use one simple DaemonSet for nodes without support and another, more complex
one where `maxSurge` is increased to enable seamless upgrades on the nodes
which support them, as sketched below. No such label is specified at the
moment.
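
A sketch of what that split could look like, with a purely illustrative label
name (`example.com/seamless-plugin-upgrade` is made up; again, no such label
exists today):

```yaml
# DaemonSet variant for nodes whose kubelet supports seamless upgrades:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      nodeSelector:
        example.com/seamless-plugin-upgrade: "true"
---
# DaemonSet variant for all other nodes (no surge, shared socket filename):
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example.com/seamless-plugin-upgrade
                operator: DoesNotExist
```

Both fragments omit the fields shown in the earlier examples; only the
scheduling and update strategy differences are shown.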

## gRPC Service Lifecycle