kubelet: document seamless upgrade support and guidance

This tries to capture the current state of affairs and a potential plan for
supporting seamless upgrades better.
Patrick Ohly 2025-01-27 11:20:01 +01:00
parent 5a6ace2aa0
commit 0490b9f0b7

@@ -16,6 +16,86 @@ there.
This socket filename should not start with a '.' as it will be ignored.
To avoid conflicts between different plugins, the recommendation is to use
`<plugin name>[-<some optional string>].sock` as the filename. `<plugin name>`
should end with a DNS domain that is unique for the plugin. Each time a plugin
starts, it has to delete old sockets if they exist and listen anew under the
same filename.

## Seamless Upgrade

To avoid downtime of a plugin on a node, it would be nice to support running
the old instance of a plugin in parallel to the new one. When the plugin is
deployed with a DaemonSet, setting `maxSurge` to a value larger than zero
enables such a seamless upgrade.

**Warning**: Such a seamless upgrade **is not** supported at the moment. This
section merely describes what would have to be changed to make it work.
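
Once the pieces described below are in place, enabling such a surge in a
DaemonSet would look roughly like this (a minimal sketch; all names and the
image are placeholders):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-plugin            # placeholder name
spec:
  selector:
    matchLabels:
      app: example-plugin
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0           # must be zero when maxSurge is non-zero
      maxSurge: 1                 # keep the old instance running while the new one starts
  template:
    metadata:
      labels:
        app: example-plugin
    spec:
      containers:
      - name: plugin
        image: example.com/plugin:v2   # placeholder image
```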

### In a plugin

To support seamless upgrades, each plugin instance must use a unique socket
filename. Otherwise the following could happen:

- The old instance is registered with `plugin.example.com-reg.sock`.
- The new instance starts, unlinks that file, and starts listening on it again.
- In parallel, the kubelet notices the removal and unregisters the plugin
  before probing the new instance, thus breaking the seamless upgrade.

Even if the timing is more favorable and unregistration is avoided, using the
same socket is problematic: if the new instance fails, the kubelet cannot fall
back to the old instance because that old instance is no longer listening on
the socket that is now available under `plugin.example.com-reg.sock`.

Such a unique filename can be achieved in a DaemonSet by passing the UID of the
pod into the pod through the downward API and embedding it in the socket name.
New instances may try to clean up stale sockets of older instances, but they
have to be absolutely sure that those sockets really aren't in use anymore.
Each instance should catch termination signals and clean up after itself; then
sockets only leak during abnormal events (power loss, killing with SIGKILL).
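A pod template fragment along the following lines could provide such a
per-instance name (a sketch; the `--registration-socket` flag is a hypothetical
plugin option used only for illustration, and the mount path inside the
container is arbitrary):

```yaml
spec:
  template:
    spec:
      containers:
      - name: plugin
        image: example.com/plugin:v2   # placeholder image
        env:
        - name: POD_UID
          valueFrom:
            fieldRef:
              fieldPath: metadata.uid
        args:
        # Kubernetes expands $(POD_UID) because POD_UID is defined in env above.
        - --registration-socket=/registration/plugin.example.com-$(POD_UID)-reg.sock
        volumeMounts:
        - name: registration-dir
          mountPath: /registration
      volumes:
      - name: registration-dir
        hostPath:
          path: /var/lib/kubelet/plugins_registry   # default kubelet registration directory
          type: Directory
```
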
Last but not least, both plugin instances must be usable in parallel. It is not
predictable which instance the kubelet will use for which request.

### In the kubelet

For such a seamless upgrade with different sockets per plugin instance to work
reliably, the handler for the plugin type must track all registered instances.
Then if one of them fails and gets unregistered, it can fall back to another
one. Picking the most recently registered instance is a good heuristic. This
isn't perfect because after a kubelet restart, plugin instances get registered
in a random order, so the registration order no longer reflects which instance
is newer, but restarting the kubelet in the middle of an upgrade should be
rare.

At the moment, none of the existing handlers support such seamless upgrades:

- The device plugin handler suffers from temporarily removing the extended
  resources during an upgrade. A proposed fix is pending in
  https://github.com/kubernetes/kubernetes/pull/127821.
- The CSI handler [tries to determine which instance is newer](https://github.com/kubernetes/kubernetes/blob/7140b4910c6c1179c9778a7f3bb8037356febd58/pkg/volume/csi/csi_plugin.go#L115-L125)
  based on the supported version(s) and only remembers that one. If that newest
  instance fails, there is no fallback. In practice, most CSI drivers probably
  pass [the hard-coded "1.0.0"](https://github.com/kubernetes-csi/node-driver-registrar/blob/27700e2962cd35b9f2336a156146181e5c75399e/cmd/csi-node-driver-registrar/main.go#L72)
  from the csi-node-driver-registrar as the supported version, so this version
  selection mechanism isn't used at all.
- The DRA handler only remembers the most recently registered instance.

### Deployment

Deploying a plugin with support for seamless upgrades and per-instance socket
filenames is *not* compatible with a kubelet version that does not support
seamless upgrades yet. It breaks like this:

- The new instance starts, gets registered, and replaces the old one.
- The old instance stops, removing its socket.
- The kubelet notices that and unregisters the plugin.
- The plugin handler removes *the new* instance because it ignores the socket
  path -> no instance is left.

Plugin authors either have to assume that the cluster runs a recent enough
kubelet or rely on nodes being labeled to indicate support. Then the plugin can
use one simple DaemonSet for nodes without support and another, more complex
one where `maxSurge` is increased to enable seamless upgrades on the nodes
which support them, as sketched below. No such label is specified at the
moment.
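
A sketch of what that split could look like, with a purely illustrative label
name (`example.com/seamless-plugin-upgrade` is made up; again, no such label
exists today):

```yaml
# DaemonSet variant for nodes whose kubelet supports seamless upgrades:
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      nodeSelector:
        example.com/seamless-plugin-upgrade: "true"
---
# DaemonSet variant for all other nodes (no surge, shared socket filename):
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example.com/seamless-plugin-upgrade
                operator: DoesNotExist
```

Both fragments omit the fields shown in the earlier examples; only the
scheduling and update strategy differences are shown.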

## gRPC Service Lifecycle