If there was an unexpected status, the code extracting the expected error
message crashed with a panic. This has happened once so far, for unknown
reasons: because of the panic, the unexpected status didn't get logged.
When the new RollingUpdate option is used, the DRA driver gets deployed such
that it uses unique socket paths and uses file locking to serialize gRPC
calls. This enables the kubelet to pick arbitrarily between two concurrently
running instances. The handover is seamless (no downtime, no removal of ResourceSlices
by the kubelet).
For file locking, the fileutils package from etcd is used because that was
already a Kubernetes dependency. Unfortunately that package brings in some
additional indirect dependencies for DRA drivers (zap, multierr), but those
seem acceptable.
The key difference is that the kubelet must remember all plugin instances
because it could always happen that the new instance dies and leaves only the
old one running.
The endpoints of each instance must be different. Registering a plugin with the
same endpoint as some other instance is not supported and triggers an error,
which should get reported as "not registered" to the plugin. This should only
happen when the kubelet missed some unregistration event and re-registers the
same instance again. The recovery in this case is for the plugin to shut down,
remove its socket, which should get observed by kubelet, and then try again
after a restart.
When doing an update of a DaemonSet, first the old pod gets stopped and
then the new one is started. This caused the kubelet to remove all
ResourceSlices as soon as the old pod was gone and forced the new pod to
recreate all of them.
Now the kubelet waits 30 seconds before it deletes ResourceSlices. If a new
driver registers during that period, nothing is done at all. The new driver
finds the existing ResourceSlices and only needs to update them if something
changed.
The downside is that if the driver gets removed permanently, this creates a
delay where pods might still get scheduled to the node although the driver is
not going to run there anymore and thus the pods will be stuck.
While these tests already have the LinuxOnly tag, other tests have both
that and this e2eskipper line. Let's add it here as well, just in case.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Implement a new function, verifyAlphaFeatures, to ensure that alpha features cannot be enabled by default. Update the verifyOrUpdateFeatureList function to call this new verification. Add corresponding unit tests to validate the behavior of alpha feature handling.
Use 'claims.?email_verified.orValue(true) == true' in the example
validation rule. By explicitly comparing the value to true, we let
type-checking see that the result will be a boolean, and we make sure a
non-boolean email_verified claim will be caught at runtime.
Signed-off-by: Anish Ramasekar <anish.ramasekar@gmail.com>