The reason for the previous behavior was to avoid the unnecessary performance
overhead that occurs when the caller already provides a "fresh" copy and
doesn't touch it afterwards.
But this is something that DRA driver developers can easily get wrong, so it's
better to be safe than sorry.
When deleting a bunch of slices, the delete events re-queue the pool while it
is being synced. The pool then gets synced again immediately, while the deleted
slices are still being removed from the informer cache. The obsolete slices in
the cache cause the controller to delete them again, which fails with a "not
found" error. That error is ignored, but it still causes extra API calls.
Now syncing gets delayed by a configurable duration (default: 30 seconds) so
that the informer cache is more likely to be up-to-date when the pool gets
synced again.
This test covers creating and deleting 100 large ResourceSlices. It is strict
about using the minimum number of API calls and also verifies that creating
large slices works.
This avoids the problem of creating an additional slice when the one from the
previous sync is not in the informer cache yet. It also avoids spurious
attempts to delete slices which were updated in the previous sync. Such
attempts would fail the ResourceVersion precondition check, but would
still cause work for the apiserver.
It's better to verify the UID and ResourceVersion of the ResourceSlice that we
want to delete. If anything changed, the decision to remove it might not apply
anymore and we need to check again.
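A sketch of such a guarded delete, assuming the resource.k8s.io/v1beta1 client
in a current client-go; the helper name and error handling are illustrative:

    package sketch

    import (
        "context"

        resourceapi "k8s.io/api/resource/v1beta1"
        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // deleteIfUnchanged deletes the slice only if it still has the UID and
    // ResourceVersion that were observed when the decision was made.
    func deleteIfUnchanged(ctx context.Context, client kubernetes.Interface, slice *resourceapi.ResourceSlice) error {
        err := client.ResourceV1beta1().ResourceSlices().Delete(ctx, slice.Name, metav1.DeleteOptions{
            Preconditions: &metav1.Preconditions{
                UID:             &slice.UID,
                ResourceVersion: &slice.ResourceVersion,
            },
        })
        if apierrors.IsConflict(err) || apierrors.IsNotFound(err) {
            // The slice changed or is already gone: drop the decision and
            // let the next sync look at the current state again.
            return nil
        }
        return err
    }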
The ResourceSlice controller (theoretically) might end up creating too many
slices if it syncs again before its informer cache has been updated. This could
cause the scheduler to allocate a device from a duplicated slice. The
duplicates should be identical, but it's still better to fail and wait until
the controller removes the redundant slice.
The driver determines what each slice is meant to look like. The controller
then ensures that only those slices exist. It reuses existing slices where the
set of devices, as identified by their names, is the same as in some desired
slice. Such slices get updated to match the desired state.
In other words, attributes and the order of devices can be changed by updating
an existing slice, but adding or removing a device is done by deleting and
re-creating slices.
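For illustration, one possible way to express that reuse rule is to compare
the sets of device names; the helper below is a sketch, not the controller's
actual code:

    package sketch

    import (
        resourceapi "k8s.io/api/resource/v1beta1"
        "k8s.io/apimachinery/pkg/util/sets"
    )

    // sameDevices reports whether an existing slice contains exactly the same
    // set of devices, identified by name, as a desired slice. Such a slice can
    // be updated in place; any other difference in the device set means
    // delete + re-create.
    func sameDevices(existing, desired *resourceapi.ResourceSlice) bool {
        names := func(slice *resourceapi.ResourceSlice) sets.Set[string] {
            result := sets.New[string]()
            for _, device := range slice.Spec.Devices {
                result.Insert(device.Name)
            }
            return result
        }
        return names(existing).Equal(names(desired))
    }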
Co-authored-by: googs1025 <googs1025@gmail.com>
The test update is partly based on
https://github.com/kubernetes/kubernetes/pull/127645.
[FG:InPlacePodVerticalScaling] Fixed the apiserver panic issue that occurred when adding a container during pod updates in the InPlacePodVerticalScaling scenario.
Introducing PDBs into preemption had disrupted the ordering of pods in the
victims list, which could lead to picking the wrong victim node, i.e. one with
a higher-priority pod on it.
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.
There is one (entirely theoretical!) problem with updating from 1.31: if a
claim was allocated in 1.31 with admin access, the status field was not set
because it didn't exist yet. If a driver now follows the current definition of
"unset = off", then it will not grant admin access even though it should. This
is theoretical because drivers are only starting to support admin access with
1.32, so there shouldn't be any claim where this problem could occur.
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
field gets cleared, and the same happens in the status. In both cases
there is one special scenario: if the field was already set and a claim
or claim template gets updated, the field is not cleared (see the sketch
after this list).
Also, allocating a claim with admin access is allowed regardless of the
feature gate and the field is not cleared then. In practice, the scheduler
will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
with the field set gets rejected. This prevents running workloads
which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
allocated. The effect is the same.
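The apiserver behavior follows the usual "drop disabled fields" pattern. A
rough sketch of the spec half, with illustrative function names (the status
half works the same way):

    package sketch

    import (
        resourceapi "k8s.io/api/resource/v1beta1"
    )

    // dropDisabledAdminAccess clears spec.devices.requests[*].adminAccess when
    // the DRAAdminAccess feature gate is off, unless the old object already
    // used the field (the "update" scenario described above).
    func dropDisabledAdminAccess(newClaim, oldClaim *resourceapi.ResourceClaim, gateEnabled bool) {
        if gateEnabled || adminAccessInUse(oldClaim) {
            return
        }
        for i := range newClaim.Spec.Devices.Requests {
            newClaim.Spec.Devices.Requests[i].AdminAccess = nil
        }
    }

    func adminAccessInUse(claim *resourceapi.ResourceClaim) bool {
        if claim == nil {
            return false
        }
        for _, request := range claim.Spec.Devices.Requests {
            if request.AdminAccess != nil {
                return true
            }
        }
        return false
    }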
The alternative would have been to ignore the fields in the claim controller
and the scheduler. This is bad because a monitoring workload would then run and
block resources that probably were meant for production workloads.
Drivers need to know that because admin access may also grant additional
permissions. The allocator needs to ignore such results when determining which
devices are considered allocated.
In both cases it is conceptually cleaner to not rely on the content of the
ClaimSpec.
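To illustrate the allocator side of this, a simplified sketch assuming the
v1beta1 types; the device key format is illustrative:

    package sketch

    import (
        resourceapi "k8s.io/api/resource/v1beta1"
        "k8s.io/apimachinery/pkg/util/sets"
    )

    // allocatedDevices returns the devices that a claim's allocation occupies.
    // Results marked with AdminAccess only grant monitoring/administrative
    // access and therefore do not count as allocated.
    func allocatedDevices(claim *resourceapi.ResourceClaim) sets.Set[string] {
        devices := sets.New[string]()
        if claim.Status.Allocation == nil {
            return devices
        }
        for _, result := range claim.Status.Allocation.Devices.Results {
            if result.AdminAccess != nil && *result.AdminAccess {
                continue
            }
            devices.Insert(result.Driver + "/" + result.Pool + "/" + result.Device)
        }
        return devices
    }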