The DRA plugin does that. It didn't actually work and only printed an error
message about NodeInfo not implementing klog.KMetata. That's not a compile-time
check due to limitations with Go generics and had been missed earlier.
This finishes the transition to the assume cache as source of truth for the
current set of claims.
The tests have to be adapted. It's not enough anymore to directly put objects
into the informer store because that doesn't change the assume cache
content. Instead, normal Create/Update calls and waiting for the cache update
are needed.
This enables connecting the event handler for ResourceClaim to the assume
cache, which addresses a theoretic race condition.
It may also be useful for implementing the autoscaler support, because now
the autoscaler can modify the content of the cache.
The claim parameter key didn't include the namespace of the claim. In the case
where two namespaces used the exact same parameter reference, the "too many
generated parameters" case got triggered incorrectly and lookup could have
returned an object from the wrong namespace.
Found while running the E2E tests in parallel:
message: 'running PreFilter plugin "DynamicResources": multiple generated claim
parameters for ConfigMap. dra-8794/parameters-3 found: [dra-4729/parameters-4
dra-7328/parameters-4 dra-8794/parameters-4 dra-3402/parameters-4 dra-6156/parameters-4
dra-1839/parameters-4 dra-7434/parameters-4 dra-6504/parameters-4]'
Any error result from PreBind was treated as a pod scheduling failure. This was
overlooked when moving blocking API calls in the DRA plugin into a PreBind
implementation, leading to:
E0604 15:45:50.980929 306340 schedule_one.go:1048] "Error scheduling pod; retrying" err="waiting for resource driver" pod="test/test-draqld28"
That's because DRA's PreBind does some updates in the apiserver, then returns
Pending to wait for the outcome.
The fix is to allow PreBind to return the same special status codes as other
extension points.
There's no reason for having the interface because there is only one
implementation. Makes the implementation of the test functions a bit
simpler (no casting). They are still stand-alone functions instead of methods
because they should not be considered part of the "normal" API.
This is now used by both the volumebinding and dynamicresources plugin, so
promoting it to a common helper package is better.
In terms of functionality, nothing was changed. Documentation got
updated (warns about storing locally modified objects, clarifies what the Get
parameters are). Code coverage should be a bit better than before (tested with
and without indexer, exercises event handlers, more error paths).
Checking for specific errors can now be done via errors.Is.
Clearing some irrelevant fields in objects caused a flaky data race alert
because in some cases, the objects were pointers into a shared cache. A better
solution is to treat the objects as read-only and ignore the irrelevant fields.
Coverage was checked with a cover profile. The biggest remaining gap is for
isSchedulableAfterClaimParametersChange and
isSchedulableAfterClassParametersChange which will get handled when refactoring
the
foreachPodResourceClaim (https://github.com/kubernetes/kubernetes/issues/123697).
The code was incorrectly checking for a controller, but only the boolean
is set for allocated claims. As a result, deallocation was requested from
a non-existent control plane controller.
While at it, let's also clear the driver name. It's not needed when the
claim is deallocated.
Without this, the scheduler was crashing in newClaimController() in
pkg/scheduler/framework/plugins/dynamicresources/structuredparameters.go
The code in newClaimController() assumes that the parameters are not nil.
Furthermore it assumes that there is at least one DriverRequest populated in
order to allocate any resources to a claim.
This PR adds logic to define default claim/class parameters that will allow
allocation to proceed even if an end user doesn't provide any class or claim
parameters themselves.
Signed-off-by: Kevin Klues <kklues@nvidia.com>
Storing a modified claim with allocation and the original resource version in
the assume cache was not reliable: if an update was received, it replaced the
modified claim and the resource that was reserved for the claim might have been
used for some other claim.
To fix this, the in-flight claims are now stored in the map instead of just a
boolean and the status stored there overrides whatever is in the assume cache.
Logging got extended to diagnose this problem better. It started to occur in
E2E tests after splitting the claim update so that first the finalizer is set
and then the status, because setting the finalizer triggered an update.