To update the right statuses, the controller must collect more information
about why a pod is being evicted. Updating the DeviceTaintRule statuses is then
handled by the same work queue as evicting pods.
Both operations already share the same client instance, and thus the same
client-side QPS limits and server-side throttling, so they might as well share
the same work queue. Deleting pods is not necessarily more important than
informing users, or vice versa, so there is no strong argument for separate
queues.
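A minimal sketch of how a single typed work queue could carry both kinds of
work; the item type, field names, and constructor below are illustrative, not
the controller's actual code:

    package example

    import (
        "k8s.io/client-go/util/workqueue"
    )

    // workKind distinguishes the two operations which share the queue.
    type workKind int

    const (
        evictPod workKind = iota
        updateRuleStatus
    )

    // workItem is a hypothetical queue item: either "evict this pod" or
    // "update the status of this DeviceTaintRule". It must stay comparable
    // because the work queue uses items as map keys.
    type workItem struct {
        kind workKind

        // Set for evictPod.
        podNamespace, podName string

        // Set for updateRuleStatus.
        ruleName string
    }

    func newQueue() workqueue.TypedRateLimitingInterface[workItem] {
        // One queue and one rate limiter, shared by both operations.
        return workqueue.NewTypedRateLimitingQueue(
            workqueue.DefaultTypedControllerRateLimiter[workItem](),
        )
    }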
While at it, the unit tests are switched to the same mock work queue as in
staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because
there is no time to add it properly to a staging repo, the implementation gets
copied.
This has been broken since 666a41c2ea when the label value became non-integer
encoded.
The chance of one controller revision hash label being int-parsable: (7/27)^8 = 0.00002041 = ~0
The chance of both being int-parsable: 0.00002041^2 = ~0
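As a back-of-the-envelope check of those numbers (assuming the label value is
drawn from a 27-character alphabet that contains 7 digits and is 8 characters
long; both are assumptions about the encoding, not stated above):

    package main

    import (
        "fmt"
        "math"
    )

    func main() {
        // A value parses as an integer only if every character is a digit.
        pOneDigit := 7.0 / 27.0
        pOneLabel := math.Pow(pOneDigit, 8)
        pBothLabels := pOneLabel * pOneLabel
        fmt.Printf("one label int-parsable:   %.8f\n", pOneLabel)   // ~0.00002041
        fmt.Printf("both labels int-parsable: %.2g\n", pBothLabels) // ~4.2e-10
    }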
Comparing hashes locks in differences in content, causing EqualRevision to
fail even when the semantic content is normalized to be equal.
The approach copied from node taint eviction was to fire off one goroutine per
pod at the intended time. This leads to a "thundering herd" problem: when a
single taint causes eviction of several pods and they all have no toleration
grace period, or the same one, then they all get deleted concurrently at the
same time.
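For illustration, a simplified sketch of that pattern; the tolerationGracePeriod
helper and the surrounding setup are made up:

    package example

    import (
        "context"
        "time"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // evictWithTimers sketches the old pattern: one timer-backed goroutine per
    // pod. When many pods share the same (or no) toleration grace period, all
    // of their Delete calls fire at the same time, with no concurrency limit.
    func evictWithTimers(ctx context.Context, client kubernetes.Interface, pods []*v1.Pod) {
        for _, pod := range pods {
            pod := pod
            delay := tolerationGracePeriod(pod) // hypothetical helper, often zero
            time.AfterFunc(delay, func() {
                _ = client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
            })
        }
    }

    // tolerationGracePeriod is a stand-in for looking up the pod's toleration.
    func tolerationGracePeriod(pod *v1.Pod) time.Duration {
        return 0
    }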
For node taint eviction that is limited by the number of pods per node, which
is typically ~100. In an integration test, that already led to problems with
watchers:
cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false
It also causes spikes in memory consumption (mostly the 2KB stack per goroutine
plus closure) with no upper limit.
Using a workqueue makes concurrency more deterministic because there is an
upper limit. In the integration test, 10 workers kept the watch active.
Another advantage is that failures to evict a pod get retried forever with
exponential backoff per affected pod. Previously, eviction was tried a few
times at a fixed rate and then the controller gave up. If the apiserver was
down long enough, pods didn't get evicted.
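A rough sketch of that structure, using client-go's rate-limiting work queue
with a bounded number of workers; the item type, helpers, and worker count are
illustrative, not the controller's actual code:

    package example

    import (
        "context"
        "time"

        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/util/workqueue"
    )

    // evictionQueue processes pod deletions with a fixed number of workers.
    // Failures are retried with per-item exponential backoff instead of giving
    // up after a fixed number of attempts.
    type evictionQueue struct {
        queue     workqueue.TypedRateLimitingInterface[types.NamespacedName]
        deletePod func(ctx context.Context, pod types.NamespacedName) error // hypothetical
    }

    func newEvictionQueue(deletePod func(ctx context.Context, pod types.NamespacedName) error) *evictionQueue {
        return &evictionQueue{
            queue:     workqueue.NewTypedRateLimitingQueue(workqueue.DefaultTypedControllerRateLimiter[types.NamespacedName]()),
            deletePod: deletePod,
        }
    }

    // schedule enqueues an eviction to happen after the toleration grace period.
    func (e *evictionQueue) schedule(pod types.NamespacedName, after time.Duration) {
        e.queue.AddAfter(pod, after)
    }

    // run starts a fixed number of workers; that number is the upper limit on
    // concurrent Delete calls.
    func (e *evictionQueue) run(ctx context.Context, workers int) {
        for i := 0; i < workers; i++ {
            go func() {
                for e.processNext(ctx) {
                }
            }()
        }
        <-ctx.Done()
        e.queue.ShutDown()
    }

    func (e *evictionQueue) processNext(ctx context.Context) bool {
        pod, shutdown := e.queue.Get()
        if shutdown {
            return false
        }
        defer e.queue.Done(pod)

        if err := e.deletePod(ctx, pod); err != nil {
            // Retry this pod forever, with exponential backoff.
            e.queue.AddRateLimited(pod)
            return true
        }
        e.queue.Forget(pod)
        return true
    }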
When a ResourceSlice no longer exists, the ResourceSlice tracker did not and
could not report the tainted devices, even if they are still allocated and in
use. The controller must keep track of DeviceTaintRules itself and handle this
scenario.
In this scenario it is impossible to evaluate CEL expressions because the
necessary device attributes aren't available. We could:
- Copy them in the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.
The last option is chosen.
The tracker is now no longer needed by the eviction controller. Reading
directly from the informer means that we cannot assume that pointers are
consistent. We have to track ResourceSlices by their name, not their pointer.
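A minimal sketch of tracking by name instead of by pointer; the type is
illustrative and resourceapi stands in for the ResourceSlice API package:

    package example

    import (
        "sync"

        resourceapi "k8s.io/api/resource/v1beta1"
    )

    // sliceStore keeps the most recently observed ResourceSlice objects keyed
    // by name. Pointer identity from the informer is never used as a key
    // because the same slice can be delivered as different pointers over time.
    type sliceStore struct {
        mutex  sync.Mutex
        slices map[string]*resourceapi.ResourceSlice
    }

    func newSliceStore() *sliceStore {
        return &sliceStore{slices: map[string]*resourceapi.ResourceSlice{}}
    }

    func (s *sliceStore) addOrUpdate(slice *resourceapi.ResourceSlice) {
        s.mutex.Lock()
        defer s.mutex.Unlock()
        s.slices[slice.Name] = slice
    }

    func (s *sliceStore) remove(name string) {
        s.mutex.Lock()
        defer s.mutex.Unlock()
        delete(s.slices, name)
    }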
The immediate benefit is that the time required for running the package's unit
test goes down from ~10 seconds (because of required real-world delays) to ~0.5
seconds (depending on the CPU performance of the host). It can also make
writing tests easier because after a `Wait` there is no need for locking before
accessing internal state (all background goroutines are known to be blocked
waiting for the main goroutine).
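A minimal sketch of that pattern, assuming Go's experimental testing/synctest
package (GOEXPERIMENT=synctest in Go 1.24); the test body and timings are made
up:

    //go:build goexperiment.synctest

    package example

    import (
        "testing"
        "testing/synctest"
        "time"
    )

    func TestEvictionWithFakeTime(t *testing.T) {
        synctest.Run(func() {
            evicted := make(chan struct{})

            // Stand-in for a controller goroutine which waits for a
            // toleration grace period before evicting.
            go func() {
                time.Sleep(10 * time.Second) // fake time, no real delay
                close(evicted)
            }()

            // Sleeping in the test advances the fake clock past the grace
            // period without any real-world delay.
            time.Sleep(11 * time.Second)

            // After Wait, all other goroutines in the bubble are durably
            // blocked or finished, so shared state can be checked without
            // additional locking.
            synctest.Wait()

            select {
            case <-evicted:
            default:
                t.Error("expected the eviction goroutine to have finished")
            }
        })
    }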
What somewhat ruins the perfect determinism is the polling for informer cache
syncs: that can take an unknown number of loop iterations. This could probably
be fixed by making the waiting block on channels (requires work in client-go).
The only change required in the implementation is to skip the sleep after
deleting a pod has failed for the last time in the loop (a useful, albeit
minor, improvement by itself): the test proceeds after having blocked that last
Delete call, and at that point synctest expects the background goroutine to
exit without further delay.
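A hypothetical sketch of that adjustment; attempt count, delay, and the
deletePod callback are made up:

    package example

    import (
        "context"
        "time"
    )

    // deleteWithRetries sketches the changed loop: the delay is only applied
    // when another attempt follows, so after the final failure the goroutine
    // returns immediately instead of sleeping one more time.
    func deleteWithRetries(ctx context.Context, deletePod func(context.Context) error, attempts int, delay time.Duration) error {
        var err error
        for i := 0; i < attempts; i++ {
            err = deletePod(ctx)
            if err == nil {
                return nil
            }
            if i+1 == attempts {
                // Last attempt failed: no point in sleeping before returning.
                break
            }
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(delay):
            }
        }
        return err
    }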