Defer the initialization of label value allow lists until the first invocation of `WithLabelValues` or `With`. This fixes an issue where metrics initialized before the flag was applied did not honor the label value allow list.
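
A minimal sketch of the deferred-initialization idea, using `sync.Once`. All names here (`CounterVec`, `SetAllowList`, `Inc`, `Count`, the `"unexpected"` placeholder) are hypothetical stand-ins for illustration, not the actual metrics API:

```go
package main

import (
	"fmt"
	"sync"
)

// allowListSource stands in for configuration that is only populated once
// flags have been applied (hypothetical, not the real metrics package).
var (
	allowListMu     sync.RWMutex
	allowListSource = map[string][]string{}
)

// SetAllowList records the allowed values for "metric,label".
func SetAllowList(key string, values []string) {
	allowListMu.Lock()
	defer allowListMu.Unlock()
	allowListSource[key] = values
}

// CounterVec is a toy counter vector with a single label.
type CounterVec struct {
	name, label string
	initOnce    sync.Once
	allowed     map[string]struct{} // nil means "no restriction"
	mu          sync.Mutex
	counts      map[string]int
}

func NewCounterVec(name, label string) *CounterVec {
	return &CounterVec{name: name, label: label, counts: map[string]int{}}
}

// initAllowList reads the allow list; it runs at most once, on first use.
func (c *CounterVec) initAllowList() {
	allowListMu.RLock()
	defer allowListMu.RUnlock()
	values, ok := allowListSource[c.name+","+c.label]
	if !ok {
		return
	}
	c.allowed = make(map[string]struct{}, len(values))
	for _, v := range values {
		c.allowed[v] = struct{}{}
	}
}

// Inc increments the counter for value, mapping disallowed values to a
// placeholder. The allow list is resolved here, not in the constructor.
func (c *CounterVec) Inc(value string) {
	c.initOnce.Do(c.initAllowList)
	if c.allowed != nil {
		if _, ok := c.allowed[value]; !ok {
			value = "unexpected"
		}
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[value]++
}

// Count returns the current count for a label value.
func (c *CounterVec) Count(value string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[value]
}

func main() {
	requests := NewCounterVec("requests_total", "code")         // created before the "flag"
	SetAllowList("requests_total,code", []string{"200", "500"}) // "flag" applied later
	requests.Inc("200")
	requests.Inc("418")
	fmt.Println(requests.Count("200"), requests.Count("unexpected"), requests.Count("418"))
}
```

Because the lookup happens inside `Inc` rather than in `NewCounterVec`, the order of construction and flag handling no longer matters in this sketch.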
As a quick fix for a flake, bceec5a3ff
introduced polling with wait.Poll in all callers of CheckDaemonStatus.
This commit reverts all callers to what they were before (CheckDaemonStatus +
ExpectNoError) and implements polling according to E2E best practices
(https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/writing-good-e2e-tests.md#polling-and-timeouts):
- no logging while polling
- support for progress reporting while polling
- last but not least, produce an informative failure message in case of a
timeout, including a dump of the daemon set as YAML
adds exemplar support for counters
* utilizes Prometheus' underlying exemplar machinery
* introduces contextual counters (which were a no-op till now)
* adds testcases
addresses (a part of): #119697
Refactor Healthz with Metrics Address for internal configuration of
kube-proxy adhering to the v1alpha2 version specifications as detailed
in https://kep.k8s.io/784.
Signed-off-by: Daman Arora <aroradaman@gmail.com>
Comment fix and removal of `skippedUnknownDevice`. That field was originally
meant to somehow influence how a failure to allocate gets reported, but in the
end that distinction was not implemented.
The internal "stop allocation" error was sometimes erroneously (pun intended)
returned as the result of the allocation and then shown to users. Instead, neither
an error nor a result should have been returned, which is then shown as "allocation
not possible".
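
A rough sketch of the intended behavior, assuming a sentinel error that is used only for internal control flow. `errStop`, `search` and `Allocate` are hypothetical names, not the allocator's real identifiers:

```go
package main

import (
	"errors"
	"fmt"
)

// errStop is an internal sentinel: it ends the search early and must never
// reach the caller.
var errStop = errors.New("stop allocation")

// search is a hypothetical inner routine that gives up with errStop as soon
// as a wanted device is not free.
func search(free map[string]bool, wanted []string) ([]string, error) {
	var picked []string
	for _, name := range wanted {
		if !free[name] {
			return nil, errStop
		}
		picked = append(picked, name)
	}
	return picked, nil
}

// Allocate translates the sentinel into "no result, no error", which the
// caller can then report as "allocation not possible".
func Allocate(free map[string]bool, wanted []string) ([]string, error) {
	picked, err := search(free, wanted)
	if errors.Is(err, errStop) {
		return nil, nil
	}
	if err != nil {
		return nil, fmt.Errorf("allocate: %w", err)
	}
	return picked, nil
}

func main() {
	free := map[string]bool{"gpu-0": true}
	fmt.Println(Allocate(free, []string{"gpu-0"}))
	fmt.Println(Allocate(free, []string{"gpu-0", "gpu-1"}))
}
```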
Aborting allocation early is only correct if the device which cannot be
allocated is one of the "all" devices which were requested.
The code which pre-determined the set of "all" devices when using
"allocationMode: all" accidentally ignored the selector of the device class.
As a result, allocation worked correctly only when a node had only devices
matching the intended device class. When there were additional devices, things
went wrong:
- Unrelated devices allocated for a request.
- Claim allocation failed completely.
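
To illustrate the fix, here is a simplified sketch in which the set of "all" devices is the subset that matches both the device class selectors and the request selectors. `Device`, `Selector` and `allDevices` are hypothetical, and plain Go predicates stand in for whatever selector mechanism the allocator actually uses:

```go
package main

import "fmt"

// Device is a hypothetical, minimal device description.
type Device struct {
	Name, Driver string
}

// Selector reports whether a device is acceptable.
type Selector func(Device) bool

func matchesAll(d Device, selectors []Selector) bool {
	for _, sel := range selectors {
		if !sel(d) {
			return false
		}
	}
	return true
}

// allDevices returns every device on the node that satisfies the selectors
// of the device class and of the request. Skipping the class selectors would
// pull in unrelated devices.
func allDevices(pool []Device, classSelectors, requestSelectors []Selector) []Device {
	var out []Device
	for _, d := range pool {
		if matchesAll(d, classSelectors) && matchesAll(d, requestSelectors) {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	pool := []Device{
		{Name: "gpu-0", Driver: "gpu.example.com"},
		{Name: "gpu-1", Driver: "gpu.example.com"},
		{Name: "nic-0", Driver: "nic.example.com"},
	}
	gpuClass := []Selector{func(d Device) bool { return d.Driver == "gpu.example.com" }}
	for _, d := range allDevices(pool, gpuClass, nil) {
		fmt.Println(d.Name)
	}
}
```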
The util for checking the DaemonSet status checked only once whether the Status of
the DaemonSet reported that all the desired Pods were scheduled and
ready.
However, the pattern used in the e2e tests for this function did not
take into consideration that the controller needs to propagate the Pod
status to the DaemonSet status, and asserted on the condition only
once after waiting for all the Pods to be ready.
To avoid more code churn, change the CheckDaemonStatus
signature to the wait.Condition type and use it in an async poll loop in
the tests.
The RPC call usually does not take much time for containerd or CRI-O. We
now assume the default timeout is fine and therefore resolve the `TODO`.
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>