CSI is used by both the kubelet and the kube-controller-manager. Both
components initialize the csiPlugin with different VolumeHost objects,
but the csiPlugin stores the node info manager in a global variable.
As a result, the kubelet can end up with the credentials of the
kube-controller-manager, which causes CSI to fail.
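For illustration, a minimal sketch of the problematic pattern with hypothetical
names (not the actual csiPlugin code): the node info manager lives in a
package-level variable, so the last initialization decides which VolumeHost,
and therefore whose credentials, it uses.

package main

import "fmt"

type volumeHost struct {
	name string // stands in for the component-specific credentials
}

type nodeInfoManager struct {
	host *volumeHost
}

// nim is the shared global; every call to initPlugin overwrites it.
var nim *nodeInfoManager

func initPlugin(host *volumeHost) {
	nim = &nodeInfoManager{host: host}
}

func main() {
	initPlugin(&volumeHost{name: "kubelet"})
	initPlugin(&volumeHost{name: "kube-controller-manager"})
	// Code that expected the kubelet's host now sees the other component's.
	fmt.Println(nim.host.name) // kube-controller-manager
}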
Automated cherry pick of #136529: test: Read /proc/net/nf_conntrack instead of using conntrack binary
#136554: test: Fix KubeProxy CLOSE_WAIT test for IPv6 environments (and where /proc/net/nf_conntrack may be missing)
- Use netutils.IsIPv6(ip) instead of manual nil/To4 check
- Remove unnecessary ip.To16() call since IPv6 is already 16 bytes
- Remove ipFamily from grep pattern since IP format ensures correctness
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The /proc/net/nf_conntrack file uses fully expanded IPv6 addresses
with leading zeros in each 16-bit group. For example:
fc00:f853:ccd:e793::3 -> fc00:f853:0ccd:e793:0000:0000:0000:0003
Add expandIPv6ForConntrack() helper function to expand IPv6 addresses
to the format used by /proc/net/nf_conntrack before using them in
the grep pattern.
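For illustration, a minimal standalone sketch of such a helper, assuming the
standard library's net package (the actual function in the e2e test may differ
in signature and error handling):

package main

import (
	"fmt"
	"net"
	"strings"
)

// expandIPv6ForConntrack writes out every 16-bit group of an IPv6 address
// with leading zeros, matching the format used by /proc/net/nf_conntrack,
// e.g. fc00:f853:ccd:e793::3 -> fc00:f853:0ccd:e793:0000:0000:0000:0003.
func expandIPv6ForConntrack(ip net.IP) string {
	ip16 := ip.To16()
	groups := make([]string, 0, 8)
	for i := 0; i < len(ip16); i += 2 {
		groups = append(groups, fmt.Sprintf("%02x%02x", ip16[i], ip16[i+1]))
	}
	return strings.Join(groups, ":")
}

func main() {
	ip := net.ParseIP("fc00:f853:ccd:e793::3")
	fmt.Println(expandIPv6ForConntrack(ip))
	// Output: fc00:f853:0ccd:e793:0000:0000:0000:0003
}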
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
The distroless-iptables image no longer includes the conntrack binary
as of v0.8.7 (removed in kubernetes/release#4223 since kube-proxy no
longer needs it after kubernetes#126847).
Update the KubeProxy CLOSE_WAIT timeout test to read /proc/net/nf_conntrack
directly instead of using the conntrack command. The file contains the
same connection tracking data and is accessible from the privileged
host-network pod.
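A rough sketch of reading that file directly, assuming a Go helper with access
to the host's /proc (the real e2e test runs its own command and grep pattern
inside the privileged pod):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findCloseWaitEntries scans /proc/net/nf_conntrack for TCP entries in
// CLOSE_WAIT that mention the (fully expanded) client IP, the same
// information the conntrack binary used to provide.
func findCloseWaitEntries(expandedIP string) ([]string, error) {
	f, err := os.Open("/proc/net/nf_conntrack")
	if err != nil {
		// The file may be absent if the conntrack modules are not loaded.
		return nil, err
	}
	defer f.Close()

	var matches []string
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "CLOSE_WAIT") && strings.Contains(line, expandedIP) {
			matches = append(matches, line)
		}
	}
	return matches, scanner.Err()
}

func main() {
	entries, err := findCloseWaitEntries("fc00:f853:0ccd:e793:0000:0000:0000:0003")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, e := range entries {
		fmt.Println(e)
	}
}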
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
- bump init backoff to Duration=30ms, Factor=8, Steps=6 to yield ~140s total (see the sketch after this list)
- prevent kubelet restarts when DNS is blackholed and NSS must fall back to myhostname
- keep CSI/CSINode initialization alive long enough to complete in ARO DNS-failure scenarios
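A minimal sketch of such a backoff using k8s.io/apimachinery's wait package;
the condition function and call site are placeholders, not the actual kubelet
code:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func tryInitialize() bool {
	// Placeholder for the real CSI/CSINode initialization attempt.
	return true
}

func main() {
	// Duration=30ms, Factor=8, Steps=6: sleeps of roughly 30ms, 240ms,
	// ~1.9s, ~15.4s and ~123s between the six attempts, about 140s in
	// total before giving up.
	backoff := wait.Backoff{
		Duration: 30 * time.Millisecond,
		Factor:   8,
		Steps:    6,
	}

	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		// Return true once initialization succeeds; keep retrying otherwise.
		return tryInitialize(), nil
	})
	fmt.Println("initialization finished:", err)
}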
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).
The underlying assumption that nothing bad happens in parallel does not always
hold. The following log snippet shows how an update of the assume
cache (which feeds the allocated devices tracker) and of the in-flight claims
can land such that GatherAllocatedState doesn't see the device in that claim
as allocated:
dra_manager.go:263: I0115 15:11:04.407714 18778] scheduler: Starting GatherAllocatedState
...
allocateddevices.go:189: I0115 15:11:04.407945 18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
dynamicresources.go:1150: I0115 15:11:04.407981 89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
dra_manager.go:201: I0115 15:11:04.408008 89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
dynamicresources.go:1157: I0115 15:11:04.408044 89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
  {
    "devices": {
      "results": [
        {
          "request": "req-1",
          "driver": "testdra-all-usesallresources-hvs5d.driver",
          "pool": "worker-5",
          "device": "worker-5-device-094"
        }
      ]
    },
    "nodeSelector": {
      "nodeSelectorTerms": [
        {
          "matchFields": [
            {
              "key": "metadata.name",
              "operator": "In",
              "values": [
                "worker-5"
              ]
            }
          ]
        }
      ]
    },
    "allocationTimestamp": "2026-01-15T14:11:04Z"
  }
>
dra_manager.go:280: I0115 15:11:04.408085 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
dra_manager.go:280: I0115 15:11:04.408137 18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
default_binder.go:69: I0115 15:11:04.408175 89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
dra_manager.go:265: I0115 15:11:04.408264 18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {
Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight
=> device never seen as allocated
This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but enough that regressions should show up as flakes.
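A toy model of the interleaving, with illustrative names rather than the real
scheduler types, showing how a device that moves from in-flight to allocated
between the two snapshots is never seen as allocated:

package main

import (
	"fmt"
	"sync"
)

// state about a device lives in either the "in-flight" set or the
// "allocated" cache and moves from one to the other; a reader that
// snapshots the two sets at different times can miss the device entirely.
type state struct {
	mu        sync.Mutex
	allocated map[string]bool // fed by the assume cache
	inFlight  map[string]bool // claims still being written back
}

func (s *state) snapshotAllocated() map[string]bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]bool, len(s.allocated))
	for d := range s.allocated {
		out[d] = true
	}
	return out
}

func (s *state) snapshotInFlight() map[string]bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]bool, len(s.inFlight))
	for d := range s.inFlight {
		out[d] = true
	}
	return out
}

// moveToAllocated is what goroutine #2 effectively does: add to the
// assume cache, then remove from in-flight, as two separate updates.
func (s *state) moveToAllocated(device string) {
	s.mu.Lock()
	s.allocated[device] = true
	s.mu.Unlock()
	s.mu.Lock()
	delete(s.inFlight, device)
	s.mu.Unlock()
}

func main() {
	s := &state{
		allocated: map[string]bool{},
		inFlight:  map[string]bool{"device-094": true},
	}

	// goroutine #1: copies the allocated-device snapshot first ...
	allocated := s.snapshotAllocated() // device-094 not yet in the cache

	// goroutine #2: the claim lands in the cache and leaves in-flight.
	s.moveToAllocated("device-094")

	// goroutine #1: ... then checks in-flight, where the device is gone.
	inFlight := s.snapshotInFlight()

	fmt.Println(allocated["device-094"] || inFlight["device-094"]) // false: never seen as allocated
}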
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.
This log snippet shows how this went wrong during PreBind:
dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
...
allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"
- goroutine #1: UpdateStatus result delivered via informer.
AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
*Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.
Also, events are delivered out-of-order.
The fix is to let emitEvents, when called by Assume, wait for a potentially
running emitEvents in some other goroutine. This ensures that an event pulled
out of the queue by that other goroutine has been delivered before Assume
itself checks the queue one more time and then returns.
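A toy sketch of that synchronization idea (illustrative only, not the actual
assume cache implementation): emitEvents normally returns immediately if
another goroutine is already draining the queue, but when called from Assume
it waits until that goroutine is done before draining the queue itself.

package main

import "sync"

type cache struct {
	mu       sync.Mutex
	emitting bool
	done     *sync.Cond
	queue    []string
}

func newCache() *cache {
	c := &cache{}
	c.done = sync.NewCond(&c.mu)
	return c
}

func deliver(ev string) {
	// Placeholder for invoking the registered event handlers.
	_ = ev
}

func (c *cache) emitEvents(waitForOther bool) {
	c.mu.Lock()
	if c.emitting {
		if !waitForOther {
			// Some other goroutine is already delivering events.
			c.mu.Unlock()
			return
		}
		// Called from Assume: wait until the other emitEvents finishes.
		for c.emitting {
			c.done.Wait()
		}
	}
	c.emitting = true
	for len(c.queue) > 0 {
		ev := c.queue[0]
		c.queue = c.queue[1:]
		c.mu.Unlock()
		deliver(ev) // event handlers run without holding the lock
		c.mu.Lock()
	}
	c.emitting = false
	c.done.Broadcast()
	c.mu.Unlock()
}

func main() {
	c := newCache()
	c.queue = append(c.queue, "event A", "event B")
	c.emitEvents(true) // as if called from Assume: drains everything first
}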
The time window where things go wrong is small. An E2E test covering this only
flaked rarely, and only in CI. An integration test (separate commit) with a
higher number of pods finally made it possible to reproduce locally. It also
uncovered a second race (fix in separate commit).
The unit test fails without the fix:
=== RUN TestAssumeConcurrency
assume_cache_test.go:311: FATAL ERROR:
Assume should have blocked and didn't.
--- FAIL: TestAssumeConcurrency (0.00s)