7039 Commits

Author SHA1 Message Date
Maciej Szulik
499bff4ca4 Revert "controller: duplicate utility method cleanup" 2025-11-05 21:06:09 +01:00
Kubernetes Prow Robot
9ef1a14d68 Merge pull request #134840 from ahmetb/ahmet/mini-cleanup
controller: duplicate utility method cleanup
2025-11-05 08:06:58 -08:00
Kubernetes Prow Robot
9a192aa1c3 Merge pull request #134432 from Karthik-K-N/fix-sv-test
Fix storage version test flake
2025-11-05 06:56:52 -08:00
Ayato Tokubi
320987ead3 Addressed comments 2025-11-05 10:44:50 +00:00
Ayato Tokubi
5102591a6b Refactor resource claim metrics to use structured labels and add "source" dimension.
Signed-off-by: Ayato Tokubi <atokubi@redhat.com>
2025-11-05 09:52:47 +00:00
Kubernetes Prow Robot
c1a6a3ca71 Merge pull request #134152 from pohly/dra-device-taints-1.35
DRA: device taints: new ResourceSlice API, new features
2025-11-04 15:32:07 -08:00
Kubernetes Prow Robot
97cb47a913 Merge pull request #135080 from dejanzele/feat/promote-job-managedby-to-ga
KEP-4368: Job Managed By; Promote to GA
2025-11-04 13:42:12 -08:00
Patrick Ohly
bbf8bc766e DRA device taints: DeviceTaintRule status
To update the right statuses, the controller must collect more information
about why a pod is being evicted. Updating the DeviceTaintRule statuses then is
handled by the same work queue as evicting pods.

Both operations already share the same client instance and thus QPS+server-side
throttling, so they might as well share the same work queue. Deleting pods is
not necessarily more important than informing users or vice-versa, so there is
no strong argument for having different queues.

While at it, switch the unit tests to the same mock work queue as in
staging/src/k8s.io/dynamic-resource-allocation/internal/workqueue. Because
there is no time to add it properly to a staging repo, the implementation gets
copied.
2025-11-04 21:57:24 +01:00
Patrick Ohly
0689b628c7 generated files 2025-11-04 21:57:24 +01:00
Patrick Ohly
f4a453389d DRA device taint eviction: configurable number of workers
It might never be necessary to change the default, but it is hard to be sure.
It's better to have the option, just in case.
2025-11-04 21:57:24 +01:00
Kubernetes Prow Robot
a058cf788a Merge pull request #134624 from yt2985/podcertificates-beta
Promote Pod Certificates feature to beta
2025-11-04 11:42:12 -08:00
Dejan Zele Pejchev
3dabd4417d KEP-4368: Job Managed By; Promote to GA
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
2025-11-04 10:59:45 +01:00
Kubernetes Prow Robot
d6aa2db57e Merge pull request #135027 from omerap12/remove-reactor-hpa
Remove unused delete reactor
2025-11-04 01:30:10 -08:00
Kubernetes Prow Robot
48c56e04e0 Merge pull request #135017 from liggitt/stateful-set-noop-rollout
Fix spurious statefulset rollout from 1.33 → 1.34
2025-11-03 19:58:11 -08:00
Kubernetes Prow Robot
41673c7198 Merge pull request #134910 from tchap/kcm-controllers-thread-mgmt
pkg/controller: Improve goroutine management
2025-11-03 17:58:03 -08:00
Jordan Liggitt
979c442774 Fix spurious workload rollout due to null creationTimestamp in controller revisions 2025-11-03 17:11:06 -05:00
Jordan Liggitt
7d186d870f Remove unused and fragile revision hash comparisons
This was broken since 666a41c2ea when the label value became non-integer encoded
The chance of one controller revision hash label being int-parsable: (7/27)^8 = 0.00002041 = ~0
The chance of both being int-parsable: 0.00002041^2 = ~0

Hash comparison locks in differences in content failing EqualRevision
even when the semantic content is normalized to be equal.
2025-11-03 16:33:40 -05:00
Jordan Liggitt
94e085e15c Add unit test detecting spurious statefulset rollout 2025-11-03 16:33:39 -05:00
Lukasz Szaszkiewicz
c832203707 pkg/controller/garbagecollector/garbagecollector_test: wrap kubeClient with a client that doesn't support WatchList semantics. 2025-11-03 10:41:49 +01:00
tinatingyu
59e075e8d3 Promote PodCertificateRequests to v1beta1 2025-11-02 05:33:44 +00:00
Omer Aplatony
264eab46db Remove unused delete reactor
Signed-off-by: Omer Aplatony <omerap12@gmail.com>
2025-11-01 06:13:40 +00:00
Patrick Ohly
c69259cb71 DRA device taints: switch to workqueue in controller
The approach copied from node taint eviction was to fire off one goroutine per
pod at the intended time. This leads to the "thundering herd" problem: when a
single taint causes eviction of several pods and those all have no or the same
toleration grace period, then they all get deleted concurrently.

For node taint eviction that is limited by the number of pods per node, which
is typically ~100. In an integration test, that already led to problems with
watchers:

   cacher.go:855] cacher (pods): 100 objects queued in incoming channel.
   cache_watcher.go:203] Forcing pods watcher close due to unresponsiveness: key: "/pods/", labels: "", fields: "". len(c.input) = 10, len(c.result) = 10, graceful = false

It also causes spikes in memory consumption (mostly the 2KB stack per goroutine
plus closure) with no upper limit.

Using a workqueue makes concurrency more deterministic because there is an
upper limit. In the integration test, 10 workers kept the watch active.

Another advantage is that failures to evict the pod get retried with
exponential backoff per affected pod forever. Previously, evicting was tried a
few times with a fixed rate and then the controller gave up. If the apiserver
was down long enough, pods didn't get evicted.
2025-10-31 18:11:19 +01:00
Patrick Ohly
e5fcd20a26 DRA device taints: tighten controller test
We know how often the controller should get a pod, let's check it.
Must run before we do our own GET call.
2025-10-31 18:11:18 +01:00
Patrick Ohly
6ebd853f17 DRA: implementation of none taint effect
While at it, ensure that future unknown effects are treated like
the None effect.
2025-10-31 18:11:18 +01:00
Patrick Ohly
e4dda7b282 DRA device taints: fix DeviceTaintRule + missing slice case
When the ResourceSlice no longer exists, the ResourceSlice tracker didn't and
couldn't report the tainted devices even if they are allocated and in use. The
controller must keep track of DeviceTaintRules itself and handle this scenario.

In this scenario it is impossible to evaluate CEL expressions because the
necessary device attributes aren't available. We could:
- Copy them in the allocation result: too large, big change.
- Limit usage of CEL expressions to rules with no eviction: inconsistent.
- Remove the fields which cannot be supported well.

The last option is chosen.

The tracker is now no longer needed by the eviction controller. Reading
directly from the informer means that we cannot assume that pointers are
consistent. We have to track ResourceSlices by their name, not their pointer.
2025-10-31 18:11:18 +01:00
Patrick Ohly
2e543d151b DRA device taints: convert unit test to synctest
The immediate benefit is that the time required for running the package's unit
test goes down from ~10 seconds (because of required real-world delays) to ~0.5
seconds (depending on the CPU performance of the host). It can also make
writing tests easier because after a `Wait` there is no need for locking before
accessing internal state (all background goroutines are known to be blocked
waiting for the main goroutine).

What somewhat ruins the perfect determinism is the polling for informer cache
syncs: that can take an unknown number of loop iterations. This could probably
be fixed by making the waiting block on channels (requires work in client-go).

The only change required in the implementation is avoiding the sleep when
deleting a pod failed for the last time in the loop (a useful, albeit minor
improvement by itself): the test proceeds after having blocked that last Delete
call, in which case synctest expects the background goroutine to exit without
delay.
2025-10-30 17:29:58 +01:00
Kubernetes Prow Robot
808d320de1 Merge pull request #134956 from yliaog/blockowner
removed BlockOwnerDeletion
2025-10-30 01:26:11 -07:00
yliao
4f647b3f3d removed BlockOwnerDeletion 2025-10-29 22:41:10 +00:00
Kubernetes Prow Robot
3ec2d82da5 Merge pull request #134784 from michaelasp/svm_beta2
SVM: bump the API to beta, remove unused fields
2025-10-29 13:56:02 -07:00
Michael Aspinwall
3b72759d1b Update SVM to Beta
Co-authored-by: Stanislav Láznička <stlaz.devel@proton.me>
2025-10-29 19:36:11 +00:00
Ondra Kupka
ad2c6b443d controller/validatingadmissionpolicystatus: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:07:10 +01:00
Ondra Kupka
a51285e1f2 controller/servicecidrs: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:07:10 +01:00
Ondra Kupka
63c15cbe83 controller/resourceclaim: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:07:10 +01:00
Ondra Kupka
5f423d7ba8 controller/podautoscaler: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:38 +01:00
Ondra Kupka
51ef94c547 controller/nodelifecycle: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:38 +01:00
Ondra Kupka
34e688eb3d controller/nodeipam: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:38 +01:00
Ondra Kupka
a265769245 controller/ttlafterfinished: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
821a3f5aff controller/storageversionmigrator: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
7240649e4f controller/ttl: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
3ee8c53e53 controller/podgc: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
1635a139b8 controller/storageversiongc: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
502186ca93 controller/statefulset: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
91cf8253a2 controller/replicaset: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
5f48a52bf8 controller/namespace: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
cb4ad79102 controller/endpointslicemirroring: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:04:37 +01:00
Ondra Kupka
cd73e8777b controller/endpointslice: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:00:31 +01:00
Ondra Kupka
ccd35f7c5e controller/endpoint: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:00:31 +01:00
Ondra Kupka
d9ba92ba3b controller/disruption: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:00:31 +01:00
Ondra Kupka
6e0a4da2f6 controller/deployment: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:00:31 +01:00
Ondra Kupka
e8b0f27456 controller/daemon: Improve goroutine mgmt
Make sure all threads are terminated when Run returns.
2025-10-29 19:00:30 +01:00