Commit Graph

129635 Commits

Author SHA1 Message Date
Erik Wilson
aaaf2c4bfe Update kubernetes service on start for port changes 2026-02-10 19:16:53 -03:00
Darren Shepherd
a8a3c6b707 Don't ever select the flannel bridge or cni bridge 2026-02-10 19:16:53 -03:00
Darren Shepherd
f7a3a8da22 Cache loopback cert in the certs dir if set 2026-02-10 19:16:53 -03:00
Darren Shepherd
80a478042a Add ability to disable proxy hostname check 2026-02-10 19:16:53 -03:00
Darren Shepherd
bd0942c26e Hide deprecated warnings 2026-02-10 19:16:53 -03:00
Darren Shepherd
9bc573e44c Set all sources so node+agent in the same process doesn't get restricted 2026-02-10 19:16:53 -03:00
Darren Shepherd
8497455126 Don't check for cpuset cgroup, not always required? 2026-02-10 19:16:53 -03:00
Darren Shepherd
290531e1f3 Wait for kube-apiserver for 2 minutes for slow (ARM) systems 2026-02-10 19:16:53 -03:00
Darren Shepherd
e5e410d22a Make kubelet.sock path changeable 2026-02-10 19:16:53 -03:00
Darren Shepherd
c65e18b7aa only use the resolved name if port was zero 2026-02-10 19:16:53 -03:00
Darren Shepherd
7c72ca28b8 If you can't set hashsize on nf_conntrack don't fail 2026-02-10 19:16:53 -03:00
Darren Shepherd
c95bf4af6f Drop credential providers 2026-02-10 19:16:53 -03:00
Darren Shepherd
0837fa7a7a Drop storage plugins 2026-02-10 19:16:53 -03:00
Darren Shepherd
f2e05043f6 Drop client-go cloud auth 2026-02-10 19:16:53 -03:00
Kubernetes Release Robot
5adfc48e19 Release commit for Kubernetes v1.33.8 2026-02-10 12:54:00 +00:00
Kubernetes Prow Robot
87e1a6154c Merge pull request #136365 from dlipovetsky/automated-cherry-pick-of-#136014-upstream-release-1.33
Automated cherry pick of #136014: kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join'
2026-02-06 03:26:30 +05:30
Kubernetes Prow Robot
dc72324ffd Merge pull request #136595 from RomanBednar/automated-cherry-pick-of-#136202-upstream-release-1.33
Automated cherry pick of #136202: csi: raise kubelet CSI init backoff to cover ~140s DNS delays
2026-02-05 20:44:33 +05:30
Kubernetes Prow Robot
c94b79f22d Merge pull request #136565 from pohly/automated-cherry-pick-of-#136269-origin-release-1.33
Automated cherry pick of #136269: DRA scheduler: double allocation fixes
2026-02-05 19:32:39 +05:30
Kubernetes Prow Robot
24d42be054 Merge pull request #136434 from thc1006/cherry-pick-136028-release-1.33
[release-1.33] fix(kubelet): convert V().Error() to V().Info() for verbosity-aware logging
2026-02-05 19:32:31 +05:30
Kubernetes Prow Robot
b50c7f32b7 Merge pull request #136375 from princepereira/automated-cherry-pick-of-#136241-upstream-release-1.33
Automated cherry pick of #136241: Fix for preferred dualstack and required dualstack in winkernel proxier.
2026-02-05 18:14:32 +05:30
Kubernetes Prow Robot
59cb443824 Merge pull request #136468 from cpanato/update-go-1256-rel133
[release-1.33] [go] Bump dependencies, images and versions used to Go 1.24.12 and distroless iptables
2026-01-31 13:22:25 +05:30
Carlos Panato
9067f06e36 Bump dependencies, images and versions used to Go 1.24.12 and distroless iptables
Signed-off-by: Carlos Panato <ctadeu@gmail.com>
2026-01-30 11:13:21 +01:00
Kubernetes Prow Robot
f184c13310 Merge pull request #136636 from dims/automated-cherry-pick-of-#136529-#136554-upstream-release-1.33
Automated cherry pick of #136529: test: Read /proc/net/nf_conntrack instead of using conntrack binary
#136554: test: Fix KubeProxy CLOSE_WAIT test for IPv6 environments (and where /proc/net/nf_conntrack may be missing)
2026-01-30 14:33:43 +05:30
Davanum Srinivas
7be92699ea Apparently some EC2 images we use do not have /proc/net/nf_conntrack
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-01-29 15:02:43 -05:00
Davanum Srinivas
1479c62a20 test: cleanup from review
- Use netutils.IsIPv6(ip) instead of manual nil/To4 check
- Remove unnecessary ip.To16() call since IPv6 is already 16 bytes
- Remove ipFamily from grep pattern since IP format ensures correctness

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-01-29 15:02:43 -05:00
Davanum Srinivas
727969bb68 test: Fix KubeProxy CLOSE_WAIT test for IPv6 environments
The /proc/net/nf_conntrack file uses fully expanded IPv6 addresses
with leading zeros in each 16-bit group. For example:
  fc00:f853:ccd:e793::3 -> fc00:f853:0ccd:e793:0000:0000:0000:0003

Add expandIPv6ForConntrack() helper function to expand IPv6 addresses
to the format used by /proc/net/nf_conntrack before using them in
the grep pattern.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-01-29 15:02:43 -05:00
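The expansion described above can be sketched as follows. This is a hypothetical re-implementation based only on the commit message; `expandIPv6ForConntrack` here is not the actual e2e helper:

```go
package main

import (
	"fmt"
	"net"
)

// expandIPv6ForConntrack expands an IPv6 address into the fully zero-padded
// form that /proc/net/nf_conntrack uses (four hex digits per 16-bit group).
// Non-IPv6 input is returned unchanged.
func expandIPv6ForConntrack(s string) string {
	ip := net.ParseIP(s)
	if ip == nil || ip.To4() != nil {
		return s // not IPv6; leave as-is
	}
	b := ip.To16() // 16 bytes, two per group
	out := ""
	for i := 0; i < 16; i += 2 {
		if i > 0 {
			out += ":"
		}
		out += fmt.Sprintf("%02x%02x", b[i], b[i+1])
	}
	return out
}

func main() {
	fmt.Println(expandIPv6ForConntrack("fc00:f853:ccd:e793::3"))
	// fc00:f853:0ccd:e793:0000:0000:0000:0003
}
```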
Davanum Srinivas
2e3a2fa222 test: Read /proc/net/nf_conntrack instead of using conntrack binary
The distroless-iptables image no longer includes the conntrack binary
as of v0.8.7 (removed in kubernetes/release#4223 since kube-proxy no
longer needs it after kubernetes#126847).

Update the KubeProxy CLOSE_WAIT timeout test to read /proc/net/nf_conntrack
directly instead of using the conntrack command. The file contains the
same connection tracking data and is accessible from the privileged
host-network pod.

Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2026-01-29 15:02:41 -05:00
Roman Bednar
c45940ecdb csi: raise kubelet CSI init backoff to cover ~140s DNS delays
- bump init backoff to Duration=30ms, Factor=8 (Steps=6) to yield ~140s total
- prevent kubelet restarts when DNS is blackholed and NSS must fall back to myhostname
- keep CSI/CSINode initialization alive long enough to complete in ARO DNS-failure scenarios
2026-01-28 11:49:07 +01:00
Patrick Ohly
ba81d3040d DRA scheduler: fix another root cause of double device allocation
GatherAllocatedState and ListAllAllocatedDevices need to collect information
from different sources (allocated devices, in-flight claims), potentially even
multiple times (GatherAllocatedState first gets allocated devices, then the
capacities).

The underlying assumption that nothing bad happens in parallel is not always
true. The following log snippet shows how an update of the assume
cache (feeding the allocated devices tracker) and in-flight claims lands such
that GatherAllocatedState doesn't see the device in that claim as allocated:

    dra_manager.go:263: I0115 15:11:04.407714      18778] scheduler: Starting GatherAllocatedState
    ...
    allocateddevices.go:189: I0115 15:11:04.407945      18066] scheduler: Observed device allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-094" claim="testdra-all-usesallresources-hvs5d/claim-0553"
    dynamicresources.go:1150: I0115 15:11:04.407981      89109] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680"
    dra_manager.go:201: I0115 15:11:04.408008      89109] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 version="1211"
    dynamicresources.go:1157: I0115 15:11:04.408044      89109] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-hvs5d/my-pod-0553" claim="testdra-all-usesallresources-hvs5d/claim-0553" uid=<types.UID>: a84d3c4d-f752-4cfd-8993-f4ce58643685 resourceVersion="5680" allocation=<
        	{
        	  "devices": {
        	    "results": [
        	      {
        	        "request": "req-1",
        	        "driver": "testdra-all-usesallresources-hvs5d.driver",
        	        "pool": "worker-5",
        	        "device": "worker-5-device-094"
        	      }
        	    ]
        	  },
        	  "nodeSelector": {
        	    "nodeSelectorTerms": [
        	      {
        	        "matchFields": [
        	          {
        	            "key": "metadata.name",
        	            "operator": "In",
        	            "values": [
        	              "worker-5"
        	            ]
        	          }
        	        ]
        	      }
        	    ]
        	  },
        	  "allocationTimestamp": "2026-01-15T14:11:04Z"
        	}
         >
    dra_manager.go:280: I0115 15:11:04.408085      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-095" claim="testdra-all-usesallresources-hvs5d/claim-0086"
    dra_manager.go:280: I0115 15:11:04.408137      18778] scheduler: Device is in flight for allocation device="testdra-all-usesallresources-hvs5d.driver/worker-5/worker-5-device-096" claim="testdra-all-usesallresources-hvs5d/claim-0165"
    default_binder.go:69: I0115 15:11:04.408175      89109] scheduler: Attempting to bind pod to node pod="testdra-all-usesallresources-hvs5d/my-pod-0553" node="worker-5"
    dra_manager.go:265: I0115 15:11:04.408264      18778] scheduler: Finished GatherAllocatedState allocatedDevices=<map[string]interface {} | len:2>: {

Initial state: "worker-5-device-094" is in-flight, not in cache
- goroutine #1: starts GatherAllocatedState, copies cache
- goroutine #2: adds to assume cache, removes from in-flight
- goroutine #1: checks in-flight

=> device never seen as allocated

This is the second reason for double allocation of the same device in two
different claims. The other was timing in the assume cache. Both were
tracked down with an integration test (separate commit). It did not fail
all the time, but enough that regressions should show up as flakes.
2026-01-27 14:44:04 +01:00
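The interleaving above can be re-enacted sequentially. The maps and names below are illustrative stand-ins for the allocated-devices cache and the in-flight claims, not the scheduler's real data structures:

```go
package main

import "fmt"

// gatherSeesDevice replays the racy interleaving step by step in a single
// goroutine: the snapshot of the cache is taken before the device moves from
// the in-flight set into the cache, and the in-flight check happens after the
// move, so neither source reports the device.
func gatherSeesDevice() bool {
	cache := map[string]bool{}               // allocated devices (assume cache)
	inFlight := map[string]bool{"dev": true} // claims currently being allocated

	// goroutine #1: GatherAllocatedState copies the cache first...
	snapshot := map[string]bool{}
	for d := range cache {
		snapshot[d] = true
	}

	// goroutine #2: allocation lands; device moves in-flight -> cache.
	cache["dev"] = true
	delete(inFlight, "dev")

	// goroutine #1: ...and only then checks the in-flight claims.
	for d := range inFlight {
		snapshot[d] = true
	}

	return snapshot["dev"]
}

func main() {
	fmt.Println(gatherSeesDevice()) // false: the device is never seen as allocated
}
```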
Patrick Ohly
75fd186e7a DRA scheduler: fix one root cause of double device allocation
DRA depends on the assume cache having invoked all event handlers before
Assume() returns, because DRA maintains state that is relevant for scheduling
through those event handlers.

This log snippet shows how this went wrong during PreBind:

    dynamicresources.go:1150: I0115 10:35:29.264437] scheduler: Claim stored in assume cache pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636"
    dra_manager.go:198: I0115 10:35:29.264448] scheduler: Removed in-flight claim claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 version="287"
    dynamicresources.go:1157: I0115 10:35:29.264463] scheduler: Removed claim from in-flight claims pod="testdra-all-usesallresources-kqjpj/my-pod-0091" claim="testdra-all-usesallresources-kqjpj/claim-0091" uid=<types.UID>: 516f274f-e1a9-4a4b-b7d2-bb86138e4240 resourceVersion="5636" allocation=<
    ...
    allocateddevices.go:189: I0115 10:35:29.267315] scheduler: Observed device allocation device="testdra-all-usesallresources-kqjpj.driver/worker-1/worker-1-device-096" claim="testdra-all-usesallresources-kqjpj/claim-0091"

- goroutine #1: UpdateStatus result delivered via informer.
  AssumeCache updates cache, pushes event A, emitEvents pulls event A from queue.
  *Not* done with delivering it yet!
- goroutine #2: AssumeCache.Assume called. Updates cache, pushes event B, emits it.
  Old and new claim have allocation, so no "Observed device allocation".
- goroutine #3: Schedules next pod, without considering device as allocated (not in the log snippet).
- goroutine #1: Finally delivers event A: "Observed device allocation", but too late.

Also, events are delivered out-of-order.

The fix is to make emitEvents, when called from Assume, wait for a potentially
running emitEvents in some other goroutine, thus ensuring that an event pulled
out of the queue by that other goroutine has been delivered before Assume itself
checks the queue one more time and then returns.

The time window where things go wrong is small. An E2E test covering this only
flaked rarely, and only in CI. An integration test (separate commit) with a
higher number of pods finally made it possible to reproduce locally. It also
uncovered a second race (fix in separate commit).

The unit test fails without the fix:

    === RUN   TestAssumeConcurrency
        assume_cache_test.go:311: FATAL ERROR:
            	Assume should have blocked and didn't.
    --- FAIL: TestAssumeConcurrency (0.00s)
2026-01-27 14:44:03 +01:00
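A minimal sketch of that fix, with hypothetical types (the real assume cache lives in the scheduler and is fully concurrent; the queue here is mutated single-threaded for clarity): holding a mutex for the entire drain makes an `emitEvents` triggered by `Assume` block until any emission already running in another goroutine has finished delivering its events.

```go
package main

import (
	"fmt"
	"sync"
)

// assumeCache is an illustrative stand-in. mu serializes event delivery; in
// real concurrent code, queue mutation would need locking as well.
type assumeCache struct {
	mu        sync.Mutex
	queue     []string
	delivered []string
}

func (c *assumeCache) push(event string) {
	c.queue = append(c.queue, event)
}

// emitEvents drains the queue under the lock, so a second caller blocks until
// a running emission has delivered everything it pulled off the queue.
func (c *assumeCache) emitEvents() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for len(c.queue) > 0 {
		e := c.queue[0]
		c.queue = c.queue[1:]
		c.delivered = append(c.delivered, e) // "deliver" the event
	}
}

// Assume pushes its own event and drains the queue; taking mu inside
// emitEvents is what makes it wait for a concurrent emitter before returning.
func (c *assumeCache) Assume(event string) {
	c.push(event)
	c.emitEvents()
}

func main() {
	c := &assumeCache{}
	c.push("event A") // e.g. an informer update queued by another goroutine
	c.Assume("event B")
	fmt.Println(c.delivered) // [event A event B]
}
```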
thc1006
dc42de643f fix(kubelet): convert V().Error() to V().Info() for verbosity-aware logging
This PR fixes incorrect usage of `logger.V(N).Error()` in the kubelet
package. The go-logr package design causes `Error()` calls to bypass
verbosity level checks entirely, meaning these logs are always printed
regardless of the configured verbosity level.

Note: In release-1.33, only pleg/generic.go uses contextual logging
(logger.V().Error). All other files use klog.V().ErrorS() which properly
respects verbosity levels, so they are NOT changed in this cherry-pick.

Files modified (1 file, 1 instance):
- pkg/kubelet/pleg/generic.go

Conversion pattern applied:
- g.logger.V(N).Error(err, "msg", ...) → g.logger.V(N).Info("msg", ..., "err", err)

Fixes: https://github.com/kubernetes/kubernetes/issues/136027
Cherry-pick of: https://github.com/kubernetes/kubernetes/pull/136028

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
2026-01-27 00:01:46 +00:00
Prince Pereira
f00408bb42 Fix for preferred dualstack and required dualstack in winkernel proxier. 2026-01-21 05:41:34 +00:00
SataQiu
7a092c63d9 kubeadm: waiting for etcd learner member to be started before promoting during 'kubeadm join' 2026-01-20 16:01:08 -08:00
Kubernetes Prow Robot
41b9b1f939 Merge pull request #136172 from jsafrane/automated-cherry-pick-of-#135629-upstream-release-1.33
Automated cherry pick of #135629: selinux: Fix the controller to ignore finished pods
2026-01-13 21:55:37 +05:30
Jan Safranek
7c5a9b9833 Add unit test with CSIDriver.SELinuxMount=false
Add unit test with a volume plugin that does not support SELinux. That
simulates a CSI driver whose spec.SELinuxMount is empty or false.

This requires a little refactoring, each unit test now has a flag if it
runs with a volume plugin that supports SELinux.
2026-01-12 14:48:14 +01:00
Jan Safranek
aa6b40b2aa Added e2e tests with disabled SELinux
Added a few tests with a CSI driver that does not support SELinux and has it
disabled in its CSIDriver instance
2026-01-12 14:48:14 +01:00
Jan Safranek
6e491604a4 Use only enqueuePod to add pods to the controller queue
enqueuePod already creates the right key for a pod; it's better to reuse it
than to copy the code around.
2026-01-12 14:48:14 +01:00
Jan Safranek
1c3b0b1138 Fix policy of Pods with unknown SELinux label
Reset SELinuxChangePolicy of Pods that have no SELinux label set to
Recursive. Kubelet cannot mount with `-o context=<label>` if the label is
not known.

This fixes the e2e test error revealed by the previous commit - it changed the
e2e test to check for events when no events are expected and it found a
warning about a Pod with no label but a MountOption policy.
2026-01-12 14:48:13 +01:00
Jan Safranek
e6ae3c9405 selinux: add e2e test with a completed pod
Add a test that checks the SELinux controller does not report conflicts
with Succeeded pods.
2026-01-12 14:48:13 +01:00
Jan Safranek
d05bfe8123 Add new unit tests 2026-01-12 14:48:13 +01:00
Jan Safranek
5602c5e6b5 Rework unit tests to builder pattern 2026-01-12 14:48:13 +01:00
Jan Safranek
9222f08d22 selinux: Do not report conflicts with finished pods
When a Pod reaches its final state (Succeeded or Failed), its volumes are
getting unmounted and therefore their SELinux mount option will not
conflict with any other pod.

Let the SELinux controller monitor "pod updated" events to notice when a pod
is finished.
2026-01-12 14:48:13 +01:00
Jan Safranek
f02a1fc357 refactoring: use a common function to enqueue Pod
addPod and deletePod have the same implementation; merge them into
enqueuePod.
2026-01-12 14:48:13 +01:00
Kubernetes Prow Robot
0c729828df Merge pull request #136105 from atiratree/automated-cherry-pick-of-#135625-upstream-release-1.33
Automated cherry pick of #135625: mark QuotaMonitor as not running and invalidate monitors list
2026-01-09 11:39:41 +05:30
Filip Křepinský
6a783eb8f9 mark QuotaMonitor as not running and invalidate monitors list
to prevent close of closed channel panic
2026-01-08 13:46:29 +01:00
Kubernetes Prow Robot
4e47cdc06c Merge pull request #136070 from neolit123/automated-cherry-pick-of-#135776-origin-release-1.33
Automated cherry pick of #135776: kubeadm: always retry Patch() Node API calls
2026-01-08 07:09:38 +05:30
Lubomir I. Ivanov
4e36355c5d kubeadm: always retry Patch() Node API calls
The PatchNodeOnce function has historically exited early
in scenarios where we Get a Node object but the next Patch
API call on the same Node object fails. This can happen
in setups that are under a lot of resource pressure
or in various network timeout scenarios.

Instead of exiting early and allowlisting certain errors,
always retry on any Patch error. This aligns with the
general idea that kubeadm retries *all* API calls.
2026-01-07 14:27:10 +01:00
Kubernetes Prow Robot
8644ee8680 Merge pull request #135812 from AkihiroSuda/fix-135210-1.33
[release-1.33] hack/lib/util.sh: support uutils' `date` command
2026-01-07 12:43:41 +05:30
Kubernetes Prow Robot
e4caa612a9 Merge pull request #135851 from neolit123/automated-cherry-pick-of-#135400-origin-release-1.33
Automated cherry pick of #135400: kubeadm: do not sort extraArgs alpha-numerically
2026-01-06 07:42:37 +05:30
Lubomir I. Ivanov
249d35bf43 kubeadm: do not sort extraArgs alpha-numerically
If the user has provided extraArgs with an order that has
significance (e.g. --service-account-issuer for kube-apiserver),
kubeadm will correctly override any base args, but will end up
sorting the entire resulting list, which is not desired.

Instead, only sort the base arguments and preserve the order
of overrides provided by the user.
2025-12-19 17:43:22 +01:00