Commit Graph

125355 Commits

Kubernetes Prow Robot
36bbdd692f
Merge pull request #127466 from guozheng-shen/fix-return
endpointsLeasesResourceLock and configMapsLeasesResourceLock have been removed
2024-09-25 14:36:01 +01:00
Kubernetes Prow Robot
5fc4e71a30
Merge pull request #127499 from pohly/scheduler-perf-updates
scheduler_perf: updates to enhance performance testing of DRA
2024-09-25 13:32:00 +01:00
Kubernetes Prow Robot
75214d11d5
Merge pull request #127428 from googs1025/scheduler/plugin
chore(scheduler): refactor import package ordering in scheduler
2024-09-25 11:40:07 +01:00
Kubernetes Prow Robot
4c4edfede5
Merge pull request #127398 from my-git9/patch-23
kubeadm: update comment for ArgumentsFromCommand function in app/util/arguments
2024-09-25 11:40:00 +01:00
Lukasz Szaszkiewicz
ae35048cb0
adds watchListEndpointRestrictions for watchlist requests (#126996)
* endpoints/handlers/get: introduce watchListEndpointRestrictions

* consistencydetector/list_data_consistency_detector: expose IsDataConsistencyDetectionForListEnabled

* e2e/watchlist: extract common function for adding unstructured secrets

* e2e/watchlist: new e2e scenarios for covering watchListEndpointRestrictions
2024-09-25 10:12:01 +01:00
Patrick Ohly
d100768d94 scheduler_perf: track and visualize progress over time
This is useful to see whether pod scheduling happens in bursts and how it
behaves over time, which is relevant in particular for dynamic resource
allocation where it may become harder at the end to find the node which still
has resources available.

Besides "pods scheduled" it's also useful to know how many attempts were
needed, so schedule_attempts_total also gets sampled and stored.

To visualize the result of one or more test runs, use:

     gnuplot.sh *.dat
2024-09-25 11:09:15 +02:00
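
A minimal sketch of the sampling idea described above, using only the Go standard library: periodically scrape a Prometheus-style /metrics endpoint and append timestamped samples that gnuplot can plot. The endpoint URL, interval, and output file name are assumptions; this is not the scheduler_perf collector itself.

    // Hedged sketch: sample schedule_attempts_total over time into a .dat file.
    // Endpoint URL and file name are assumptions, not scheduler_perf defaults.
    package main

    import (
        "bufio"
        "fmt"
        "net/http"
        "os"
        "strings"
        "time"
    )

    func main() {
        const metricsURL = "http://localhost:10259/metrics" // assumption: locally reachable scheduler metrics endpoint
        out, err := os.Create("samples.dat")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        start := time.Now()
        for {
            if resp, err := http.Get(metricsURL); err == nil {
                scanner := bufio.NewScanner(resp.Body)
                for scanner.Scan() {
                    line := scanner.Text()
                    // Counter lines look like: schedule_attempts_total{result="scheduled"} 42
                    if strings.HasPrefix(line, "schedule_attempts_total") {
                        fields := strings.Fields(line)
                        fmt.Fprintf(out, "%.0f %s\n", time.Since(start).Seconds(), fields[len(fields)-1])
                    }
                }
                resp.Body.Close()
            }
            time.Sleep(time.Second)
        }
    }
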
xin.li
706e939382 kubeadm: update comment for ArgumentsFromCommand function in app/util/arguments
Signed-off-by: xin.li <xin.li@daocloud.io>
2024-09-25 16:19:28 +08:00
Patrick Ohly
ded96042f7 scheduler_perf + DRA: load up cluster by allocating claims
Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating
claims and then allocating them more or less like the scheduler would when
scheduling pods is much faster and in practice has the same effect on the
dynamicresources plugin because it looks at claims, not pods.

This allows defining the "steady state" workloads with a higher number of
devices ("claimsPerNode") again. This was prohibitively slow before.
2024-09-25 09:45:39 +02:00
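
Purely as an illustration of the idea, a tiny sketch of pre-allocating claims per node so the cluster looks "full" without scheduling placeholder pods. The claim type and allocateTo helper are hypothetical stand-ins, not the resource.k8s.io API or the scheduler_perf helpers.

    // Hedged, illustrative sketch only: hypothetical types, not the real API.
    package main

    import "fmt"

    type claim struct {
        name string
        node string // empty until allocated
    }

    // allocateTo marks a claim as allocated to a node, roughly what the
    // dynamicresources plugin would see after a pod using the claim is scheduled.
    func (c *claim) allocateTo(node string) { c.node = node }

    func main() {
        const nodes, claimsPerNode = 100, 10
        var claims []*claim
        for n := 0; n < nodes; n++ {
            node := fmt.Sprintf("node-%d", n)
            for i := 0; i < claimsPerNode; i++ {
                c := &claim{name: fmt.Sprintf("%s-claim-%d", node, i)}
                c.allocateTo(node) // cluster looks full to the plugin without scheduling any pods
                claims = append(claims, c)
            }
        }
        fmt.Printf("pre-allocated %d claims across %d nodes\n", len(claims), nodes)
    }
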
Patrick Ohly
385599f0a8 scheduler_perf + DRA: measure pod scheduling at a steady state
The previous tests were based on scheduling pods until the cluster was
full. This is a valid scenario, but not necessarily realistic.

More realistic is how quickly the scheduler can schedule new pods when some
old pods have finished running, in particular in a cluster that is properly
utilized (= almost full). To test this, pods must get created, scheduled, and
then immediately deleted. This can run for a certain period of time.

Scenarios with an empty and with a full cluster have different scheduling rates. This was
previously visible for DRA because the 50th percentile of the scheduling
throughput was lower than the average, but one had to guess in which scenario
the throughput was lower. Now this can be measured for DRA with the new
SteadyStateClusterResourceClaimTemplateStructured test.

The metrics collector must watch pod events to figure out how many pods got
scheduled. Polling misses pods that already got deleted again. There seems to
be no relevant difference in the collected
metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):

     │            before            │                     after                     │
     │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base         │
                         157.1 ± 0%                     157.1 ± 0%  ~ (p=0.329 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc50 │ SchedulingThroughput/Perc50  vs base         │
                        48.99 ± 8%                    47.52 ± 9%  ~ (p=0.937 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc90 │ SchedulingThroughput/Perc90  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc95 │ SchedulingThroughput/Perc95  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc99 │ SchedulingThroughput/Perc99  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)
2024-09-25 09:45:39 +02:00
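
A minimal sketch of the watch-based counting, assuming a cluster reachable via the default kubeconfig; the namespace and reporting interval are assumptions, and this is not the scheduler_perf metrics collector.

    // Hedged sketch: count scheduled pods from watch events instead of polling,
    // so pods that are deleted right after scheduling are still counted.
    package main

    import (
        "fmt"
        "sync/atomic"
        "time"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/cache"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        var scheduled atomic.Int64
        factory := informers.NewSharedInformerFactoryWithOptions(client, 0, informers.WithNamespace("test"))
        factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(oldObj, newObj interface{}) {
                oldPod, newPod := oldObj.(*corev1.Pod), newObj.(*corev1.Pod)
                // A pod counts as scheduled when NodeName transitions from empty to set,
                // even if the pod is deleted again immediately afterwards.
                if oldPod.Spec.NodeName == "" && newPod.Spec.NodeName != "" {
                    scheduled.Add(1)
                }
            },
        })

        stop := make(chan struct{})
        defer close(stop)
        factory.Start(stop)
        factory.WaitForCacheSync(stop)

        for range time.Tick(5 * time.Second) {
            fmt.Printf("pods scheduled so far: %d\n", scheduled.Load())
        }
    }
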
Patrick Ohly
51cafb0053 scheduler_perf: more useful errors for configuration mistakes
Before, the first error was reported, which typically was the "invalid op code"
error from the createAny operation:

    scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny"

Now the opcode is determined first, then decoding into exactly the matching operation type is
attempted and validated. Unknown fields are an error.

In the case above, decoding a string into time.Duration failed:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into *benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration

Some typos:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"}

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into *benchmark.createPodsOp: json: unknown field "countParram"
2024-09-25 09:45:39 +02:00
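
The two-phase decode behind those messages can be sketched with plain encoding/json; the op structs below are illustrative stand-ins, not the real scheduler_perf operations.

    // Hedged sketch: determine the opcode first, then decode strictly into the
    // matching op type, with unknown fields treated as errors.
    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    type createPodsOp struct {
        Opcode   string `json:"opcode"`
        Count    int    `json:"count"`
        Duration string `json:"duration,omitempty"` // kept as a string here for simplicity
    }

    type sleepOp struct {
        Opcode   string `json:"opcode"`
        Duration string `json:"duration"`
    }

    func decodeOp(data []byte) (interface{}, error) {
        // Phase 1: look only at the opcode.
        var header struct {
            Opcode string `json:"opcode"`
        }
        if err := json.Unmarshal(data, &header); err != nil {
            return nil, err
        }

        // Phase 2: strict decode into exactly the matching type.
        var op interface{}
        switch header.Opcode {
        case "createPods":
            op = &createPodsOp{}
        case "sleep":
            op = &sleepOp{}
        default:
            return nil, fmt.Errorf("unknown opcode %q in %s", header.Opcode, data)
        }
        dec := json.NewDecoder(bytes.NewReader(data))
        dec.DisallowUnknownFields()
        if err := dec.Decode(op); err != nil {
            return nil, fmt.Errorf("decoding %s into %T: %w", data, op, err)
        }
        return op, nil
    }

    func main() {
        // A typo in a field name is now reported as an unknown field.
        _, err := decodeOp([]byte(`{"opcode":"createPods","countParram":10}`))
        fmt.Println(err)
    }
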
Kubernetes Prow Robot
99ff62e87a
Merge pull request #127491 from SataQiu/fix-etcd-20240920
kubeadm: check whether the peer URL for the added etcd member already exists when the MemberAddAsLearner/MemberAdd fails
2024-09-25 05:08:07 +01:00
Kubernetes Prow Robot
2e6216170b
Merge pull request #127319 from p0lyn0mial/upstream-define-initial-events-list-blueprint
apimachinery/meta/types.go: define InitialEventsListBlueprintAnnotationKey const
2024-09-25 05:08:00 +01:00
Kubernetes Prow Robot
f3a54b68f9
Merge pull request #127579 from chrishenzie/context
Propagate existing ctx instead of context.TODO() in sample-controller
2024-09-25 04:02:06 +01:00
Kubernetes Prow Robot
5dd244ff00
Merge pull request #125796 from haorenfsa/fix-gc-sync-blocked
garbagecollector: controller should not be blocking on failed cache sync
2024-09-25 04:02:00 +01:00
Kubernetes Prow Robot
8ccc878de0
Merge pull request #127583 from mmorel-35/testifylint/disable/require-error
chore: disable require-error rule from testifylint
2024-09-24 23:08:00 +01:00
Kubernetes Prow Robot
e9cde03b91
Merge pull request #127598 from aojea/servicecidr_seconday_dualwrite
bugfix: initialize secondary range registry with the right value
2024-09-24 21:08:08 +01:00
Kubernetes Prow Robot
63fc917521
Merge pull request #127480 from thockin/skip_test_target_normalization
Skip test target normalization
2024-09-24 21:08:01 +01:00
Kubernetes Prow Robot
9e157c5450
Merge pull request #127357 from lengrongfu/feat/add-chan-buffer
add resourceupdates.Update chan buffer
2024-09-24 20:02:01 +01:00
Antonio Ojea
7a9bca3888 bugfix: initialize secondary range registry with the right value
When the MultiCIDRServiceAllocator feature is enabled, we added an
additional feature gate DisableAllocatorDualWrite that allows enabling
a mirror behavior on the old allocator to deal with problems during
cluster upgrades.

During the implementation the secondary range of the legacy allocator
was initialized with the value of the primary range, hence, when a
Service tried to allocate a new IP on the secondary range, it succeeded
in the new IP allocator but failed when it tried to allocate the same IP
on the legacy allocator, since it has a different range.

Expand the integration test that runs over all the combinations of
Service ClusterIP possibilities to run with all the possible
combinations of the feature gates.

The integration test needs to change the way of starting the apiserver,
otherwise it will time out.
2024-09-24 17:48:13 +00:00
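
The bug class is easy to illustrate outside the allocator code: if the legacy secondary allocator is constructed from the primary CIDR, an IP accepted by the new secondary allocator is rejected by the legacy one. A self-contained sketch using net/netip, not the kube-apiserver wiring:

    // Hedged sketch of the bug class; plain net/netip, not the real allocators.
    package main

    import (
        "fmt"
        "net/netip"
    )

    type rangeAllocator struct{ cidr netip.Prefix }

    func (r rangeAllocator) allocate(ip netip.Addr) error {
        if !r.cidr.Contains(ip) {
            return fmt.Errorf("%s is not in range %s", ip, r.cidr)
        }
        return nil // a real allocator would also track which IPs are taken
    }

    func main() {
        primary := netip.MustParsePrefix("10.0.0.0/16")
        secondary := netip.MustParsePrefix("fd00::/108")

        newSecondary := rangeAllocator{cidr: secondary}
        // Bug: legacy secondary allocator initialized with the primary range.
        legacySecondary := rangeAllocator{cidr: primary}

        ip := netip.MustParseAddr("fd00::5")
        fmt.Println("new allocator:   ", newSecondary.allocate(ip))    // succeeds
        fmt.Println("legacy allocator:", legacySecondary.allocate(ip)) // fails: different range
    }
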
Patrick Ohly
7bbb3465e5 scheduler_perf: more realistic structured parameters tests
Real devices are likely to have a handful of attributes and (for GPUs) the
memory as capacity. Most keys will be driver-specific; a few may eventually
have a domain (none are standardized right now).
2024-09-24 18:52:45 +02:00
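
As a rough illustration of "a handful of attributes plus memory as capacity", using plain maps and apimachinery's resource.Quantity; the key names and domains are made up, and these are not the real structured-parameters types.

    // Hedged sketch: illustrative attribute keys and capacity values only.
    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/resource"
    )

    func main() {
        attributes := map[string]string{
            "gpu.example.com/model":         "a100",   // driver-specific key
            "gpu.example.com/driverVersion": "550.54", // driver-specific key
            "dra.example.org/numa":          "0",      // hypothetical domain-qualified key
        }
        capacity := map[string]resource.Quantity{
            "memory": resource.MustParse("80Gi"), // GPU memory as capacity
        }

        mem := capacity["memory"]
        fmt.Println(attributes, mem.String())
    }
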
rongfu.leng
ead64fb8f0 add resourceupdates.Update chan buffer
Signed-off-by: rongfu.leng <lenronfu@gmail.com>
2024-09-24 16:48:32 +00:00
Kubernetes Prow Robot
b071443187
Merge pull request #127592 from dims/wait-for-gpus-even-for-aws-kubetest2-ec2-harness
Wait for GPUs even for AWS kubetest2 ec2 harness
2024-09-24 17:26:08 +01:00
Kubernetes Prow Robot
56071089e2
Merge pull request #127573 from benluddy/dynamic-golden-response-test
Add test for unintended changes to dynamic client response handling.
2024-09-24 17:26:01 +01:00
Tim Hockin
8912df652b
Use Go workspaces + go list to find test targets
Plain old UNIX find requires us to do all sorts of silly filtering.
2024-09-24 09:04:13 -07:00
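
A hedged sketch of the general approach, shelling out to go list so that only packages containing test files are reported; this is not the actual Kubernetes build rule.

    // Hedged sketch: use `go list` instead of `find` to enumerate test targets.
    // The -f template prints an import path only when the package has test files.
    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        out, err := exec.Command("go", "list",
            "-f", "{{if or .TestGoFiles .XTestGoFiles}}{{.ImportPath}}{{end}}",
            "./...").Output()
        if err != nil {
            panic(err)
        }
        fmt.Print(string(out)) // one import path per line; packages without tests print an empty line
    }
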
SataQiu
9af1b25bec kubeadm: check the member list status before adding or removing an etcd member 2024-09-24 22:53:42 +08:00
Kubernetes Prow Robot
4c24b9337f
Merge pull request #127575 from alculquicondor/acondor-apps
Stepping down from SIG Apps reviewers
2024-09-24 15:38:06 +01:00
Kubernetes Prow Robot
9571d3b6c6
Merge pull request #125995 from carlory/remove-unnecessary-permissions
remove unneeded permissions for volume controllers
2024-09-24 15:38:00 +01:00
Davanum Srinivas
472ca3b279
skip control plane nodes, they may not have GPUs
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 10:09:33 -04:00
Kubernetes Prow Robot
6ded721910
Merge pull request #127496 from macsko/add_metricscollectionop_to_scheduler_perf
Add separate ops for collecting metrics from multiple namespaces in scheduler_perf
2024-09-24 14:34:00 +01:00
Davanum Srinivas
349c7136c9
Wait for GPUs even for AWS kubetest2 ec2 harness
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 09:11:18 -04:00
Maciej Skoczeń
a273e5381a Add separate ops for collecting metrics from multiple namespaces in scheduler_perf 2024-09-24 12:28:53 +00:00
Kubernetes Prow Robot
5973accf48
Merge pull request #127570 from soltysh/do_not_return_err
Do not return error where it's not needed
2024-09-24 10:20:01 +01:00
Kubernetes Prow Robot
2ade53e264
Merge pull request #124947 from toVersus/fix/eviction-message
[Sidecar Containers] Consider init containers in eviction message
2024-09-24 08:58:00 +01:00
Matthieu MOREL
64e9fd50ed chore: disable require-error rule from testifylint
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-24 07:17:52 +02:00
Kubernetes Prow Robot
f0036aac21
Merge pull request #127572 from soltysh/reuse_helper
Reuse CreateTestCRD helper for kubectl e2e
2024-09-24 06:05:59 +01:00
Kubernetes Prow Robot
4851ea85e0
Merge pull request #127582 from dims/avoid-collecting-dmesg-when-running-as-daemon
Avoid collecting dmesg when running as daemon
2024-09-24 04:55:59 +01:00
Davanum Srinivas
1dc29b74b9
Avoid collecting dmesg when running as daemon
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-23 21:32:05 -04:00
Kubernetes Prow Robot
94df29b8f2
Merge pull request #127464 from sanposhiho/trigger-nodedelete
fix(eventhandler): trigger Node/Delete event
2024-09-24 02:24:00 +01:00
Kubernetes Prow Robot
1137a6a0cc
Merge pull request #127093 from jpbetz/retry-generate-name-ga
Promote RetryGenerateName to GA
2024-09-24 00:46:06 +01:00
Kubernetes Prow Robot
d6bb550b10
Merge pull request #122890 from HirazawaUi/fix-pod-grace-period
[kubelet]: Fix the bug where pod grace period will be overwritten
2024-09-24 00:45:59 +01:00
Tim Hockin
7d89e9b4c0
Only normalize user-provided test targets 2024-09-23 16:25:29 -07:00
Chris Henzie
3f1c41d53e Propagate existing ctx instead of context.TODO() 2024-09-23 14:40:07 -07:00
Kubernetes Prow Robot
211d67a511
Merge pull request #125398 from AxeZhan/pvAffinity
[scheduler] When the hostname and nodename of a node do not match, ensure that pods carrying PVs with nodeAffinity are scheduled correctly.
2024-09-23 21:22:02 +01:00
Aldo Culquicondor
3d5525ec21 Stepping down from SIG Apps reviewers
Change-Id: I4ec085bfe9b5f65ae9b250bd2a7a519379874425
2024-09-23 19:11:54 +00:00
Kubernetes Prow Robot
851cf43a35
Merge pull request #127487 from hakuna-matatah/jobperf-delete-eventhandler
Offload the main Job reconciler w.r.t cleaning finalizers
2024-09-23 18:08:06 +01:00
Kubernetes Prow Robot
7ff0580bc8
Merge pull request #127458 from ii/promote-volume-attachment-status-test
Promote e2e test for VolumeAttachmentStatus Endpoints +3 Endpoints
2024-09-23 18:08:00 +01:00
Ben Luddy
c8b1037a58
Add test for unintended changes to dynamic client response handling.
The goal is to increase confidence that a change to the dynamic client does not unintentionally
introduce subtle changes to objects returned by dynamic clients in existing programs.
2024-09-23 12:45:22 -04:00
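
A generic sketch of such a golden check (file names and the canned response are illustrative, not the actual test added by the PR): decode a stored API response the way a client would and fail if the decoded object drifts from the committed golden copy.

    // Hedged sketch of a golden-response test; not the dynamic client's own test.
    package golden

    import (
        "encoding/json"
        "os"
        "reflect"
        "testing"
    )

    func TestGoldenResponse(t *testing.T) {
        response := []byte(`{"apiVersion":"v1","kind":"ConfigMap","metadata":{"name":"example"},"data":{"k":"v"}}`)

        var decoded map[string]interface{}
        if err := json.Unmarshal(response, &decoded); err != nil {
            t.Fatal(err)
        }

        goldenBytes, err := os.ReadFile("testdata/configmap.golden.json") // assumption: committed golden file
        if err != nil {
            t.Fatal(err)
        }
        var golden map[string]interface{}
        if err := json.Unmarshal(goldenBytes, &golden); err != nil {
            t.Fatal(err)
        }

        if !reflect.DeepEqual(decoded, golden) {
            t.Errorf("decoded response drifted from golden copy:\n got: %#v\nwant: %#v", decoded, golden)
        }
    }
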
Maciej Szulik
b51d6308a7
Reuse CreateTestCRD helper for kubectl e2e 2024-09-23 18:32:27 +02:00
Maciej Szulik
3bff2b7ee9
Do not return error where it's not needed 2024-09-23 18:12:31 +02:00
Kubernetes Prow Robot
ff391cefe2
Merge pull request #127547 from dims/skip-reinstallation-of-gpu-daemonset
Skip re-installation of GPU daemonset
2024-09-23 15:28:00 +01:00