Commit Graph

125325 Commits

Author SHA1 Message Date
Patrick Ohly
ded96042f7 scheduler_perf + DRA: load up cluster by allocating claims
Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating
claims and then allocating them more or less like the scheduler would when
scheduling pods is much faster and in practice has the same effect on the
dynamicresources plugin because it looks at claims, not pods.

This allows defining the "steady state" workloads with a higher number of
devices ("claimsPerNode") again. This was prohibitively slow before.
2024-09-25 09:45:39 +02:00
Patrick Ohly
385599f0a8 scheduler_perf + DRA: measure pod scheduling at a steady state
The previous tests were based on scheduling pods until the cluster was
full. This is a valid scenario, but not necessarily realistic.

A more realistic measure is how quickly the scheduler can schedule new pods
after some old pods have finished running, in particular in a cluster that is
properly utilized (= almost full). To test this, pods must be created,
scheduled, and then immediately deleted. This can run for a certain period of time.

Scenarios with empty and full cluster have different scheduling rates. This was
previously visible for DRA because the 50th percentile of the scheduling
throughput was lower than the average, but one had to guess in which scenario
the throughput was lower. Now this can be measured for DRA with the new
SteadyStateClusterResourceClaimTemplateStructured test.

The metrics collector must watch pod events to figure out how many pods got
scheduled. Polling misses pods that already got deleted again. There seems to
be no relevant difference in the collected
metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):

     │            before            │                     after                     │
     │ SchedulingThroughput/Average │ SchedulingThroughput/Average  vs base         │
                         157.1 ± 0%                     157.1 ± 0%  ~ (p=0.329 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc50 │ SchedulingThroughput/Perc50  vs base         │
                        48.99 ± 8%                    47.52 ± 9%  ~ (p=0.937 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc90 │ SchedulingThroughput/Perc90  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc95 │ SchedulingThroughput/Perc95  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)

     │           before            │                    after                     │
     │ SchedulingThroughput/Perc99 │ SchedulingThroughput/Perc99  vs base         │
                       463.9 ± 16%                   460.1 ± 13%  ~ (p=0.818 n=6)
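
Why polling undercounts in the steady-state scenario can be illustrated with a
small simulation. This is only a sketch under stated assumptions, not the
scheduler_perf collector: the event type, steadyStateEvents, countByEvents, and
countByPolling are hypothetical names, and a "poll" is modeled as listing an
in-memory store after every few events.

```go
package main

import "fmt"

type eventKind int

const (
	added eventKind = iota
	scheduled
	deleted
)

// event is a hypothetical, simplified pod event.
type event struct {
	kind eventKind
	pod  string
}

// steadyStateEvents models n pods that are each created, scheduled, and
// immediately deleted, one after the other.
func steadyStateEvents(n int) []event {
	var events []event
	for i := 0; i < n; i++ {
		pod := fmt.Sprintf("pod-%d", i)
		events = append(events, event{added, pod}, event{scheduled, pod}, event{deleted, pod})
	}
	return events
}

// countByEvents counts every pod for which a scheduled event was observed.
func countByEvents(events []event) int {
	seen := map[string]bool{}
	for _, e := range events {
		if e.kind == scheduled {
			seen[e.pod] = true
		}
	}
	return len(seen)
}

// countByPolling replays the events into a store and lists the store after
// every pollEvery events (pollEvery must be positive). Pods that were deleted
// before the next poll are never seen.
func countByPolling(events []event, pollEvery int) int {
	store := map[string]bool{} // pod name -> already scheduled?
	seen := map[string]bool{}
	for i, e := range events {
		switch e.kind {
		case added:
			store[e.pod] = false
		case scheduled:
			store[e.pod] = true
		case deleted:
			delete(store, e.pod)
		}
		if (i+1)%pollEvery == 0 {
			for pod, isScheduled := range store {
				if isScheduled {
					seen[pod] = true
				}
			}
		}
	}
	return len(seen)
}

func main() {
	events := steadyStateEvents(10)
	fmt.Println("counted via events: ", countByEvents(events))
	fmt.Println("counted via polling:", countByPolling(events, 3))
}
```

In this model, a poll that only runs after each pod has already been deleted
sees an empty store, while the event-based count still sees every pod.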
2024-09-25 09:45:39 +02:00
Patrick Ohly
51cafb0053 scheduler_perf: more useful errors for configuration mistakes
Before, the first error was reported, which typically was the "invalid op code"
error from the createAny operation:

    scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny"

Now the opcode is determined first; then the input is decoded into exactly the
matching operation and validated. Unknown fields are an error.

In the case above, decoding a string into time.Duration failed:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into *benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration

Some typos are now reported like this:

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"}

    scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into *benchmark.createPodsOp: json: unknown field "countParram"
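
The approach described above (determine the opcode first, then decode strictly
into the matching type) can be sketched roughly as follows. This is a minimal
illustration using only encoding/json from the standard library, not the
scheduler_perf implementation: the op structs, their fields, and decodeOp are
simplified, hypothetical stand-ins, and the error wording only approximates the
messages shown above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// sleepOp and createPodsOp are hypothetical, reduced versions of op types.
type sleepOp struct {
	Opcode   string `json:"opcode"`
	Duration string `json:"duration"`
}

type createPodsOp struct {
	Opcode     string `json:"opcode"`
	Count      int    `json:"count"`
	CountParam string `json:"countParam"`
}

// decodeOp first reads only the opcode, then decodes the same input into
// exactly the matching op type and rejects unknown fields.
func decodeOp(data []byte) (any, error) {
	var header struct {
		Opcode string `json:"opcode"`
	}
	if err := json.Unmarshal(data, &header); err != nil {
		return nil, fmt.Errorf("reading opcode: %w", err)
	}

	var op any
	switch header.Opcode {
	case "sleep":
		op = &sleepOp{}
	case "createPods":
		op = &createPodsOp{}
	default:
		return nil, fmt.Errorf("unknown opcode %q in %s", header.Opcode, data)
	}

	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields() // a misspelled field becomes an error
	if err := dec.Decode(op); err != nil {
		return nil, fmt.Errorf("decoding %s into %T: %w", data, op, err)
	}
	return op, nil
}

func main() {
	inputs := []string{
		`{"count":10,"opcode":"createPods"}`,
		`{"duration":"5s","opcode":"sleeep"}`,
		`{"countParram":"$deletingPods","opcode":"createPods"}`,
	}
	for _, in := range inputs {
		op, err := decodeOp([]byte(in))
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("decoded %T\n", op)
	}
}
```

Because only one target type is tried, the reported error is the one that
actually applies to the op the author intended, instead of the first failure
from an unrelated op type.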
2024-09-25 09:45:39 +02:00
Patrick Ohly
7bbb3465e5 scheduler_perf: more realistic structured parameters tests
Real devices are likely to have a handful of attributes and (for GPUs) the
memory as capacity. Most keys will be driver specific, a few may eventually
have a domain (none standardized right now).
2024-09-24 18:52:45 +02:00
Kubernetes Prow Robot
b071443187
Merge pull request #127592 from dims/wait-for-gpus-even-for-aws-kubetest2-ec2-harness
Wait for GPUs even for AWS kubetest2 ec2 harness
2024-09-24 17:26:08 +01:00
Kubernetes Prow Robot
56071089e2
Merge pull request #127573 from benluddy/dynamic-golden-response-test
Add test for unintended changes to dynamic client response handling.
2024-09-24 17:26:01 +01:00
Kubernetes Prow Robot
4c24b9337f
Merge pull request #127575 from alculquicondor/acondor-apps
Stepping down from SIG Apps reviewers
2024-09-24 15:38:06 +01:00
Kubernetes Prow Robot
9571d3b6c6
Merge pull request #125995 from carlory/remove-unnecessary-permissions
remove unneeded permissions for volume controllers
2024-09-24 15:38:00 +01:00
Davanum Srinivas
472ca3b279
skip control plane nodes, they may not have GPUs
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 10:09:33 -04:00
Kubernetes Prow Robot
6ded721910
Merge pull request #127496 from macsko/add_metricscollectionop_to_scheduler_perf
Add separate ops for collecting metrics from multiple namespaces in scheduler_perf
2024-09-24 14:34:00 +01:00
Davanum Srinivas
349c7136c9
Wait for GPUs even for AWS kubetest2 ec2 harness
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-24 09:11:18 -04:00
Maciej Skoczeń
a273e5381a Add separate ops for collecting metrics from multiple namespaces in scheduler_perf 2024-09-24 12:28:53 +00:00
Kubernetes Prow Robot
5973accf48
Merge pull request #127570 from soltysh/do_not_return_err
Do not return error where it's not needed
2024-09-24 10:20:01 +01:00
Kubernetes Prow Robot
2ade53e264
Merge pull request #124947 from toVersus/fix/eviction-message
[Sidecar Containers] Consider init containers in eviction message
2024-09-24 08:58:00 +01:00
Kubernetes Prow Robot
f0036aac21
Merge pull request #127572 from soltysh/reuse_helper
Reuse CreateTestCRD helper for kubectl e2e
2024-09-24 06:05:59 +01:00
Kubernetes Prow Robot
4851ea85e0
Merge pull request #127582 from dims/avoid-collecting-dmesg-when-running-as-daemon
Avoid collecting dmesg when running as daemon
2024-09-24 04:55:59 +01:00
Davanum Srinivas
1dc29b74b9
Avoid collecting dmesg when running as daemon
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-23 21:32:05 -04:00
Kubernetes Prow Robot
94df29b8f2
Merge pull request #127464 from sanposhiho/trigger-nodedelete
fix(eventhandler): trigger Node/Delete event
2024-09-24 02:24:00 +01:00
Kubernetes Prow Robot
1137a6a0cc
Merge pull request #127093 from jpbetz/retry-generate-name-ga
Promote RetryGenerateName to GA
2024-09-24 00:46:06 +01:00
Kubernetes Prow Robot
d6bb550b10
Merge pull request #122890 from HirazawaUi/fix-pod-grace-period
[kubelet]: Fix the bug where pod grace period will be overwritten
2024-09-24 00:45:59 +01:00
Kubernetes Prow Robot
211d67a511
Merge pull request #125398 from AxeZhan/pvAffinity
[scheduler] When the hostname and nodename of a node do not match, ensure that pods carrying PVs with nodeAffinity are scheduled correctly.
2024-09-23 21:22:02 +01:00
Aldo Culquicondor
3d5525ec21 Stepping down from SIG Apps reviewers
Change-Id: I4ec085bfe9b5f65ae9b250bd2a7a519379874425
2024-09-23 19:11:54 +00:00
Kubernetes Prow Robot
851cf43a35
Merge pull request #127487 from hakuna-matatah/jobperf-delete-eventhandler
Offload the main Job reconciler w.r.t cleaning finalizers
2024-09-23 18:08:06 +01:00
Kubernetes Prow Robot
7ff0580bc8
Merge pull request #127458 from ii/promote-volume-attachment-status-test
Promote e2e test for VolumeAttachmentStatus Endpoints +3 Endpoints
2024-09-23 18:08:00 +01:00
Ben Luddy
c8b1037a58
Add test for unintended changes to dynamic client response handling.
The goal is to increase confidence that a change to the dynamic client does not unintentionally
introduce subtle changes to objects returned by dynamic clients in existing programs.
2024-09-23 12:45:22 -04:00
Maciej Szulik
b51d6308a7
Reuse CreateTestCRD helper for kubectl e2e 2024-09-23 18:32:27 +02:00
Maciej Szulik
3bff2b7ee9
Do not return error where it's not needed 2024-09-23 18:12:31 +02:00
Kubernetes Prow Robot
ff391cefe2
Merge pull request #127547 from dims/skip-reinstallation-of-gpu-daemonset
Skip re-installation of GPU daemonset
2024-09-23 15:28:00 +01:00
Kubernetes Prow Robot
f187480140
Merge pull request #127558 from pohly/e2e-framework-docs
e2e framework: better documentation of ExpectNoError
2024-09-23 14:12:00 +01:00
Kubernetes Prow Robot
c9d6fd9ff7
Merge pull request #127500 from p0lyn0mial/upstream-assign-rv-to-watchCacheInterval
cacher: prevents sending events with ResourceVersion < RequiredResourceVersion
2024-09-23 12:51:59 +01:00
Kubernetes Prow Robot
15d08bf7c8
Merge pull request #127323 from vrutkovs/tracing-cacher-get
tracing: add span for get cacher
2024-09-23 10:27:59 +01:00
Patrick Ohly
e5aa609513 e2e framework: better documentation of ExpectNoError
It wasn't clear from the comments what "explain" does, leading to calls like
this:

   framework.ExpectNoError(fmt.Errorf("additional info ....: %v", ..., err))
2024-09-23 10:58:06 +02:00
Kubernetes Prow Robot
df5787a57f
Merge pull request #127540 from mmorel-35/testifylint/error-is-as@k8s.io/apiserver
fix: enable error-is-as rule from testifylint in module `k8s.io/apiserver`
2024-09-23 09:06:13 +01:00
Kubernetes Prow Robot
19500e8551
Merge pull request #127524 from mjudeikis/mjudeikis/extend.group.manager
Add GroupLister interface to discovery GroupManager
2024-09-23 09:06:06 +01:00
Kubernetes Prow Robot
89f418f29e
Merge pull request #127481 from kannon92/fix-mount-propogation-flake
Use the last kubelet pid in the pidof command
2024-09-23 09:05:59 +01:00
Kubernetes Prow Robot
e456fbfaa6
Merge pull request #127545 from mjudeikis/mjudeikis/sa.flow.fix
Fix npe in serviceAccount flow
2024-09-23 08:00:06 +01:00
Kubernetes Prow Robot
257d6f3f5b
Merge pull request #127512 from bergerhoffer/adding-interactive-delete
Adding example for interactive delete
2024-09-23 07:59:59 +01:00
Kubernetes Prow Robot
25aa9cd074
Merge pull request #127534 from mmorel-35/testifylint/contains@k8s.io/kubectl
fix: enable contains rule from testifylint in module `k8s.io/kubectl`
2024-09-23 05:53:59 +01:00
Kubernetes Prow Robot
4c2e239047
Merge pull request #126799 from kiashok/update-cadvisor-hcsshim
Update cadvisor and hcsshim versions
2024-09-23 02:39:58 +01:00
Davanum Srinivas
1abbb00067
Double a couple of other timeouts
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-22 19:36:39 -04:00
Kubernetes Prow Robot
fd44b5bf3b
Merge pull request #127544 from mjudeikis/mjudeikis/npe.check.fix
Fix npe when running in limited config in generic-control-plane mode
2024-09-22 22:30:05 +01:00
Kubernetes Prow Robot
5f0b2a8a26
Merge pull request #127533 from mmorel-35/testifylint/blank-import
fix: enable blank-import rule from testifylint
2024-09-22 22:29:58 +01:00
Kubernetes Prow Robot
f7085634de
Merge pull request #127529 from mmorel-35/testifylint/compares@k8s.io/apiserver
fix: enable compares rule from testifylint in module k8s.io/apiserver
2024-09-22 21:26:05 +01:00
Kubernetes Prow Robot
6bd57ffc5c
Merge pull request #127527 from mmorel-35/testifylint/compares@k8s.io/client-go
fix: enable compares rule from testifylint in module k8s.io/client-go
2024-09-22 21:25:59 +01:00
Kubernetes Prow Robot
5253ca0511
Merge pull request #127528 from mmorel-35/testifylint/compares@k8s.io/kubernetes
fix: enable compares rule from testifylint in module k8s.io/kubernetes
2024-09-22 20:19:59 +01:00
Kirtana Ashok
3fba9930b7 Update cadvisor and hcsshim versions
Signed-off-by: Kirtana Ashok <kiashok@microsoft.com>
2024-09-22 11:50:45 -07:00
Davanum Srinivas
92683139d7
Skip re-installation of GPU daemonset
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
2024-09-22 13:54:12 -04:00
Mangirdas Judeikis
4783af9a49 fix npe when running in limited config in generic-control-plane mode 2024-09-22 19:06:45 +03:00
Kensei Nakada
421f87a4e3 feat: add a requeueing integration test for PodTopologySpread with Node/delete event (QHint: disabled) 2024-09-23 00:29:56 +09:00
Matthieu MOREL
0dfc6e2843 fix: enable error-is-as rule from testifylint in module k8s.io/apiserver
Signed-off-by: Matthieu MOREL <matthieu.morel35@gmail.com>
2024-09-22 15:08:25 +00:00