kubernetes

mirror of https://github.com/k3s-io/kubernetes.git synced 2025-11-11 04:20:49 +00:00

Author	SHA1	Message	Date
Kubernetes Prow Robot	22a30e7cbb	Merge pull request #127700 from macsko/add_option_waitforpodsprocessed Add option to wait for pods to be attempted in barrierOp in scheduler_perf	2024-10-01 05:17:49 +01:00
Maciej Skoczeń	fdbf21e03a	Allow to filter pods using labels while collecting metrics in scheduler_perf	2024-09-30 13:32:12 +00:00
Maciej Skoczeń	928670061d	Allow to wait for pods to be attempted in barrierOp in scheduler_perf	2024-09-30 08:07:15 +00:00
Maciej Skoczeń	837d917d91	Make sleepOp duration parametrizable in scheduler_perf	2024-09-26 13:07:22 +00:00
Maciej Skoczeń	40154baab0	Add updateAnyOp to scheduler_perf	2024-09-25 12:42:25 +00:00
Patrick Ohly	d100768d94	scheduler_perf: track and visualize progress over time This is useful to see whether pod scheduling happens in bursts and how it behaves over time, which is relevant in particular for dynamic resource allocation where it may become harder at the end to find the node which still has resources available. Besides "pods scheduled" it's also useful to know how many attempts were needed, so schedule_attempts_total also gets sampled and stored. To visualize the result of one or more test runs, use: gnuplot.sh *.dat	2024-09-25 11:09:15 +02:00
Patrick Ohly	ded96042f7	scheduler_perf + DRA: load up cluster by allocating claims Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating claims and then allocating them more or less like the scheduler would when scheduling pods is much faster and in practice has the same effect on the dynamicresources plugin because it looks at claims, not pods. This allows defining the "steady state" workloads with higher number of devices ("claimsPerNode") again. This was prohibitively slow before.	2024-09-25 09:45:39 +02:00
Patrick Ohly	385599f0a8	scheduler_perf + DRA: measure pod scheduling at a steady state The previous tests were based on scheduling pods until the cluster was full. This is a valid scenario, but not necessarily realistic. More realistic is how quickly the scheduler can schedule new pods when some old pods finished running, in particular in a cluster that is properly utilized (= almost full). To test this, pods must get created, scheduled, and then immediately deleted. This can run for a certain period of time. Scenarios with empty and full cluster have different scheduling rates. This was previously visible for DRA because the 50% percentile of the scheduling throughput was lower than the average, but one had to guess in which scenario the throughput was lower. Now this can be measured for DRA with the new SteadyStateClusterResourceClaimTemplateStructured test. The metrics collector must watch pod events to figure out how many pods got scheduled. Polling misses pods that already got deleted again. There seems to be no relevant difference in the collected metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions; values are before vs. after): SchedulingThroughput/Average: 157.1 ± 0% vs. 157.1 ± 0%, ~ (p=0.329 n=6); SchedulingThroughput/Perc50: 48.99 ± 8% vs. 47.52 ± 9%, ~ (p=0.937 n=6); SchedulingThroughput/Perc90: 463.9 ± 16% vs. 460.1 ± 13%, ~ (p=0.818 n=6); SchedulingThroughput/Perc95: 463.9 ± 16% vs. 460.1 ± 13%, ~ (p=0.818 n=6); SchedulingThroughput/Perc99: 463.9 ± 16% vs. 460.1 ± 13%, ~ (p=0.818 n=6)	2024-09-25 09:45:39 +02:00
Patrick Ohly	51cafb0053	scheduler_perf: more useful errors for configuration mistakes Before, the first error was reported, which typically was the "invalid op code" error from the createAny operation: scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny" Now the opcode is determined first, then decoding into exactly the matching operation is tried and validated. Unknown fields are an error. In the case above, decoding a string into time.Duration failed: scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration Some typos: scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"} scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into benchmark.createPodsOp: json: unknown field "countParram"	2024-09-25 09:45:39 +02:00
Patrick Ohly	7bbb3465e5	scheduler_perf: more realistic structured parameters tests Real devices are likely to have a handful of attributes and (for GPUs) the memory as capacity. Most keys will be driver specific, a few may eventually have a domain (none standardized right now).	2024-09-24 18:52:45 +02:00
Maciej Skoczeń	a273e5381a	Add separate ops for collecting metrics from multiple namespaces in scheduler_perf	2024-09-24 12:28:53 +00:00
Maciej Skoczeń	287b61918a	Add deletePodsOp to scheduler_perf	2024-09-20 09:46:27 +00:00
Kubernetes Prow Robot	2850d302ca	Merge pull request #127269 from sanposhiho/patch-11 chore: tidy up labels in scheduler-perf	2024-09-19 04:18:44 +01:00
Maciej Skoczeń	2d4d7e0b5f	Fix opIndex in log when deleting pod failed in scheduler_perf	2024-09-16 13:48:24 +00:00
Kensei Nakada	898cb15b18	chore: clarify the labels in scheduler-perf	2024-09-14 15:39:54 +09:00
Maciej Skoczeń	9f6fdf1b77	Decrease number of integration tests in scheduler_perf	2024-09-12 15:13:53 +00:00
Kubernetes Prow Robot	c3ebd95c83	Merge pull request #127236 from macsko/scheduler_perf_test_case_for_hints_memory_leak_scenario Add scheduler_perf test case for queueing hints memory leak scenario	2024-09-11 16:03:11 +01:00
Maciej Skoczeń	c1f7b8e9f1	Measure event_handling and QHints duration metrics in scheduler_perf	2024-09-10 10:45:19 +00:00
Maciej Skoczeń	dba24fde78	Add scheduler_perf test case for queueing hints memory leak scenario	2024-09-10 08:15:10 +00:00
Kubernetes Prow Robot	abc056843c	Merge pull request #127238 from macsko/make_scheduler_perf_integration_tests_shorter Make scheduler_perf integration tests shorter	2024-09-10 03:17:14 +01:00
Maciej Skoczeń	ccf86f1709	Make scheduler_perf integration tests shorter	2024-09-09 09:32:13 +00:00
Maciej Skoczeń	7d4c713520	Check if InFlightEvents is empty after scheduler_perf workload	2024-09-09 08:00:34 +00:00
Maciej Skoczeń	3047ab73f5	Reset only metrics configured in collector before the createPodsOp	2024-09-06 08:26:20 +00:00
Kubernetes Prow Robot	08dd9951f5	Merge pull request #126886 from pohly/scheduler-perf-output scheduler_perf: output	2024-08-26 22:23:40 +01:00
Kubernetes Prow Robot	8bbc0636b9	Merge pull request #126911 from macsko/scheduler_perf_throughput_fixes Fix wrong throughput threshold for one scheduler_perf test case	2024-08-26 18:42:17 +01:00
Kubernetes Prow Robot	0bcbc3b77a	Merge pull request #124003 from carlory/scheduler-rm-non-csi-limit kube-scheduler remove non-csi volumelimit plugins	2024-08-26 12:02:13 +01:00
Maciej Skoczeń	7a88548755	Add workload name to failed threshold log	2024-08-26 07:44:52 +00:00
Maciej Skoczeń	71c9b9e2b0	Fix wrong throughput threshold for SchedulingRequiredPodAntiAffinityWithNSSelector test	2024-08-26 07:40:04 +00:00
Maciej Skoczeń	48dc6ff43c	Disable scheduler_perf performance DRA tests	2024-08-26 07:35:36 +00:00
Kubernetes Prow Robot	605e94f6df	Merge pull request #126871 from macsko/set_thresholds_in_scheduler_perf Set scheduling throughput thresholds in scheduler_perf tests	2024-08-23 16:39:54 +01:00
Maciej Skoczeń	48a8cb2bc5	Document throughput thresholds in scheduler_perf readme	2024-08-23 14:22:48 +00:00
Patrick Ohly	bf1188d292	scheduler_perf: only store log output after failures Reconfiguring the logging infrastructure with a per-test output file mimics the behavior of per-test output (log output captured only on failures) while still using the normal logging code, which is important for benchmarking. To enable this behavior, the ARTIFACT env variable must be set.	2024-08-23 16:02:45 +02:00
Maciej Skoczeń	d0e3fc3561	Set scheduling throughput thresholds in scheduler_perf tests	2024-08-23 12:48:28 +00:00
Kubernetes Prow Robot	a1fc2551ba	Merge pull request #126144 from likakuli/cleanup-unusedparamters cleanup: remove scheduler_perf unused parameters	2024-08-22 19:29:40 +01:00
Maciej Skoczeń	77372cf3cf	Label short workloads in scheduler_perf tests	2024-08-20 10:04:30 +00:00
Maciej Skoczeń	09fc399837	Add label to select short workloads in scheduler_perf tests	2024-08-20 10:04:30 +00:00
Maciej Skoczeń	a2cd8aa539	Make smaller workloads for scheduler_perf integration tests	2024-08-20 10:04:25 +00:00
Kubernetes Prow Robot	983875b2f5	Merge pull request #126337 from macsko/add_larger_scheduler_perf_test_cases Add larger scheduler_perf test cases	2024-08-16 09:44:38 -07:00
Maciej Skoczeń	3b7b50a2cc	Create fresh etcd instance for each workload in scheduler_perf	2024-08-16 08:19:52 +00:00
Maciej Skoczeń	5894e201fa	Measure metrics only during a specific op in scheduler_perf	2024-08-13 12:34:06 +00:00
carlory	cba2b3f773	kube-scheduler remove non-csi volumelimit plugins	2024-08-05 15:02:32 +08:00
Maciej Skoczeń	1747483922	Add larger scheduler_perf test cases	2024-07-25 14:20:51 +00:00
Maciej Skoczeń	c15cdf7431	Init etcd and apiserver per test case in scheduler_perf integration tests	2024-07-23 09:10:01 +00:00
Patrick Ohly	9f36c8d718	DRA: add DRAControlPlaneController feature gate for "classic DRA" In the API, the effect of the feature gate is that alpha fields get dropped on create. They get preserved during updates if already set. The PodSchedulingContext registration is not restricted by the feature gate. This enables deleting stale PodSchedulingContext objects after disabling the feature gate. The scheduler checks the new feature gate before setting up an informer for PodSchedulingContext objects and when deciding whether it can schedule a pod. If any claim depends on a control plane controller, the scheduler bails out, leading to: Status: Pending ... Warning FailedScheduling 73s default-scheduler 0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling. The rest of the changes prepare for testing the new feature separately from "structured parameters". The goal is to have base "dra" jobs which just enable and test those, then "classic-dra" jobs which add DRAControlPlaneController.	2024-07-22 18:09:34 +02:00
Patrick Ohly	599fe605f9	DRA scheduler: adapt to v1alpha3 API The structured parameter allocation logic was written from scratch in staging/src/k8s.io/dynamic-resource-allocation/structured where it might be useful for out-of-tree components. Besides the new features (amount, admin access) and API it now supports backtracking when the initial device selection doesn't lead to a complete allocation of all claims. Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com> Co-authored-by: John Belamaric <jbelamaric@google.com>	2024-07-22 18:09:34 +02:00
Patrick Ohly	8a629b9f15	DRA: remove "sharable" from claim allocation result Now all claims are shareable up to the limit imposed by the size of the "reserverFor" array. This is one of the agreed simplifications for 1.31.	2024-07-21 17:28:14 +02:00
Patrick Ohly	b51d68bb87	DRA: bump API v1alpha2 -> v1alpha3 This is in preparation for revamping the resource.k8s.io completely. Because there will be no support for transitioning from v1alpha2 to v1alpha3, the roundtrip test data for that API in 1.29 and 1.30 gets removed. Repeating the version in the import name of the API packages is not really required. It was done for a while to support simpler grepping for usage of alpha APIs, but there are better ways for that now. So during this transition, "resourceapi" gets used instead of "resourcev1alpha3" and the version gets dropped from informer and lister imports. The advantage is that the next bump to v1beta1 will affect fewer source code lines. Only source code where the version really matters (like API registration) retains the versioned import.	2024-07-21 17:28:13 +02:00
likakuli	ef9e1c39e9	cleanup: remove unused parameters Signed-off-by: likakuli <1154584512@qq.com>	2024-07-17 16:27:12 +08:00
Kubernetes Prow Robot	a6460c4f3e	Merge pull request #126036 from macsko/scheduler_perf_throughput_thresholds Allow to set scheduling throughput thresholds in scheduler_perf tests	2024-07-16 21:43:13 -07:00
Maciej Skoczeń	767d2a3e5e	Allow to set scheduling throughput thresholds in scheduler_perf tests	2024-07-15 08:06:21 +00:00

387 Commits