fix if condition
add test
add log
eliminate unnecessary args from log
fix Queue condition
check original pod status
fix return value when schedulable
fix tweak
fix testcase
* endpoints/handlers/get: intro watchListEndpointRestrictions
* consistencydetector/list_data_consistency_detector: expose IsDataConsistencyDetectionForListEnabled
* e2e/watchlist: extract common function for adding unstructured secrets
* e2e/watchlist: new e2e scenarios for covering watchListEndpointRestrictions
This is useful to see whether pod scheduling happens in bursts and how it
behaves over time, which is particularly relevant for dynamic resource
allocation, where it may become harder towards the end to find a node which
still has resources available.
Besides "pods scheduled" it's also useful to know how many attempts were
needed, so schedule_attempts_total also gets sampled and stored.
To visualize the result of one or more test runs, use:
gnuplot.sh *.dat
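To make the sampling concrete, here is a minimal sketch of the kind of periodic
collection this implies; the getter functions and the "seconds scheduled attempts"
column layout are illustrative assumptions, not the actual scheduler_perf code:

    // Sketch: every interval, append one "seconds scheduled attempts" line to a
    // .dat file that gnuplot can plot. The counter getters stand in for the
    // real "pods scheduled" and schedule_attempts_total metrics.
    package main

    import (
        "context"
        "fmt"
        "os"
        "time"
    )

    func sample(ctx context.Context, out *os.File, interval time.Duration,
        scheduledPods, scheduleAttempts func() int64) {
        start := time.Now()
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case now := <-ticker.C:
                fmt.Fprintf(out, "%.1f %d %d\n",
                    now.Sub(start).Seconds(), scheduledPods(), scheduleAttempts())
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()
        // Dummy getters; the real collector reads the scheduler metrics.
        sample(ctx, os.Stdout, time.Second,
            func() int64 { return 0 },
            func() int64 { return 0 })
    }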
Having to schedule 4999 pods to simulate a "full" cluster is slow. Creating
claims and then allocating them more or less like the scheduler would when
scheduling pods is much faster and in practice has the same effect on the
dynamicresources plugin because it looks at claims, not pods.
This allows defining the "steady state" workloads with higher number of
devices ("claimsPerNode") again. This was prohibitively slow before.
The previous tests were based on scheduling pods until the cluster was
full. This is a valid scenario, but not necessarily realistic.
More realistic is how quickly the scheduler can schedule new pods when some
old pods have finished running, in particular in a cluster that is properly
utilized (= almost full). To test this, pods must get created, scheduled, and
then immediately deleted. This can run for a certain period of time.
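A minimal sketch of such a steady-state loop, with placeholder callbacks
instead of the real test harness operations:

    // Keep the cluster near full and, for a fixed duration, create a pod, wait
    // until it is scheduled, then delete it again. The callbacks are
    // placeholders, not the real scheduler_perf operations.
    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func steadyState(ctx context.Context, duration time.Duration,
        create func() string, waitScheduled func(string) error, del func(string)) int {
        scheduled := 0
        deadline := time.Now().Add(duration)
        for ctx.Err() == nil && time.Now().Before(deadline) {
            name := create()
            err := waitScheduled(name)
            del(name) // delete immediately to keep the cluster near full
            if err == nil {
                scheduled++
            }
        }
        return scheduled
    }

    func main() {
        n := steadyState(context.Background(), 10*time.Millisecond,
            func() string { return "pod" },
            func(string) error { return nil },
            func(string) {})
        fmt.Println("pods scheduled and deleted:", n)
    }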
Scenarios with empty and full cluster have different scheduling rates. This was
previously visible for DRA because the 50th percentile of the scheduling
throughput was lower than the average, but one had to guess in which scenario
the throughput was lower. Now this can be measured for DRA with the new
SteadyStateClusterResourceClaimTemplateStructured test.
The metrics collector must watch pod events to figure out how many pods got
scheduled. Polling misses pods that already got deleted again. There seems to
be no relevant difference in the collected
metrics (SchedulingWithResourceClaimTemplateStructured/2000pods_200nodes, 6 repetitions):
│ metric                       │   before    │    after    │     vs base     │
│ SchedulingThroughput/Average │ 157.1 ±  0% │ 157.1 ±  0% │ ~ (p=0.329 n=6) │
│ SchedulingThroughput/Perc50  │ 48.99 ±  8% │ 47.52 ±  9% │ ~ (p=0.937 n=6) │
│ SchedulingThroughput/Perc90  │ 463.9 ± 16% │ 460.1 ± 13% │ ~ (p=0.818 n=6) │
│ SchedulingThroughput/Perc95  │ 463.9 ± 16% │ 460.1 ± 13% │ ~ (p=0.818 n=6) │
│ SchedulingThroughput/Perc99  │ 463.9 ± 16% │ 460.1 ± 13% │ ~ (p=0.818 n=6) │
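For illustration, a sketch of watch-based counting with client-go informers;
this is not the actual scheduler_perf collector, and the helper name and the
counting criterion (a pod gaining a non-empty spec.nodeName) are assumptions:

    // Count scheduled pods via watch events instead of polling, so pods that
    // get scheduled and then deleted shortly afterwards are still counted.
    package main

    import (
        "context"
        "fmt"
        "sync/atomic"
        "time"

        v1 "k8s.io/api/core/v1"
        "k8s.io/client-go/informers"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/kubernetes/fake"
        "k8s.io/client-go/tools/cache"
    )

    func countScheduledPods(ctx context.Context, client kubernetes.Interface) *atomic.Int64 {
        var scheduled atomic.Int64
        factory := informers.NewSharedInformerFactory(client, 0)
        podInformer := factory.Core().V1().Pods().Informer()
        podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                if pod, ok := obj.(*v1.Pod); ok && pod.Spec.NodeName != "" {
                    scheduled.Add(1)
                }
            },
            UpdateFunc: func(oldObj, newObj interface{}) {
                oldPod, ok1 := oldObj.(*v1.Pod)
                newPod, ok2 := newObj.(*v1.Pod)
                if ok1 && ok2 && oldPod.Spec.NodeName == "" && newPod.Spec.NodeName != "" {
                    scheduled.Add(1)
                }
            },
        })
        factory.Start(ctx.Done())
        factory.WaitForCacheSync(ctx.Done())
        return &scheduled
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        counter := countScheduledPods(ctx, fake.NewSimpleClientset())
        <-ctx.Done()
        fmt.Println("scheduled pods observed:", counter.Load())
    }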
Before, the first error was reported, which typically was the "invalid opcode"
error from the createAny operation:
scheduler_perf.go:900: parsing test cases error: error unmarshaling JSON: while decoding JSON: cannot unmarshal {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into any known op type: invalid opcode "createPods"; expected "createAny"
Now the opcode is determined first, then decoding into exactly the matching
operation type is attempted and validated. Unknown fields are an error. A sketch
of this two-phase decoding follows at the end of this message.
In the case above, decoding a string into time.Duration failed:
scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"collectMetrics":true,"count":10,"duration":"30s","namespace":"test","opcode":"createPods","podTemplatePath":"config/dra/pod-with-claim-template.yaml","steadyState":true} into *benchmark.createPodsOp: json: cannot unmarshal string into Go struct field createPodsOp.Duration of type time.Duration
Some typos:
scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: unknown opcode "sleeep" in {"duration":"5s","opcode":"sleeep"}
scheduler_test.go:29: parsing test cases error: error unmarshaling JSON: while decoding JSON: decoding {"countParram":"$deletingPods","deletePodsPerSecond":50,"opcode":"createPods"} into *benchmark.createPodsOp: json: unknown field "countParram"
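The decoding scheme itself can be sketched with plain encoding/json: decode
only the opcode first, then decode strictly into the matching op struct with
DisallowUnknownFields. The op types below are trimmed-down examples, not the
real scheduler_perf operations:

    // Two-phase decoding: read the opcode, pick the matching op type, then
    // decode strictly so unknown fields and bad field types are reported
    // against the right operation.
    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
    )

    type createPodsOp struct {
        Opcode    string `json:"opcode"`
        Count     int    `json:"count"`
        Namespace string `json:"namespace"`
    }

    type sleepOp struct {
        Opcode   string `json:"opcode"`
        Duration string `json:"duration"`
    }

    func decodeOp(data []byte) (interface{}, error) {
        var header struct {
            Opcode string `json:"opcode"`
        }
        if err := json.Unmarshal(data, &header); err != nil {
            return nil, err
        }
        var op interface{}
        switch header.Opcode {
        case "createPods":
            op = &createPodsOp{}
        case "sleep":
            op = &sleepOp{}
        default:
            return nil, fmt.Errorf("unknown opcode %q in %s", header.Opcode, string(data))
        }
        dec := json.NewDecoder(bytes.NewReader(data))
        dec.DisallowUnknownFields()
        if err := dec.Decode(op); err != nil {
            return nil, fmt.Errorf("decoding %s into %T: %w", string(data), op, err)
        }
        return op, nil
    }

    func main() {
        _, err := decodeOp([]byte(`{"opcode":"sleeep","duration":"5s"}`))
        fmt.Println(err) // unknown opcode "sleeep" in {"opcode":"sleeep","duration":"5s"}
    }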
Now that imports aren't automatically added, the client-go generator
produces broken code for extensions since it references a few
functions and types directly without declaring them properly
(klog.Warningf, time.Duration, time.Second).
The generated code also references c.client and c.ns which are no
longer accessible following the generic refactor.
This fixes both issues by adding missing template functions and types,
and going through the appropriate getters.
Signed-off-by: Stephen Kitt <skitt@redhat.com>
When the MultiCIDRServiceAllocator feature is enabled, we added an
additional feature gate, DisableAllocatorDualWrite, that allows enabling
a mirror behavior on the old allocator to deal with problems during
cluster upgrades.
During the implementation the secondary range of the legacy allocator
was initialized with the value of the primary range, hence, when a
Service tried to allocate a new IP on the secondary range, it succeeded
in the new IP allocator but failed when it tried to allocate the same IP
on the legacy allocator, since that one had a different range.
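A simplified sketch of why the mirrored allocation fails when the two
allocators disagree about the secondary range (stand-in types, not the real
allocator code):

    // Dual-write path: allocate in the new allocator, then mirror the same IP
    // into the legacy allocator. If the legacy "secondary" allocator was
    // initialized with the primary range, the mirror write fails.
    package main

    import (
        "fmt"
        "net"
    )

    type rangeAllocator struct {
        cidr *net.IPNet
        used map[string]bool
    }

    func newRangeAllocator(cidr string) *rangeAllocator {
        _, ipnet, err := net.ParseCIDR(cidr)
        if err != nil {
            panic(err)
        }
        return &rangeAllocator{cidr: ipnet, used: map[string]bool{}}
    }

    func (a *rangeAllocator) Allocate(ip net.IP) error {
        if !a.cidr.Contains(ip) {
            return fmt.Errorf("%s is not in range %s", ip, a.cidr)
        }
        if a.used[ip.String()] {
            return fmt.Errorf("%s already allocated", ip)
        }
        a.used[ip.String()] = true
        return nil
    }

    func allocateDualWrite(newAlloc, legacyAlloc *rangeAllocator, ip net.IP) error {
        if err := newAlloc.Allocate(ip); err != nil {
            return err
        }
        return legacyAlloc.Allocate(ip) // mirror into the legacy allocator
    }

    func main() {
        newSecondary := newRangeAllocator("2001:db8::/112")
        // Bug: the legacy secondary allocator got the primary (IPv4) range.
        legacySecondary := newRangeAllocator("10.0.0.0/24")
        err := allocateDualWrite(newSecondary, legacySecondary, net.ParseIP("2001:db8::5"))
        fmt.Println(err) // 2001:db8::5 is not in range 10.0.0.0/24
    }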
Expand the integration test that runs over all the combinations of
Service ClusterIP possibilities to also run with all the possible
combinations of the feature gates.
The integration test needs to change the way it starts the apiserver,
otherwise it will time out.
Real devices are likely to have a handful of attributes and (for GPUs) the
memory as capacity. Most keys will be driver-specific, a few may eventually
have a domain (none are standardized right now).
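For illustration, roughly what such a device could look like, using simplified
stand-in types rather than the real resource.k8s.io API; the attribute keys and
the example domain are hypothetical:

    // A device with a few driver-specific attributes, one domain-qualified
    // attribute, and memory as its only capacity.
    package main

    import "fmt"

    type Device struct {
        Name       string
        Attributes map[string]string // mostly driver-specific keys
        Capacity   map[string]string // quantities, e.g. memory
    }

    func main() {
        gpu := Device{
            Name: "gpu-0",
            Attributes: map[string]string{
                "driverVersion":         "1.2.3",     // driver-specific key
                "gpu.example.com/model": "example-x", // domain-qualified key (hypothetical)
            },
            Capacity: map[string]string{
                "memory": "80Gi",
            },
        }
        fmt.Printf("%+v\n", gpu)
    }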