Currently, some unit tests fail on Windows for various reasons:
- IPVS proxy mode is not supported on Windows.
- pkg/kubelet/cri/remote was moved to cri-client.
This makes the API nicer:
resourceClaims:
- name: with-template
  resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
  resourceClaimName: test-shared-claim
Previously, this was:
resourceClaims:
- name: with-template
  source:
    resourceClaimTemplateName: test-inline-claim-template
- name: with-claim
  source:
    resourceClaimName: test-shared-claim
A longer-term benefit is that other, future alternatives
might not fit under the "source" umbrella.
This is a breaking change. It's justified because DRA is still
alpha and will have several other API breaks in 1.31.
The JSON patch approach works, but it is complex. A retry loop is easier to
understand (detect conflict, get new claim, try again). There is one additional
API call (the get), but in practice this scenario is unlikely.
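As a rough illustration (not the actual code), such a retry loop could be built
on client-go's conflict helper; the resource group/version (v1alpha2) and the
mutated ReservedFor field below are assumptions for the sketch:

package main

import (
	"context"

	resourcev1alpha2 "k8s.io/api/resource/v1alpha2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// reserveClaim sketches the retry loop: read the latest claim, apply the
// change, and let RetryOnConflict re-run the whole function whenever the
// update fails with a 409 Conflict. Illustrative only, not the exact
// scheduler/controller code.
func reserveClaim(ctx context.Context, c kubernetes.Interface, ns, name string,
	ref resourcev1alpha2.ResourceClaimConsumerReference) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		claim, err := c.ResourceV1alpha2().ResourceClaims(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		claim.Status.ReservedFor = append(claim.Status.ReservedFor, ref)
		_, err = c.ResourceV1alpha2().ResourceClaims(ns).UpdateStatus(ctx, claim, metav1.UpdateOptions{})
		return err // only a Conflict error triggers another attempt
	})
}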
There was a race caused by having to update claim finalizer and status in two
different operations:
- Resource claim controller removes allocation, does not yet
get to remove the finalizer.
- Scheduler prepares an allocation, without adding the finalizer
  because it is already there.
- Controller removes finalizer.
- Scheduler adds allocation.
This is an invalid state. Automatic checking found this during the execution of
the "with translated parameters on single node.*supports sharing a claim
sequentially" E2E test, but only when run stand-alone. When running in
parallel (as in the CI), the bad outcome of the race did not occur.
The fix is to check that the finalizer is still set when adding the
allocation. The apiserver doesn't check that because it doesn't know which
finalizer goes with the allocation result. It could check for "some finalizer",
but that is not guaranteed to be correct (could be some unrelated one).
Checking the finalizer can only be done with a JSON patch. Despite the
complications, having the ability to add multiple pods concurrently to
ReservedFor seems worth it (avoids expensive rescheduling or a local retry
loop).
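For illustration only, a conditional patch along those lines might look like
the sketch below; the finalizer name, its index in the list, and the consumer
fields are made-up placeholders, not the real scheduler values:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// addReservation sketches the conditional patch: the "test" op makes the
// apiserver reject the whole patch if the expected finalizer is no longer at
// the given index, so the reservation is only written while the claim is
// still protected.
func addReservation(ctx context.Context, c kubernetes.Interface, ns, claimName string,
	finalizerIndex int, podName, podUID string) error {
	patch := []byte(fmt.Sprintf(`[
  {"op": "test", "path": "/metadata/finalizers/%d", "value": "resource.kubernetes.io/delete-protection"},
  {"op": "add", "path": "/status/reservedFor/-", "value": {"resource": "pods", "name": %q, "uid": %q}}
]`, finalizerIndex, podName, podUID))
	// "add" with "-" appends to the list; this assumes status.reservedFor exists.
	_, err := c.ResourceV1alpha2().ResourceClaims(ns).Patch(
		ctx, claimName, types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	return err
}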
The resource claim controller doesn't need this; it can do a normal update,
which implicitly checks the ResourceVersion.
MultiCIDRServiceAllocator implements a new ClusterIP allocator based on
IPAddress objects to solve the problems and limitations of the
existing bitmap allocators.
However, during the rollout of new versions, deployments need to support
a skew of one version between kube-apiservers. To avoid the possibility
that Service requests handled by different (skewed) apiservers allocate
the same IP to different Services, the new allocator implements a
dual-write strategy, controlled by the DisableAllocatorDualWrite
feature gate.
After MultiCIDRServiceAllocator goes GA, DisableAllocatorDualWrite
can be enabled safely, since all apiservers will be running the new
allocator. The graduation of DisableAllocatorDualWrite can also
be used to clean up the opaque API object that contains the old bitmaps.
If MultiCIDRServiceAllocator is enabled, DisableAllocatorDualWrite is
disabled, and the cluster is a new environment, no bitmap object has been
created yet, so the apiserver initializes one in order to be able to write to it.
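For example, during the skewed rollout an apiserver would typically run with
the new allocator enabled and dual-write still active (gate names as above,
other flags elided):

kube-apiserver ... --feature-gates=MultiCIDRServiceAllocator=true,DisableAllocatorDualWrite=false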
The current results with 100 workers and 15k Services on an n2-standard-48 (48 vCPUs, 192 GB RAM) are:
Old allocator:
perf_test.go:139: [RESULT] Duration 1m9.646167533s: [quantile:0.5 value:0.462886801 quantile:0.9 value:0.496662838 quantile:0.99 value:0.725845905]
New allocator:
perf_test.go:139: [RESULT] Duration 2m12.900694343s: [quantile:0.5 value:0.481814448 quantile:0.9 value:1.3867615469999999 quantile:0.99 value:1.888190671]
The new allocator has higher latency but, in contrast, allows a much
larger number of Services: when tested with 65k Services, the old
allocator crashes etcd with a storage exceeded error.
The scenario is also not realistic, as a continuous, high load of
Service creation is not expected.
ServiceCIDRs are protected by finalizers and their CIDR fields are
immutable once set; only the readiness state impacts the allocator,
since it can only allocate IPs if at least one ServiceCIDR is ready.
Add/Update events trigger a reconciliation of the current state of the
ServiceCIDRs present in the informers against the existing IP
allocators.
Delete events are handled directly to update or delete the
corresponding IP allocator, as sketched below.
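A hedged sketch of that event wiring with a client-go shared informer; the
helper names (installHandlers, deleteAllocatorFor) and the single reconcile
queue key are assumptions for illustration, not the actual controller code:

package main

import (
	networkingv1alpha1 "k8s.io/api/networking/v1alpha1"
	networkinginformers "k8s.io/client-go/informers/networking/v1alpha1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// reconcileKey is a single queue key: any Add/Update collapses into one full
// reconciliation of the informer state against the IP allocators.
const reconcileKey = "reconcile-servicecidrs"

// installHandlers wires the event handlers described above; deleteAllocatorFor
// stands in for the controller's direct cleanup of the matching allocator.
func installHandlers(inf networkinginformers.ServiceCIDRInformer, queue workqueue.Interface,
	deleteAllocatorFor func(*networkingv1alpha1.ServiceCIDR)) {
	inf.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			queue.Add(reconcileKey) // reconcile all ServiceCIDRs vs. allocators
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			queue.Add(reconcileKey) // readiness changes also go through reconciliation
		},
		DeleteFunc: func(obj interface{}) {
			// Handled directly, no reconciliation round-trip
			// (tombstone handling elided for brevity).
			if cidr, ok := obj.(*networkingv1alpha1.ServiceCIDR); ok {
				deleteAllocatorFor(cidr)
			}
		},
	})
}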