The Azure cloud provider mediates VMSS VMs API accesses through a
single cache that holds and refreshes all VMSS together.
As a result we hit the VMSSVM.List API more often than necessary: an
instance's cache miss or expiration should only require re-listing a
single VMSS, while it currently costs O(n) calls relative to the
number of attached Scale Sets.
Under heavy pressure (clusters with so many attached VMSS that they
can't all be listed in one sequence of successive API calls) the
controller manager can get stuck in a loop: it tries to re-list
everything from scratch, aborts the whole operation, then retries and
re-triggers API rate limits, affecting the whole Subscription.
This patch replaces the global VMSS VMs cache with per-VMSS VMs caches.
Refreshes (VMSS VMs lists) are scoped to the single relevant VMSS; under
severe throttling the various caches can be incrementally refreshed.
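The per-VMSS scoping described above can be sketched roughly as follows.
This is a minimal illustration, not the cloud provider's actual code: the
`perVMSSCache` type, the `vmLister` callback (a stand-in for the
VMSSVM.List call) and the TTL handling are all assumptions made for the
example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// vmLister is a hypothetical stand-in for one VMSSVM.List API call.
type vmLister func(scaleSet string) ([]string, error)

// cacheEntry holds the VMs of a single scale set and its own expiry.
type cacheEntry struct {
	vms     []string
	expires time.Time
}

// perVMSSCache keeps one independently expiring entry per scale set.
type perVMSSCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	list    vmLister
	entries map[string]cacheEntry
}

func newPerVMSSCache(ttl time.Duration, list vmLister) *perVMSSCache {
	return &perVMSSCache{ttl: ttl, list: list, entries: map[string]cacheEntry{}}
}

// get returns the VMs of one scale set. On a miss or expiry it re-lists
// only that scale set; a failure leaves the other entries untouched.
// For simplicity the lock is held during the list call.
func (c *perVMSSCache) get(scaleSet string) ([]string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[scaleSet]; ok && time.Now().Before(e.expires) {
		return e.vms, nil
	}
	vms, err := c.list(scaleSet)
	if err != nil {
		return nil, err
	}
	c.entries[scaleSet] = cacheEntry{vms: vms, expires: time.Now().Add(c.ttl)}
	return vms, nil
}

func main() {
	calls := map[string]int{}
	c := newPerVMSSCache(time.Minute, func(ss string) ([]string, error) {
		calls[ss]++
		return []string{ss + "_0", ss + "_1"}, nil
	})
	c.get("vmss-a")
	c.get("vmss-a") // served from cache, no second list call
	c.get("vmss-b") // lists vmss-b only
	fmt.Println(calls["vmss-a"], calls["vmss-b"]) // prints "1 1"
}
```

Because each entry expires and fails on its own, a throttled list call for
one scale set does not discard what was already fetched for the others.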
Signed-off-by: Benjamin Pineau <benjamin.pineau@datadoghq.com>
Clamp the max cpu.shares to the maximum value allowed by the kernel.
This is not an issue when using cgroupfs, as the kernel makes sure
the value is not out of range and automatically clamps it, but
systemd has an additional check that prevents the cgroup creation.
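A clamp of this kind might look like the sketch below. The function and
constant names are illustrative rather than taken from the patch; the
bounds assume the kernel's cgroup v1 cpu.shares range of 2 to 262144 and
the usual 1024 shares per CPU.

```go
package main

import "fmt"

const (
	minShares     = 2      // assumed kernel minimum for cpu.shares
	maxShares     = 262144 // assumed kernel maximum for cpu.shares
	sharesPerCPU  = 1024
	milliCPUToCPU = 1000
)

// milliCPUToShares converts a milli-CPU amount to cpu.shares and clamps
// the result to [minShares, maxShares], so that the value handed to the
// cgroup manager is never out of range.
func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU <= 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return uint64(shares)
}

func main() {
	fmt.Println(milliCPUToShares(500))    // 512
	fmt.Println(milliCPUToShares(512000)) // 262144 (clamped from 524288)
}
```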
Closes: https://github.com/kubernetes/kubernetes/issues/92855
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Previously, it was possible for reusable CPUs and reusable devices (i.e.
those previously consumed by init containers) to not be reused by
subsequent init containers or app containers if the TopologyManager was
enabled. This would happen because hint generation for the
TopologyManager was not considering the reusable devices when it made
its hint calculation.
As such, it would sometimes:
1) Generate a hint for a different NUMA node, causing the CPUs and
devices to be allocated from that node instead of the one where the
reusable devices live; or
2) End up thinking there were not enough CPUs or devices to allocate and
throw a TopologyAffinity admission error.
This patch fixes this by ensuring that reusable CPUs and devices are
considered as part of TopologyHint generation. This functionality is
difficult to unit test since it spans multiple components, but an e2e
test will be added in a subsequent patch to cover it.
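The two failure modes can be illustrated with a deliberately simplified
model. This is not the TopologyManager's hint generation code: the
`device` type and `candidateNodes` function are hypothetical, and real
hints are bitmasks over NUMA nodes rather than a list of single nodes.

```go
package main

import (
	"fmt"
	"sort"
)

// device is a hypothetical device pinned to one NUMA node.
type device struct {
	id   string
	numa int
}

// candidateNodes returns the NUMA nodes that can satisfy the request on
// their own, counting both free devices and devices that are reusable
// because an init container has released them.
func candidateNodes(free, reusable []device, request int) []int {
	perNode := map[int]int{}
	for _, d := range free {
		perNode[d.numa]++
	}
	for _, d := range reusable {
		perNode[d.numa]++
	}
	var nodes []int
	for n, count := range perNode {
		if count >= request {
			nodes = append(nodes, n)
		}
	}
	sort.Ints(nodes)
	return nodes
}

func main() {
	// An init container used two devices on node 0; node 1 has one free.
	free := []device{{"dev2", 1}}
	reusable := []device{{"dev0", 0}, {"dev1", 0}}

	// Ignoring reusable devices: no node fits, so admission would fail.
	fmt.Println(len(candidateNodes(free, nil, 2))) // 0

	// Counting reusable devices: node 0 is a valid candidate.
	fmt.Println(candidateNodes(free, reusable, 2)) // [0]
}
```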
The `default-go-version` field specifies the go version used for the
master branch, and for any release branch whose go version is not
explicitly specified.
This commit also uses go 1.14.6 for the `release-1.19` branch.
Adds a check that returns an index out of bounds error instead of
panicking when passing a container to kubectl exec.
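The general pattern, returning an error rather than letting an
out-of-range index panic, is sketched below. The message above does not
say which slice is being indexed, so the `containerAt` helper and its
arguments are purely hypothetical and do not reflect kubectl's code.

```go
package main

import "fmt"

// containerAt is a hypothetical helper: it validates idx before indexing
// so that a bad value yields an error instead of a runtime panic.
func containerAt(containers []string, idx int) (string, error) {
	if idx < 0 || idx >= len(containers) {
		return "", fmt.Errorf("index %d out of bounds for %d container(s)", idx, len(containers))
	}
	return containers[idx], nil
}

func main() {
	containers := []string{"app"}
	if _, err := containerAt(containers, 1); err != nil {
		fmt.Println("error:", err)
	}
}
```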
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>
Updates the sig-scheduling e2e Nvidia GPU tests to install drivers
using a local manifest by default. Currently the DaemonSet is fetched
from the GoogleCloudPlatform/container-engine-accelerators repo by
default. Using a local manifest allows for manually specifying the
cos-gpu-installer image rather than always using latest. A remote
manifest can still be fetched by setting the
NVIDIA_DRIVER_INSTALLER_DAEMONSET env var.
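The selection logic described above could be sketched like this. Only
the NVIDIA_DRIVER_INSTALLER_DAEMONSET variable name comes from the
message; the `driverInstallerManifest` function and the local manifest
path are placeholders invented for the example.

```go
package main

import (
	"fmt"
	"os"
)

// localManifestPath is a placeholder, not the real path in the repo.
const localManifestPath = "path/to/local/nvidia-driver-installer.yaml"

// driverInstallerManifest returns the manifest to use and whether it is
// remote. The env lookup is injected so the logic is easy to exercise.
func driverInstallerManifest(getenv func(string) string) (source string, remote bool) {
	if url := getenv("NVIDIA_DRIVER_INSTALLER_DAEMONSET"); url != "" {
		return url, true
	}
	return localManifestPath, false
}

func main() {
	source, remote := driverInstallerManifest(os.Getenv)
	fmt.Printf("manifest=%s remote=%t\n", source, remote)
}
```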
Signed-off-by: hasheddan <georgedanielmangum@gmail.com>