v1beta4 added the Timeouts struct and an EtcdAPICall timeout
field, but it was never used in the etcd client calls.
This is a bug, so it should be fixed. We also reduced the
timeout from a 200-second exponential backoff to a 2-minute
linear default timeout.
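As a sketch of the fix, assuming the value is plumbed through from the
v1beta4 Timeouts.EtcdAPICall field (the helper below is illustrative,
not the actual kubeadm code):

    package etcdutil

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    // listMembers applies a caller-supplied timeout to a single etcd API
    // call via the context instead of a ~200s exponential backoff loop.
    // kubeadm would pass Timeouts.EtcdAPICall.Duration here (2 minutes
    // by default).
    func listMembers(cli *clientv3.Client, timeout time.Duration) error {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        defer cancel()

        resp, err := cli.MemberList(ctx)
        if err != nil {
            return fmt.Errorf("etcd MemberList: %w", err)
        }
        fmt.Printf("etcd cluster has %d members\n", len(resp.Members))
        return nil
    }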
Compiling a CEL expression used to always run cost estimation, whether
or not the caller needed the result. Now callers can skip it; the
scheduler does so through the CEL cache.
The main advantage is that failures in the estimator (like panics) are limited
to the apiserver. Performance in the scheduler is not expected to benefit much
because compilation results are cached.
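As a toy sketch of the new shape (the names and signature are
hypothetical, not the actual DRA CEL compiler API):

    package main

    import "fmt"

    // result stands in for a CEL compilation result; a zero cost means
    // "not estimated".
    type result struct {
        ast     string
        maxCost uint64
    }

    // compile sketches the opt-out: the expensive (and panic-prone) cost
    // estimation runs only when the caller asks for it. The apiserver
    // asks, because admission compares the estimate against a cost limit;
    // the scheduler's CEL cache does not.
    func compile(expr string, withCostEstimate bool) result {
        res := result{ast: expr} // stands in for parse + type-check
        if withCostEstimate {
            res.maxCost = uint64(len(expr)) // stands in for the estimator
        }
        return res
    }

    func main() {
        fmt.Println(compile(`device.driver == "example.com"`, true))
        fmt.Println(compile(`device.driver == "example.com"`, false))
    }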
This makes the script identical to current
master (f3cbd79db7). This is needed
because pull-kubernetes-apidiff-client-go is the same for all
branches and assumes that the script automatically determines
the diff based on Prow environment variables.
Not implementing a size estimator had the effect that strings retrieved
from the attributes were treated as having an "unknown size", leading to
wildly overestimated costs and validation errors even for simple
expressions like this one:
device.attributes["qat.intel.com"].services.matches("[^a]?sym")
The maximum number of elements in maps and the maximum length of the
driver name string were also ignored and missing, respectively.
Pre-defined types like apiservercel.StringType must be avoided because
they are defined as having a zero maximum size.
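To illustrate why the bounds matter, here is a toy model of the
estimator's arithmetic (not the cel-go API; the 64-character bound is an
assumption standing in for the real attribute limits):

    package main

    import "fmt"

    // sizeEstimate is a toy stand-in for the size bounds a CEL cost
    // estimator tracks per value.
    type sizeEstimate struct{ min, max uint64 }

    const unknown = ^uint64(0) // no size estimator => "unknown size"

    // matchesCost is a toy worst-case cost of string.matches(regex),
    // which grows with the length of the input string.
    func matchesCost(input sizeEstimate) uint64 {
        if input.max == unknown {
            return unknown // wildly overestimated => validation error
        }
        return input.max // finite once attribute strings are bounded
    }

    func main() {
        fmt.Println(matchesCost(sizeEstimate{0, unknown})) // before the fix
        fmt.Println(matchesCost(sizeEstimate{0, 64}))      // with a bound
    }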
- Use environment variables to pass string arguments to the node log
  query PowerShell command (see the sketch after this list)
- Split getLoggingCmd into getLoggingCmdEnv and getLoggingCmdArgs
  for better modularization
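A sketch of the resulting pattern (the exact PowerShell pipeline in the
kubelet differs; what matters is that the user-controlled string reaches
PowerShell as $env:KUBELET_SERVICE instead of being spliced into the
command line):

    package main

    import (
        "context"
        "fmt"
        "os"
        "os/exec"
    )

    // getLoggingCmdEnv passes the user-supplied service name through the
    // environment so it can never be re-parsed as PowerShell syntax.
    func getLoggingCmdEnv(service string) []string {
        return append(os.Environ(), "KUBELET_SERVICE="+service)
    }

    // getLoggingCmdArgs builds a fixed argument list that only
    // references the environment variable.
    func getLoggingCmdArgs() []string {
        return []string{
            "-NoProfile", "-NonInteractive", "-Command",
            "Get-WinEvent -ProviderName $env:KUBELET_SERVICE | Format-Table -AutoSize",
        }
    }

    func main() {
        cmd := exec.CommandContext(context.Background(), "powershell", getLoggingCmdArgs()...)
        cmd.Env = getLoggingCmdEnv("kubelet")
        out, err := cmd.CombinedOutput()
        fmt.Println(string(out), err)
    }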
We want to be sure that the maximum number of pods per claim can
actually be scheduled concurrently. Previously the test only checked
that they all ran eventually.
Running 256 pods only works on more than 2 nodes, so network-attached
resources have to be used; this is what the increased limit is meant for
anyway. Because of the tightened validation of node selectors in 1.32,
the E2E test has to use MatchExpressions, which allow listing node names
(see the sketch below).
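For reference, the selector shape the test relies on, sketched with
k8s.io/api/core/v1 types (the node names are placeholders):

    package example

    import v1 "k8s.io/api/core/v1"

    // nodeSelector pins scheduling to an explicit set of nodes:
    // MatchExpressions on kubernetes.io/hostname may enumerate node
    // names, which the tightened 1.32 validation still permits.
    var nodeSelector = &v1.NodeSelector{
        NodeSelectorTerms: []v1.NodeSelectorTerm{{
            MatchExpressions: []v1.NodeSelectorRequirement{{
                Key:      "kubernetes.io/hostname",
                Operator: v1.NodeSelectorOpIn,
                Values:   []string{"node-1", "node-2", "node-3"}, // placeholders
            }},
        }},
    }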
The original limit of 32 seemed sufficient for a single GPU on a node. But for
shared non-local resources it is too low. For example, a ResourceClaim might be
used to allocate an interconnect channel that connects all pods of a workload
running on several different nodes, in which case the number of pods can be
considerably larger.
256 is high enough for currently planned systems. If we need something even
higher in the future, an alternative approach might be needed to avoid
scalability problems.
Normally, increasing such a limit would have to be done incrementally over two
releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to
make an exception and apply this change to current master for 1.33 and backport
it to the next 1.32.x patch release for production usage.
This breaks downgrades to a 1.32 release without this change if there
are ResourceClaims with more than 32 consumers in ReservedFor. In
practice, this breakage is very unlikely: no workloads need that many
consumers yet, and such downgrades to a previous patch release are also
unlikely. Downgrades to 1.31 already weren't supported when using DRA
v1beta1.
The WatchListClient feature is enabled for kube-controller-manager, but
the namespace controller lacks the necessary "watch" permission. This
results in 30 error logs being generated every time a namespace is
deleted, and in a fallback to the standard LIST semantics.
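A sketch of the missing grant using k8s.io/api/rbac/v1 (the wildcards
are illustrative; the actual bootstrap policy for
system:controller:namespace-controller is narrower):

    package example

    import rbacv1 "k8s.io/api/rbac/v1"

    // The namespace controller lists every resource in a namespace while
    // deleting it, and WatchListClient turns those LISTs into WATCHes, so
    // "watch" has to accompany "list" to avoid the fallback and the
    // error logs.
    var namespaceControllerRule = rbacv1.PolicyRule{
        APIGroups: []string{rbacv1.APIGroupAll},
        Resources: []string{rbacv1.ResourceAll},
        Verbs:     []string{"get", "list", "watch", "delete", "deletecollection"},
    }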
Signed-off-by: Quan Tian <quan.tian@broadcom.com>
go1.24 removes the x509sha1 GODEBUG variable, and with it the
support for SHA-1 signed certs. This commit alters the regex
in unit tests to account for that and to prepare for go1.24.
Signed-off-by: Madhav Jivrajani <madhav.jiv@gmail.com>