The idempotency.go (perhaps not so accurately named) contains
API calls that kubeadm does against an API server using client-go.
Some users seem to have unstable setups where for unknown reasons
the API server can be unavailable or refuse to respond as expected.
Use PollUntilContextTimeout in all exported functions to ensure
such API calls are all retry-able.
NOTE: The context passed to PollUntilContextTimeout is not propagated
in the polled function. Instead the poll function creates it's own
context 'ctx := context.Background()', this is to avoid
breaking expectations on the side of the callers, that expect
a certain type of error and not "context timeout" errors.
Additional changes:
- Make all context.TODO() -> context.Background()
- Update all unit tests and make sure during testing the retry
interval and timeout are short. Test coverage of idempotency.go
is at ~97%.
- Remove the TestMutateConfigMapWithConflict test. It does not
contribute much, because conflict handling is done at the API,
server side, not on the side of kubeadm. This simulating this is not
needed.
Previous v1beta4 work added support for
ClusterConfiguration.EncryptionAlgorithm, however the possible
values were limited to just "RSA" (2048 key size) and "ECDSA" (P256).
Allow more arbitrary algorithm types, that can also include key size
or curve type encoded in the name:
"RSA-2048" (default), "RSA-3072", "RSA-4096" or "ECDSA-P256".
Update the deprecation notice of the PublicKeysECDSA FeatureGate
as ideally it should be removed only after v1beta3 is removed.
- Update the logic in checks.go to separate serial and parallel image
pulls.
- Add a new CRI function PullImagesInParallel() with a private
implementation.
- Unit test the private implementation.
- Update other unit tests in checks_test.go.
Intentionally pass a new context to this API call.
This will let the API call run independently of the parent
context timeout, which is quite short and can cause the API
call to return abruptly.
If some code is about to go over the context deadline,
"x/time/rate/rate.go" would return and untyped error with the string
"would exceed context deadline". If some code already exceeded
the deadline the error would be of type DeadlineExceeded.
Ignore such context errors and only store API and connectivity errors.
- Name the function GetConfigMapWithShortRetry to be
easier to understand that the function is with a very short timeout.
Add note that this function should be used in cases there is a
fallback to local config.
- Apply custom hardcoded interval of 50ms and timeout of 350ms to it.
Previously the fucntion used exp backoff with 5 steps up to ~340ms.
Switch to PollUntilContextTimeout() everywhere to allow
usage of the exposed timeouts in the kubeadm API. Exponential backoff
options are more difficult to expose in this regard and a bit too
detailed for the common user - i.e. have "steps", "factor" and so on.
Replace the usage of the deprecated wait.Poll() and
wait.PollImmediate() functions with wait.PollUntilContextTimeout().
Since we don't have piping of context around kubeadm,
use context.Background() everywhere.
Some wait.Poll() functions were converted to "immediate" as there
is no point for them to not be. This is done for consistency.
Replace the only instance of wait.JitterUntil with
wait.PollUntilContextTimeout. JitterUntil is not deprecated
but this is also done for consistency.
Currently, timeouts are only accessible if a kubeadm runtime.Object{}
like InitConfiguration is passed around.
Any time a config is loaded or defaulted, store the Timeouts
structure in a thread-safe way in the main kubeadm API package
with SetActiveTimeouts(). Optionally, a deep-copy can be
performed before calling SetActiveTimeouts(). Make this struct
accessible with GetActiveTimeouts(). Ensure these functions
are thread safe.
On init() make sure the struct is defaulted, so that unit
tests can work with these values.
When upconverting from v1beta3 to v1beta4, it appears there is no
easy way to migrate some of the timeout values such as:
ClusterConfiguration.APIServer.TimeoutForControlPlane
to a new location:
InitConfiguration.Timeouts.<some-timeout-field>
Yes, the internal InitConfiguratio does embed a ClusterConfiguration,
but during conversion the ClusterConfiguration is converted from an
empty source.
K8s' API machinery has ways to register custom conversion functions,
such as v1beta3.ClusterConfiguration -> internal.InitConfiguration,
but these must be triggered explicitly with a decoder.
The overall migration of fields seems very awkward.
There might be hacks around that, such as storing intermediate state,
while trying to make the fuzzer rountrip happy, but instead
mutation functions can be implemented for the internal types when
calling kubeadm's migrate code. This seems much cleaner.
The function TryRunCommand() uses an exponential backoff,
which is good, but it's inconsistent and only used in a couple
of places.
Remove its usage in the token.go#UpdateOrCreateTokens()
and switch to using the standard function used in other places -
PollUntilContextTimeout().
Remove wait.go#TryRunCommand(), as there are no other usages.
The function wait.go#WaitForKubeletAndFunc() has been used in
a number of places in kubeadm. It starts a go routine to wait for
the kubelet /healthz and in parallel starts another go routine
to wait for an custom function.
This logic is problematic. If kubeadm is waiting for the kubelet
in parallel with something that requires the kubelet, the right
solution would be to first wait for the kubelet in serial and only
then proceed with the other action. The parallelism here particularly
during "init" required a unwanted "initial timeout" of 40s, before
the kubelet waiting even starts. In most cases, this makes the kubelet
waiter to not even start, while the main point of waiting becomes
the "other action".
- Remove the function WaitForKubeletAndFunc() from the Waiter interface.
- Rename the function WaitForHealthyKubelet() to just WaitForKubelet()
to be consistent with the naming WaitForAPI().
- Update WaitForKubelet() to not use TryRunCommand() and instead
use PollUntilContextTimeout().
- Remove the "initial timeout" of 40s in WaitForKubelet().
- Make both WaitForKubelet() and WaitForAPI() use similar error
handling and output.
- Update all usage of WaitForKubelet() to be a serial call before
any other action, such as another wait* call.
- Make the default wait timeout for the kubelet
/healthz to be 1 minute (kubeadmconstants.DefaultKubeletTimeout).
- Apply updates to all implementations of the Waiter interface.
Add v1beta4.ClusterConfiguration.EncryptionAlgorithm field (string)
and allow the user to configure the cluster asymetric encryption
algorithm to be either "RSA" (default, 2048 pkey size) or "ECDSA" (P-256).
Add validation and fuzzing. Conversion from v1beta3 is not required
because an empty field value is accepted and defaulted to RSA if needed.
Leverage the existing configuration option (feature gate) PublicKeysECDSA
but rename the backend fields, arguments, function names to be more
generic - EncryptionAlgorithm instead of PublicKeyAlgorithm.
That is because once the feature gate is enabled the algorithm
configuration also applies to private keys. It also uses the kubeadm API
type (string) instead of the x509.PublicKeyAlgorithm enum (int).
Deprecate the PublicKeysECDSA feature gate with a message.
It should be removed with the release of v1beta4 or maximum one release
later (it is an alpha FG).