kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-02 15:09:45 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	54878fa373	kata-deploy: add job deployment mode driven by the job-dispatcher Phase 2 of the DaemonSet -> staged-Job migration: add an opt-in `deploymentMode: job` that installs Kata via short-lived, per-node install Jobs instead of the long-running DaemonSet. The DaemonSet remains the default and is now gated behind `deploymentMode == daemonset`. Rather than render one Job per node into the Helm release (which grows the release secret O(nodes) and offers no rollout pacing), job mode ships a single tiny post-install/post-upgrade hook Job that runs the kata-deploy-job-dispatcher. The dispatcher enumerates the selected nodes LIVE from the API server and stamps out one node-pinned install Job per node from a constant-size ConfigMap of Job templates, keeping at most `job.parallelism` in flight and refilling as they finish. This guarantees per-node coverage with a paced rollout while the Helm release stays O(1) regardless of fleet size. New nodes are picked up by re-running `helm upgrade`; there is no always-on component. Each per-node Job runs the staged install pipeline as ordered initContainers and exits: host-check -> artifacts -> cri (initContainers, run sequentially) -> label (main container). The privilege split is explicit: the dispatcher pod is a pure control-plane client (lists nodes, manages Jobs in its own namespace) and runs fully unprivileged under a dedicated, least-privilege ServiceAccount (kata-rbac.yaml); only the per-node Jobs it creates carry the privileged kata-deploy host-mutation rights. Node selection (templates/_helpers.tpl: nodeLabelSelector / perNodeJob): - job.nodes: explicit node-name list passed to the dispatcher, and - job.nodeSelector (equality map) ANDed with - job.nodeSelectorExpressions (k8s label-selector requirements: In / NotIn / Exists / DoesNotExist), compiled into a single label-selector string the dispatcher resolves live. The default expressions target worker (non-control-plane) nodes, so no custom node labeling is required; set the expressions to [] to target all discovered nodes. Reuses the commonEnv/commonVolume* helpers and adds the stageContainer, serviceAccountName, dispatcherServiceAccountName, dispatcherImage and perNodeJob helpers shared by the dispatcher and the staged Jobs. The default (daemonset) render is unchanged. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-06-12 18:58:33 +02:00
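The two mechanisms described in 54878fa373 (compiling the node selection into one label-selector string, and keeping at most `job.parallelism` install Jobs in flight) can be modelled roughly as follows. This is only a sketch under stated assumptions: `compile_selector`, `dispatch`, `create_job` and `poll_finished` are hypothetical names, the real dispatcher talks to the Kubernetes API rather than to callbacks, and the control-plane label in the usage lines is a guess at what the default expressions might look like.

```python
from collections import deque


def compile_selector(equality, expressions):
    """Compile an equality map ANDed with label-selector requirements
    into a single Kubernetes label-selector string."""
    parts = [f"{key}={value}" for key, value in sorted(equality.items())]
    for req in expressions:
        key, op = req["key"], req["operator"]
        values = ",".join(req.get("values", []))
        if op == "In":
            parts.append(f"{key} in ({values})")
        elif op == "NotIn":
            parts.append(f"{key} notin ({values})")
        elif op == "Exists":
            parts.append(key)
        elif op == "DoesNotExist":
            parts.append(f"!{key}")
        else:
            raise ValueError(f"unsupported operator: {op}")
    return ",".join(parts)


def dispatch(nodes, parallelism, create_job, poll_finished):
    """Create one Job per node, never exceeding `parallelism` in flight.

    create_job(node) starts a Job (hypothetical callback).
    poll_finished(in_flight) returns the set of nodes whose Job has ended.
    Returns the peak number of Jobs that were in flight at once.
    """
    pending = deque(nodes)
    in_flight = set()
    peak = 0
    while pending or in_flight:
        # Refill free slots before waiting on the running Jobs.
        while pending and len(in_flight) < parallelism:
            node = pending.popleft()
            create_job(node)
            in_flight.add(node)
        peak = max(peak, len(in_flight))
        in_flight -= poll_finished(in_flight)
    return peak


# Assumed example: select worker nodes by excluding the control-plane label.
selector = compile_selector(
    {"kubernetes.io/os": "linux"},
    [{"key": "node-role.kubernetes.io/control-plane", "operator": "DoesNotExist"}],
)
print(selector)
```

A real loop would sleep between polls and handle failed Jobs; the model leaves both out to keep the pacing logic visible.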
Fabiano Fidêncio	48ebbbec3a	kata-deploy: honor debug mode with CLI log-level Make the chart pass --log-level debug automatically when debug=true so CI and troubleshooting runs emit full rendered config dumps without requiring a separate log-level override. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:25:48 +02:00
Fabiano Fidêncio	b119b051cb	kata-deploy: support drop-in configs for default runtimes Allow operators to provide per-shim drop-in TOML for built-in runtimes and reconcile stale override files so upgrades and migrations remain safe when drop-ins are added or removed. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex	2026-06-08 13:31:03 +02:00
Fabiano Fidêncio	5d3e1e6396	kata-deploy: verify kata-runtime label remains stable on rke2/k3s The retry loop added in `efd468df3f` still allows the install to declare success while inside the kubelet's post-restart re-register window. On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]` every 2 s and returns on the first `True` observation it sees. By default the kubelet only publishes node status every ~10 s, so that first `True` is almost always the stale value from before the restart; the kubelet hasn't actually finished restarting yet. `label_node_with_retry` then applies the label, sleeps 1 s, reads back "true" (still stale, kubelet still down), and returns Ok. Install completes, `/readyz` flips to 200, helm releases its `--wait`, and the bats test starts. Only then does the kubelet finish coming up, re-register the node, and clobber the label with its cached set. The lifecycle test sees an empty `katacontainers.io/kata-runtime` and fails: # Node label katacontainers.io/kata-runtime: not ok 1 Kata artifacts are present on host after install A single-shot verification can't distinguish "still stale true" from "truly stable true after kubelet re-register". Replace it with a stability window: after (re)applying the label, require it to remain at the expected value for STABILITY_CHECKS=6 consecutive observations spaced CHECK_INTERVAL=2 s apart (≈ 12 s, comfortably more than the kubelet's status-update period). If the value ever drifts inside the window, re-apply and restart the stability counter. Bounded by MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s to install. Also add a short polling loop to the test's own label assertion as belt-and-suspenders for any leftover transient race, matching the existing retry pattern used for the container-runtime version check. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-22 11:53:18 +02:00
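The stability window from 5d3e1e6396 can be sketched as below. The three constants come from the commit message; everything else (`label_until_stable`, the `apply_label` and `read_label` callbacks, the injectable `sleep`) is a hypothetical stand-in for the real implementation, which is not shown in this log.

```python
import time

STABILITY_CHECKS = 6      # consecutive matching observations required
CHECK_INTERVAL = 2.0      # seconds between observations
MAX_APPLY_ATTEMPTS = 12   # upper bound on (re)applications of the label


def label_until_stable(apply_label, read_label, expected, sleep=time.sleep):
    """Apply a label and require it to stay at `expected` for a full window.

    Returns the number of apply attempts used. Raises TimeoutError when the
    label never survives a complete window within MAX_APPLY_ATTEMPTS.
    """
    for attempt in range(1, MAX_APPLY_ATTEMPTS + 1):
        apply_label()
        stable = 0
        while stable < STABILITY_CHECKS:
            sleep(CHECK_INTERVAL)
            if read_label() != expected:
                break  # drifted inside the window: re-apply, restart counter
            stable += 1
        else:
            return attempt  # window completed without any drift
    raise TimeoutError("label did not stabilise within the allowed attempts")
```

With these constants the happy path waits 6 x 2 s = 12 s, matching the figure in the commit message; a single stale `true` reading can no longer end the install early, because five more matching reads must follow it.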
Fabiano Fidêncio	9e3bd6b576	tests: fix kata-deploy lifecycle test reliability Fix two issues in kata-deploy-lifecycle.bats that caused failures on k3s, k0s and rke2: run_on_host(): - `kubectl run --rm -i` causes k3s/rke2 to inject session-recording banners into stdout, polluting command output and breaking string assertions. Replace with a create/wait/logs/delete sequence so only the container's actual stdout is captured. "Artifacts are fully cleaned up after uninstall": - After a CRI restart the kubelet may briefly report "Unknown" for the container runtime version. Retry for up to 60s before asserting. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-03 22:09:08 +02:00
Fabiano Fidêncio	ed4f6ebc9e	tests: use readiness probes to wait for kata-deploy install Now that kata-deploy has a proper readiness probe (/readyz returns 200 only after install completes), replace the ad-hoc wait strategies with kubectl wait --for=condition=Ready on the kata-deploy pods. Note: helm --wait is ineffective for single-node clusters with maxUnavailable=1 (the DaemonSet is considered ready with 0 ready pods), so the CI uses kubectl wait on the pod readiness condition directly. gha-run-k8s-common.sh: - Drop the waitForProcess polling loop for Running pods - Drop the `sleep 60s` with its FIXME comment - Add kubectl wait --for=condition=Ready instead helm-deploy.bash: - Drop the extra `kubectl rollout status` after helm - Drop the `sleep 60` - The existing --wait on the helm command now suffices Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-03 22:09:08 +02:00
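The single-node caveat in ed4f6ebc9e comes down to simple arithmetic. The sketch below assumes a rollout check of the form `ready >= desired - maxUnavailable`; that form is consistent with the commit's description but is an assumption here, not something taken from Helm's source, and the function name is hypothetical.

```python
def daemonset_counts_as_ready(desired, ready, max_unavailable):
    """Assumed rollout rule: enough pods are ready once the shortfall
    does not exceed max_unavailable."""
    return ready >= desired - max_unavailable


# Single-node cluster: 1 desired pod, 0 ready, maxUnavailable=1.
# 0 >= 1 - 1 holds, so the rollout looks complete before install finishes.
single_node = daemonset_counts_as_ready(desired=1, ready=0, max_unavailable=1)

# Three-node cluster with one ready pod: 1 >= 3 - 1 does not hold.
three_nodes = daemonset_counts_as_ready(desired=3, ready=1, max_unavailable=1)
```

Waiting on each pod's own Ready condition avoids this, since that condition depends only on the pod's readiness probe.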
Fabiano Fidêncio	b7eb3ae402	tests: Fix shellcheck issues in helm-deploy.bash Address shellcheck warnings including proper variable quoting, use of [[ ]] over [ ], declaring and assigning variables separately, and adding appropriate shellcheck disable directives where needed. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Made-with: Cursor	2026-04-24 08:14:08 +02:00
Fabiano Fidêncio	143f9a7882	tests: Fix shellcheck issues in run-kata-deploy-tests.sh Fix shellcheck warnings and notes identified by running shellcheck --severity=style. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-24 08:14:08 +02:00
Fabiano Fidêncio	140e08044f	tests: Fix shellcheck issues in gha-run.sh Fix shellcheck warnings and notes identified by running shellcheck --severity=style. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-24 08:14:08 +02:00
Fabiano Fidêncio	cf1e6f82f2	tests: Show full kata-deploy pod logs in CI Remove --tail=N limits from `kubectl logs` for kata-deploy pods so the complete output is visible in CI job logs for debugging. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-19 13:24:31 +02:00
Fabiano Fidêncio	2131147360	tests: add kata-deploy lifecycle tests for restart resilience and cleanup Add functional tests that cover two previously untested kata-deploy behaviors: 1. Restart resilience (regression test for #12761): deploys a long-running kata pod, triggers a kata-deploy DaemonSet restart via rollout restart, and verifies the kata pod survives with the same UID and zero additional container restarts. 2. Artifact cleanup: after helm uninstall, verifies that RuntimeClasses are removed, the kata-runtime node label is cleared, /opt/kata is gone from the host filesystem, and containerd remains healthy. 3. Artifact presence: after install, verifies /opt/kata and the shim binary exist on the host, RuntimeClasses are created, and the node is labeled. Host filesystem checks use a short-lived privileged pod with a hostPath mount to inspect the node directly. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-04-01 15:20:53 +02:00
Fabiano Fidêncio	56c3618c1d	tests: kata-deploy: wait for API recovery after uninstall kata-deploy's SIGTERM cleanup restarts the CRI runtime, which on k3s/rke2 takes down the API server temporarily. The helm uninstall may complete with errors, and the next test suite would start with a dead API. Add a wait loop after uninstall to ensure the API is available before proceeding. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 11:26:31 +01:00
Fabiano Fidêncio	5c0269881e	tests: Make editorconfig-checker happy - Trim trailing whitespace and ensure final newline in non-vendor files - Add .editorconfig-checker.json excluding vendor dirs, .patch, .img, .dtb, .drawio, *.svg, and pkg/cloud-hypervisor/client so CI only checks project code - Leave generated and binary assets unchanged (excluded from checker) Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-10 21:58:28 +01:00
Fabiano Fidêncio	c9061f9e36	tests: kata-deploy: Increase post-deployment wait time Increase the sleep time after kata-deploy deployment from 10s to 60s to give more time for runtimes to be configured. This helps avoid race conditions on slower K8s distributions like k3s where the RuntimeClass may not be immediately available after the DaemonSet rollout completes. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-04 12:13:53 +01:00
Fabiano Fidêncio	0fb2c500fd	tests: kata-deploy: Merge E2E tests to avoid timing issues Merge the two E2E tests ("Custom RuntimeClass exists with correct properties" and "Custom runtime can run a pod") into a single test, as the two depend heavily on each other. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-04 12:13:53 +01:00
Fabiano Fidêncio	fef93f1e08	tests: kata-deploy: Use die() instead of fail() for error handling Replace fail() calls with die() which is already provided by common.bash. The fail() function doesn't exist in the test infrastructure, causing "command not found" errors when tests fail. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-04 12:13:53 +01:00
Fabiano Fidêncio	26c534d610	tests: Use shims.disableAll in test helpers Update the CI and functional test helpers to use the new shims.disableAll option instead of iterating over every shim to disable them individually. Also adds helm repo for node-feature-discovery before building dependencies to fix CI failures on some distributions. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	d8a3272f85	kata-deploy: Add tests for custom runtimes Helm templates Add Bats tests to verify the custom runtimes Helm template rendering, and that we can start a pod with the custom runtime. Tests were written with Cursor's help. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-26 20:50:01 +01:00
Fabiano Fidêncio	ec18dd79ba	tests: Simplify kata-deploy test to use helm directly The kata-deploy test was using helm_helper which made it hard to debug failures (die() calls would cause "Executed 0 tests" errors) and added unnecessary complexity. The test now calls helm directly like a user would, making it simpler and more representative of real-world usage. The verification job status is explicitly checked with proper failure detection instead of relying on helm --wait. Timeouts are configurable via environment variables to account for different network speeds and image sizes: - KATA_DEPLOY_TIMEOUT (default: 600s) - KATA_DEPLOY_DAEMONSET_TIMEOUT (default: 300s) - KATA_DEPLOY_VERIFICATION_TIMEOUT (default: 120s) Documentation has been added to explain what each timeout controls and how to customize them. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-21 20:14:33 +01:00
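The configurable timeouts listed in ec18dd79ba follow the usual "environment variable with a default" pattern. The variable names and defaults below are the ones from the commit message; the helper itself, and its tolerance for a trailing `s`, are assumptions made for illustration (the actual tests are shell scripts).

```python
import os

# Defaults in seconds, as listed in the commit message.
DEFAULT_TIMEOUTS = {
    "KATA_DEPLOY_TIMEOUT": 600,
    "KATA_DEPLOY_DAEMONSET_TIMEOUT": 300,
    "KATA_DEPLOY_VERIFICATION_TIMEOUT": 120,
}


def timeout_seconds(name, env=None):
    """Return the timeout for `name`, falling back to its default.

    Accepts values such as "900" or "900s" (the suffix handling is an
    assumption, not documented behaviour).
    """
    env = os.environ if env is None else env
    raw = env.get(name, "")
    if raw == "":
        return DEFAULT_TIMEOUTS[name]
    return int(raw.rstrip("s"))


# Example: a slow network overrides only the install timeout.
overrides = {"KATA_DEPLOY_TIMEOUT": "900s"}
install_timeout = timeout_seconds("KATA_DEPLOY_TIMEOUT", overrides)
verify_timeout = timeout_seconds("KATA_DEPLOY_VERIFICATION_TIMEOUT", overrides)
```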
Fabiano Fidêncio	e0158869b1	tests: Add common bats test runner function Add run_bats_tests() function to common.bash that provides consistent test execution and reporting across all test suites (k8s, nvidia, kata-deploy). This removes duplicated test runner code from run_kubernetes_tests.sh, run_kubernetes_nv_tests.sh, and run-kata-deploy-tests.sh. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-20 12:31:55 +01:00
Fabiano Fidêncio	ea18f543b4	tests: kata-deploy: Enable verification during helm install Enable post-install verification in kata-deploy CI tests. When HELM_VERIFY_DEPLOYMENT is set, a simple verification pod is created that runs with the Kata runtime to confirm deployment succeeded. The verification pod prints kernel info and exits - success indicates the Kata runtime is properly configured and functional. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-16 10:52:43 +01:00
Fabiano Fidêncio	5b01eaf929	tests: Align kata-deploy helm's uninstall Let's use the same method both on the kata-deploy and k8s tests. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-11-04 09:29:35 +01:00
Fabiano Fidêncio	3107533953	tests: Adjust to runtimeClass creation by the chart It's just a follow-up on the previous commit where we move away from the runtimeClass creation inside the script, and instead we do it using the chart itself. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2025-11-03 17:32:18 +01:00
Fabiano Fidêncio	60ba121a0d	kata-deploy: nit: Fix test name Just add a "is" there as it was missing. Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>	2025-09-15 15:27:54 +02:00
Aurélien Bombo	96f1d95de5	gha: Remove unnecessary install-azure-cli step az cli is already installed by the azure/login action. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2025-07-30 10:42:56 -05:00
Fabiano Fidêncio	5f17e61d11	tests: kata-deploy: Remove --wait from helm uninstall As we're using a `kubectl wait --timeout ...` to check whether the kata-deploy pod's been deleted or not, let's remove the `--wait` from the `helm uninstall ...` call as k0s tests were failing because the `kubectl wait --timeout...` was starting after the pod was deleted, making the test fail. Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>	2025-07-09 14:01:30 +02:00
Aurélien Bombo	9dd3807467	ci: Use OIDC to log into Azure This completely eliminates the Azure secret from the repo, following the below guidance: https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-azure The federated identity is scoped to the `ci` environment, meaning: * I had to specify this environment in some YAMLs. I don't believe there's any downside to this. * As previously, the CI works seamlessly both from PRs and in the manual workflow. I also deleted the tools/packaging/kata-deploy/action folder as it doesn't seem to be used anymore, and it contains a reference to the secret. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2025-06-06 15:26:10 -05:00
Fabiano Fidêncio	28be53ac92	kata-deploy: Create runtimeclasses by default Let's make the life of the users easier and create the runtimeclasses for them by default. Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>	2025-03-31 11:29:44 +01:00
Fabiano Fidêncio	404e212102	tests: kata-deploy: Use helm_helper() With this we switch to fully testing with helm, instead of testing with the kustomizations (which will soon be removed). Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>	2025-03-26 13:30:15 +01:00
Fabiano Fidêncio	c337a21a4e	shellcheck: kata-deploy: Fix warnings Here we're fixing the few warnings we found in the files present in the functional tests for kata-deploy. Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>	2025-03-05 19:44:27 +01:00
stevenhorsman	c5ff513e0b	shellcheck: Fix shellcheck SC2068 > Double quote array expansions to avoid re-splitting elements Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2025-03-04 09:35:46 +00:00
Stephane Talbot	f80e7370d5	test: Verify deployment of kata-deploy on microk8s Enable functional test to verify deployment of kata-deploy on a MicroK8s cluster Signed-off-by: Stephane Talbot <Stephane.Talbot@univ-savoie.fr>	2025-02-28 10:10:29 +01:00
Beraldo Leal	53b8158a81	tests: adding debug and skip to kata-deploy If a test is failing during setup, it makes little sense to run the suite. Let's skip and add some debug messages. Signed-off-by: Beraldo Leal <bleal@redhat.com>	2024-05-31 13:28:34 -04:00
Beraldo Leal	3e8b4806b8	tests: increase debug messages for kata-deploy When the timeout happens we can't tell much information about the nodes. Signed-off-by: Beraldo Leal <bleal@redhat.com>	2024-05-31 13:28:34 -04:00
Beraldo Leal	c99ba42d62	deps: bumping yq to v4.40.7 Since yq frequently updates, let's upgrade to a version from February to bypass potential issues with versions 4.41-4.43 for now. We can always upgrade to the newest version if necessary. Fixes #9354 Depends-on:github.com/kata-containers/tests#5818 Signed-off-by: Beraldo Leal <bleal@redhat.com>	2024-05-31 13:28:34 -04:00
Fabiano Fidêncio	e81e8a4527	tests: kata-deploy: Adjust timeout 10 minutes is waay too long. Let's give it 4 minutes only. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2024-05-27 06:23:00 +02:00
Fabiano Fidêncio	fba5793c0d	tests: kata-deploy: Run the tests from "${repo_root_dir}" Let's see if it helps with issues like: ``` error: must build at directory: not a valid directory: evalsymlink failure on '"/home/runner/actions-runner/_work/kata-containers/kata-containers/tests/functional/kata-deploy/../../..//tools/packaging/kata-deploy/kata-cleanup/overlays/k0s"' : lstat /home/runner/actions-runner/_work/kata-containers/kata-containers/tests/functional/kata-deploy/": no such file or directory ``` Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2024-05-27 06:23:00 +02:00
Fabiano Fidêncio	8a8a7ea0e5	tests: kata-deploy: Show more logs in the setup() This will also help us to better understand possible failures with the CI. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2024-05-27 05:05:06 +02:00
Fabiano Fidêncio	47d9589e9b	tests: kata-deploy: Show output of passing tests This will help us to debug failures and compare passing and failures outputs. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2024-05-27 05:05:06 +02:00
ChengyuZhu6	e23737a103	gha: refactor code with yq for better clarity refactor code with yq for better clarity: Before: ```bash yq write -i "${tools_dir}/packaging/kata-deploy/kata-deploy/base/kata-deploy.yaml" 'spec.template.spec.containers[0].env[7].value' "${KATA_HYPERVISOR}:${SNAPSHOTTER}" ``` After: ```bash yq write -i \ "${tools_dir}/packaging/kata-deploy/kata-deploy/base/kata-deploy.yaml" \ 'spec.template.spec.containers[0].env[7].value' \ "${KATA_HYPERVISOR}:${SNAPSHOTTER}" ``` Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>	2024-03-19 18:06:00 +01:00
Wainer dos Santos Moschetta	24c163e6e1	tests/kata-deploy: fix checker for kata-deploy running Currently, the check that kata-deploy is running assumes that the daemonset has scheduled at least one pod; however, it might not have, and the kubectl wait command then fails with "error: no matching resources found". On CI I've observed that failing intermittently. I suspect the service account kata-deploy-sa takes a while to show up, so no kata-deploy pod is scheduled in the meantime. Changed the checker logic to use waitForProcess() to keep testing whether it is already running, or hit the timeout (still 10m). Fixes #9183 Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>	2024-02-29 22:26:27 -03:00
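The polling approach adopted in 24c163e6e1 (keep re-running a check until it passes or a deadline expires) can be sketched as follows. This is a generic model, not the project's `waitForProcess()` helper; `wait_for` and its injectable `clock` and `sleep` parameters are hypothetical names chosen so the loop can be exercised without real waiting.

```python
import time


def wait_for(check, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Call `check()` every `interval` seconds until it returns True.

    Returns True on success, False once `timeout` seconds have elapsed.
    The check always runs at least once, so an already-satisfied
    condition never waits.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Unlike a one-shot wait that errors out when nothing matches yet, this tolerates the window in which no pod has been scheduled: "no matching resources" simply counts as a failed check and is retried.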
stevenhorsman	9e718b4e23	gha: kata-deploy: Add containerd status check After kata-deploy has installed, check that the worker nodes are still in Ready state and don't have a containerd://Unknown container runtime version, indicating that containerd isn't working, to ensure that we didn't corrupt the containerd config during kata-deploy's edits Fixes: #8678 Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2023-12-20 09:10:43 +00:00
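The health check described in 9e718b4e23 amounts to two tests per node: the Ready condition and the reported container runtime version. A minimal sketch follows, assuming the input is the parsed output of `kubectl get nodes -o json`; the function name is hypothetical and the actual CI check is a shell script not shown in this log.

```python
def unhealthy_nodes(node_list):
    """Return names of nodes that are not Ready or whose container
    runtime version is reported as containerd://Unknown."""
    bad = []
    for node in node_list["items"]:
        status = node["status"]
        ready = any(
            cond["type"] == "Ready" and cond["status"] == "True"
            for cond in status.get("conditions", [])
        )
        runtime = status.get("nodeInfo", {}).get("containerRuntimeVersion", "")
        if not ready or runtime == "containerd://Unknown":
            bad.append(node["metadata"]["name"])
    return bad


# Hand-written sample data standing in for real cluster output.
sample = {
    "items": [
        {
            "metadata": {"name": "worker-1"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "nodeInfo": {"containerRuntimeVersion": "containerd://1.7.0"},
            },
        },
        {
            "metadata": {"name": "worker-2"},
            "status": {
                "conditions": [{"type": "Ready", "status": "True"}],
                "nodeInfo": {"containerRuntimeVersion": "containerd://Unknown"},
            },
        },
    ]
}
broken = unhealthy_nodes(sample)
```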
Fabiano Fidêncio	0015257636	ci: kata-deploy: Add deploy-k8s argument to gha-run.sh We'll be using exactly the same code used for the k8s tests, which are already deploying k3s on GARM. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-09-19 13:38:10 +02:00
Fabiano Fidêncio	bf2cb02283	ci: kata-deploy: Expand tests to run on k0s / rke2 We just need to make sure the correct overlay is applied, following what we already have been doing for k3s. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-09-19 13:38:10 +02:00
Fabiano Fidêncio	9e1fb8a966	ci: kata-deploy: Export KUBERNETES env var So we have better control over which flavour of kubernetes kata-deploy is expected to be targeting. This was also done as part of `fa62a4c01b`, for the k8s tests. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-09-19 12:37:56 +02:00
Fabiano Fidêncio	2d896ad12f	gha: kata-deploy: Do the runtime class cleanup as part of the cleanup Instead of doing this as part of the test itself, let's ensure it's done before running the tests and during the tests cleanup. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-08-17 18:54:46 +02:00
Fabiano Fidêncio	4ffc2c86f3	gha: kata-deploy: Add the first kata-deploy test This test, at least for now, only checks whether the runtimeclasses have been properly created. This is just a migration from a test we had as part of the k8s suite. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-08-17 18:54:46 +02:00
Fabiano Fidêncio	285e616b5e	tests: common: Ensure test_type is used as part of the cluster's name By doing this we can make sure there won't be any clash on the cluster name created for either the k8s or the kata-deploy tests. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-08-17 14:22:16 +02:00
Fabiano Fidêncio	ce6adecd0a	gha: kata-deploy: Add run-kata-deploy-tests.sh This will have the same function as run-k8s-tests.sh has, but for kata-deploy. Right now it doesn't have any tests, and the command to actually run the tests is commented out, but right now this is just a placeholder that will be populated sooner than later. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-08-17 09:49:03 +02:00
Fabiano Fidêncio	831e73ff91	tests: kata-deploy: Add functional/kata-deploy/gha-run.sh placeholder Right now this file does nothing, as it's not even called by any GHA. However, it'll be populated later on as part of a different series, where we'll have kata-deploy specific tests running here. Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-08-14 17:46:10 +02:00

50 Commits