kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	38416f78ec	Merge pull request #13190 from manuelh-dev/mahuber/fix-num-cpus-bats tests: fix k8s-number-cpus expectation	2026-06-10 21:59:21 +02:00
Fabiano Fidêncio	92a9691470	tests: add kata-monitor helm chart k8s test Add a single-job k8s test that installs the kata-deploy helm chart with monitor.enabled=true, pointed at the per-PR kata-monitor image built earlier in the same run, and exercises both the rollout and the user-visible behaviour: * the kata-monitor DaemonSet rolls out and the pod stays up without container restarts; * a real kata-runtime probe pod is scheduled, then /metrics and /sandboxes are scraped through the apiserver pod-proxy to prove kata-monitor sees the sandbox (non-zero running-shim count plus at least one per-sandbox kata_shim_* metric); * after the probe pod is deleted, /metrics drops back to a zero running-shim count. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-09 14:33:30 +02:00
Fabiano Fidêncio	285d5daa23	tests: install latest cri-tools dynamically Resolve the cri-tools release at install time instead of pinning a version in versions.yaml: install_cri_tools now queries the GitHub releases API for the absolute latest stable tag, and the kata-monitor, cri-containerd and nydus jobs call it directly. Also write /etc/crictl.yaml during containerd setup so crictl stops emitting deprecation warnings about the legacy default endpoints. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-09 14:33:30 +02:00
Fabiano Fidêncio	63fec205fe	tests: run kata-monitor functional tests against the dedicated image Exercise the published kata-monitor container image (the one built by publish-kata-monitor-payload-amd64) rather than the on-disk binary, so integration regressions like the recent glibc/musl mismatch surface at PR time. The kata-monitor-tests.sh script keeps the binary fallback for ad-hoc local runs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-09 14:33:30 +02:00
Fabiano Fidêncio	d5bc1177c0	tests: focus kata-monitor CI on containerd active Drop the stale CRI-O matrix entry (its cri-tools pin was several releases behind) along with the exclude that hid the containerd job, and pin the remaining job to containerd's "active" track (currently v2.2) via CONTAINERD_VERSION. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-09 14:33:30 +02:00
Fabiano Fidêncio	5000000883	tests: restore SystemdCgroup in installed containerd Set runc SystemdCgroup=true when generating /etc/containerd/config.toml during containerd installation, restoring behavior that was mistakenly dropped. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-09 10:46:38 +02:00
Fabiano Fidêncio	3ca9eb94b9	cri-containerd: fix v1 sanity-check config generation Avoid emitting unsupported plugin keys and empty runtime options in the v1.x config path so containerd 1.7 can load the generated TOML during runc sanity checks. While here, let's also dump the temporary cri-integration config on failure to speed diagnosis. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-09 10:46:38 +02:00
Fabiano Fidêncio	ac2221a6a5	Merge pull request #13004 from fidencio/topic/versions-bump-containerd-to-2.3 versions: Bump containerd to 2.3	2026-06-09 08:21:58 +02:00
Manuel Huber	f37fb18b8c	tests: fix k8s-number-cpus expectation As pointed out in kata-containers/kata-containers#12961, the k8s-number-cpus retry loop could fail all retried assertions and still pass. k8s-number-cpus retried until the guest reported three CPUs, but the post-loop result was never checked. Bash suppresses errexit for the equality test before && break, so the test could exhaust retries and still pass. The current kata-qemu handler sizes vCPUs from fractional container quotas: two 500m limits produce one workload vCPU, then the default vCPU is added and rounded once. Expect two CPUs and assert the final retry result so the test fails if the count never converges. Signed-off-by: Manuel Huber <manuelh@nvidia.com> Assisted-by: OpenAI Codex <codex@openai.com>	2026-06-08 22:50:02 +00:00
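The errexit pitfall described in the commit above can be reproduced with a minimal sketch. The names `fake_cpu_count` and `retry_until_equal`, the retry count, and the reported value are hypothetical stand-ins, not code from the k8s-number-cpus test.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical stand-in for reading the guest CPU count; always reports 2.
fake_cpu_count() { echo 2; }

# Retry until the observed value equals the expected one.
# Returns 0 on convergence, 1 if every retry failed.
retry_until_equal() {
    local expected="$1" retries="$2" observed=""
    for _ in $(seq 1 "${retries}"); do
        observed="$(fake_cpu_count)"
        # errexit is suppressed for the command on the left of &&, so a
        # failed comparison here never aborts the script.
        [ "${observed}" = "${expected}" ] && break
    done
    # Assert the final result explicitly. Without this line the loop can
    # exhaust its retries and the caller still sees success.
    [ "${observed}" = "${expected}" ]
}

retry_until_equal 2 3 && echo "converged"
```

The last line of the function is the part the commit adds in spirit: the comparison is repeated outside the `&&` list so that its exit status is the function's result.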
Fabiano Fidêncio	48ebbbec3a	kata-deploy: honor debug mode with CLI log-level Make the chart pass --log-level debug automatically when debug=true so CI and troubleshooting runs emit full rendered config dumps without requiring a separate log-level override. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:25:48 +02:00
Fabiano Fidêncio	95b8e8bea9	tests: update remaining containerd callers for containerd 2.x tests/functional/vfio-ap/run.sh: - Source tests/common.bash so the schema helpers are available. - configure_containerd_for_runtime_rs: write kata-qemu-runtime-rs configuration via a conf.d drop-in. Schema >= 3 uses io.containerd.cri.v1.runtime; schema 2 uses io.containerd.grpc.v1.cri. The sandboxer field is emitted only for schema >= 3. tests/integration/nerdctl/gha-run.sh: - Fix "containerd config default" pipe: propagate PATH so the newly installed binary is found, suppress stdout, and call ensure_containerd_conf_d_rootful_api_sockets. tests/integration/kubernetes/gha-run.sh: - Fix jq filter for devmapper snapshotter (.version // 0 >= 3). - Add ensure_containerd_conf_d_rootful_api_sockets after config setup. tests/gha-run-k8s-common.sh: - Remove the redundant "containerd config default \| sed" override; overwrite_containerd_config (called via check_containerd_config_for_kata) now handles SystemdCgroup and all other containerd config setup. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	1caacda174	tests/cri-containerd: update integration tests for containerd 2.x Adapt create_containerd_config to work with containerd 2.x while keeping compatibility with v1.x for completeness: - Drop the direct config.toml patching in favour of conf.d fragments: use containerd_render_config_default_with_imports to generate the base config, then write separate drop-ins for API socket overrides, debug settings, and the Kata runtime. - Use CONTAINERD_SYSTEM_FRAGMENT_PREFIX directly (no PREFIX= indirection). - Detect cfg_schema via _containerd_blob_schema_version to select the right plugin table: schema >= 3 -> io.containerd.cri.v1.runtime schema 2 -> io.containerd.grpc.v1.cri and to emit the sandboxer field only on schema >= 3. - Pass GOTOOLCHAIN via "sudo -E make clean" so the environment variable set by export_go_toolchain_for_containerd_source_builds is preserved during the containerd source build. The require_containerd_binary_default_schema_v3_plus call is kept: the test explicitly clones and builds containerd 2.x from source, so a schema v2 binary should never appear here. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	7428832c86	tests/nydus: make containerd config schema-aware Configure containerd for nydus differently depending on the active config schema, because conf.d drop-in fragments are only honoured the same way by containerd 2.x. config_containerd now delegates to _containerd_resolved_schema_version (from common.bash) to detect the active schema and passes it to config_containerd_core, which emits schema-appropriate config: schema >= 3 (containerd v2.x): Keep the base config and add a conf.d drop-in fragment using the io.containerd.cri.v1.runtime plugin (sandboxer = 'podsandbox') and io.containerd.cri.v1.images to select nydus as the snapshotter. schema 2 (containerd v1.x): conf.d is not honoured the same way, so replace config.toml wholesale with a complete, self-contained file using the io.containerd.grpc.v1.cri plugin with nydus as the snapshotter and no sandboxer field. The [proxy_plugins] block is written in both cases as it is schema-version agnostic. Teardown restores the whole config.toml (schema v2 path) or removes the drop-in fragment (schema v3+ path) as appropriate. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	1bb43d0a19	tests/common: make overwrite_containerd_config schema-aware Rewrite overwrite_containerd_config so that it works with containerd v1.x (schema v2) as well as containerd v2.x (schema v3+): - Always regenerate /etc/containerd/config.toml from the installed binary via "sudo containerd config default". - Call ensure_containerd_conf_d_rootful_api_sockets after regenerating the base config. - Detect the effective schema via _containerd_resolved_schema_version. - Schema >= 3 (containerd v2.x): write io.containerd.cri.v1.runtime plugin path with sandboxer = podsandbox into a conf.d drop-in. - Schema 2 (containerd v1.x): write io.containerd.grpc.v1.cri plugin path without sandboxer into the drop-in. check_containerd_config_for_kata no longer appends a schema guard; the function supports both schema generations intentionally. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
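The schema-dependent plugin path described above can be sketched as follows. The function names, the runtime name `kata`, and the exact fragment layout are assumptions for illustration; only the two plugin identifiers and the `sandboxer = podsandbox` setting come from the commit message.

```shell
#!/usr/bin/env bash

# Hypothetical helper: pick the CRI plugin table for a config schema version.
cri_runtime_plugin_for_schema() {
    local schema="$1"
    if [ "${schema}" -ge 3 ]; then
        echo "io.containerd.cri.v1.runtime"   # containerd v2.x
    else
        echo "io.containerd.grpc.v1.cri"      # containerd v1.x
    fi
}

# Hypothetical helper: print a drop-in fragment for a runtime named "kata".
# The sandboxer key is emitted only for schema >= 3.
emit_kata_dropin() {
    local schema="$1" plugin
    plugin="$(cri_runtime_plugin_for_schema "${schema}")"
    echo "[plugins.'${plugin}'.containerd.runtimes.kata]"
    echo "  runtime_type = 'io.containerd.kata.v2'"
    if [ "${schema}" -ge 3 ]; then
        echo "  sandboxer = 'podsandbox'"
    fi
}

emit_kata_dropin 3
```

Keeping the plugin selection in one small function means every caller that writes a drop-in agrees on the table name for a given schema.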
Fabiano Fidêncio	18fbf4cd5d	tests/common: fix install_cri_containerd for containerd 2.x Three issues prevented containerd 2.x from working correctly after installation: 1. Socket uid/gid mismatch: "containerd config default" was run as the unprivileged user, which produced uid = <runner-uid> in the API socket stanza instead of uid = 0. Run it under sudo so the default output is owned by root. 2. Stale systemd unit: the CI runner ships a pre-installed containerd whose unit file is left in place after the binary is replaced by the test installer. The old unit causes "MigrateConfigTo: index out of range" panics when the new binary tries to load a schema v4 config. Always overwrite the unit file from the template so the running binary and the unit file stay in sync. 3. Schema guard removed: install_cri_containerd installs whatever version was requested (v1.7 or v2.3) and must not abort on a valid schema v2 binary. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	fbf133ce3a	tests/common: add containerd config schema helpers Introduce helper functions used by later commits to make containerd configuration schema-aware. _containerd_blob_schema_version(): Parse the version = <n> line from a containerd config blob and echo the integer. _containerd_resolved_schema_version(): Run "containerd config default" and return the schema version of the active binary. Drives conditional logic in overwrite_containerd_config and other helpers. containerd_emit_rootful_api_socket_overrides(): Emit the TOML fragment that fixes uid/gid on the grpc/ttrpc sockets. Schema v3 uses top-level [grpc]/[ttrpc]; schema v4+ uses plugin-scoped tables. require_containerd_config_schema_v3_plus() / require_containerd_binary_default_schema_v3_plus(): Guard helpers that abort with a clear message when the installed containerd is older than v2.x. Used only in test paths that explicitly build containerd 2.x from source. containerd_render_config_default_with_imports(): Write a fresh "containerd config default" to a file and ensure the conf.d import glob is present, ready for drop-in fragments. export_go_toolchain_for_containerd_source_builds(): Set GOTOOLCHAIN=auto so "go build" of containerd 2.x downloads the exact toolchain in its go.mod without changing the global Go version. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
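As a rough illustration of the first helper described above, the schema version can be read from a config blob with a single `sed` expression. The function name `blob_schema_version` is hypothetical and the real `_containerd_blob_schema_version` may be implemented differently.

```shell
#!/usr/bin/env bash

# Hypothetical re-implementation: read a containerd config blob on stdin and
# print the integer from the first top-level "version = <n>" line.
# Prints nothing when no such line exists.
blob_schema_version() {
    sed -n 's/^version[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

printf 'version = 3\n[plugins]\n' | blob_schema_version   # prints 3
```

On a host with containerd installed, the same function could be fed from `containerd config default` to obtain the schema of the active binary, which is the role the commit assigns to `_containerd_resolved_schema_version`.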
Fabiano Fidêncio	8ffe4e6c02	tests: add journalctl diagnostics on containerd restart failure When restart_systemd_service_with_no_burst_limit fails or times out waiting for the containerd socket, emit "journalctl -xeu containerd.service" output so the failure reason is visible in CI logs without requiring a separate log-collection step. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	e122d7ffb0	versions: bump containerd to 2.3 and define minimum/latest test matrix Bump the containerd version used by CI from v1.7.25 to v2.3.0. Rename the version-range fields in versions.yaml and throughout the GitHub Actions workflows from lts/active/version/sandbox_api to minimum/latest to make their meaning self-evident: minimum: "v1.7" # oldest containerd branch under test latest: "v2.3" # newest containerd branch under test Drop the bare version field (superseded by the matrix) and the sandbox_api alias (covered by latest). Update all containerd_version matrix entries in the workflow files accordingly, and update gha-run-k8s-common.sh to resolve the new key names. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <noreply@cursor.com>	2026-06-08 19:20:14 +02:00
Fabiano Fidêncio	b119b051cb	kata-deploy: support drop-in configs for default runtimes Allow operators to provide per-shim drop-in TOML for built-in runtimes and reconcile stale override files so upgrades and migrations remain safe when drop-ins are added or removed. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Codex	2026-06-08 13:31:03 +02:00
Fupan Li	024c2531a5	Merge pull request #13029 from fidencio/topic/rfc-composable-vm-images docs: add composable VM images design proposal	2026-06-08 18:40:35 +08:00
Fabiano Fidêncio	2440b5940b	docs: add composable VM images design proposal Add an RFC document describing the composable image architecture that replaces monolithic guest rootfs images with a lean base image plus purpose-specific addon images cold-plugged as virtio-blk devices. The proposal covers the runtime configuration (extra_images), host-side cold-plugging, guest-side mounting via systemd and dm-verity, agent-side dynamic path resolution, the image build pipeline, and the security model. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-06-07 13:58:17 +02:00
Fabiano Fidêncio	57c61e0c2f	tests: unskip hard-coded policy tests on qemu-tdx-runtime-rs Enable the hard-coded init-data policy test gate for qemu-tdx-runtime-rs so runtime-rs and Go TDX variants exercise the same Kubernetes policy coverage. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-06-06 22:48:20 +02:00
Fabiano Fidêncio	43321c7a78	Merge pull request #12931 from mythi/qemu-tdx-tests tests: fix TDX runtime-rs and initdata tests	2026-06-06 11:42:19 +02:00
Fabiano Fidêncio	f6ff9578d4	Merge pull request #13161 from kata-containers/sprt/remove-configure-mariner ci: remove Mariner annotations and use new config	2026-06-05 20:22:58 +02:00
Mikko Ylinen	013e901f1b	tests: re-enable initdata tests for qemu-tdx The coco initdata tests signature verification and authenticated registry never worked on qemu-tdx and so they have been disabled since. Add them back now that all necessary fixes are in place. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-05 16:04:05 +03:00
Mikko Ylinen	9313e336b5	tests: set image.image_pull_proxy for CDH initdata initdata tests set kernel arguments to "" which resets the kernel arguments configured by Helm install. However, TDX runner depends on agent.https_proxy= kernel arguments to pull images. In order for initdata tests to work on TDX, the same needs to be added to CDH configuration via image.image_pull_proxy. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-05 16:04:05 +03:00
Mikko Ylinen	f3a0ef6a7c	tests: use kubectl set to configure KBS env No need to patch yamls locally. Also, set RUST_LOG=debug and enable https_proxy for all TDX targets when the runner has HTTPS_PROXY set. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-05 16:04:05 +03:00
Fabiano Fidêncio	743b0a4839	Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11 versions: bump golang to 1.25.11	2026-06-04 20:24:57 +02:00
Fabiano Fidêncio	2a1ce7b8c4	Merge pull request #12539 from mythi/no-vcpu-hotplug Disable CPU hotplug when confidential guest setting enabled	2026-06-04 10:56:52 +02:00
stevenhorsman	879912be25	versions: bump golang to 1.25.11 Bump the go version to resolve CVEs: - GO-2026-5037 - GO-2026-5038 - GO-2026-5039 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-04 08:49:17 +01:00
Aurélien Bombo	de5333f275	ci: remove Mariner annotations and use new config This is a follow-up to #13126 where we forgot to remove this now-unused code. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-06-03 09:25:12 -05:00
Mikko Ylinen	018389cb22	tests: enable k8s-sandbox-vcpus-allocation.bats for tdx and coco-dev k8s-sandbox-vcpus-allocation.bats was disabled for qemu-tdx due to errors when moving to use "upstream" TDX KVM code. The failing test is vcpus-less-than-one-with-no-limits pod which ends up getting x86 default MaxCPU = 240 and erroring: Number of hotpluggable cpus requested (240) exceeds the maximum cpus supported by KVM (224) TDX max vcpus is capped to host's logical CPUs so 240 is too much. With the maxcpus logic fixed (=maxcpus not set at all) for configurations where confidential guest is enabled, qemu-tdx can be enabled for k8s-sandbox-vcpus-allocation.bats again. Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>	2026-06-03 15:27:35 +03:00
stevenhorsman	144ab161f1	tests: bump golang.org/x/sys dependency Bump golang.org/x/sys from v0.19.0 to v0.44.0 to resolve CVE: - GO-2026-5024 Signed-off-by: stevenhorsman <steven@uk.ibm.com> Generated-By: IBM Bob	2026-06-03 09:56:54 +01:00
Fabiano Fidêncio	230e01b04e	Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs runtime/runtime-rs: introduce Azure specific configs	2026-06-02 09:17:09 +02:00
Manuel Huber	7d9a143747	ci: cover EROFS snapshotter default_size=0 path kata-deploy currently hard-codes the EROFS snapshotter default_size to "10G", so the CoCo EROFS CI lane only exercises the path where the snapshotter provides an rwlayer. Use the generic containerd.userDropIn support for the EROFS default_size and thread it through the Kubernetes CI helpers. Keep the kata-deploy default at "10G" to preserve current behavior, but allow the workflow to set "0" for the runtime-rs no-rwlayer path. Expand the existing EROFS snapshotter job to run both values. The override is written to containerd as a TOML string so "0" is not parsed as an integer. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-28 22:54:56 +00:00
Fabiano Fidêncio	744ab0b548	ci: improve kata-deploy pod wait and timeout diagnostics Make kata-deploy deployment waits more robust by deriving the pod selector from the rendered helm values and using it consistently for readiness checks and logs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	81ce51a9aa	ci: target Azure CLH runtimes directly in AKS tests Switch AKS Mariner matrix entries to clh-azure handlers and remove the temporary host-OS based helm value overrides. Update integration test wiring and required test labels so CI tracks the new runtime names. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Manuel Huber	3e874d0eaf	tests: accept EROFS empty-image rootfs rejection The empty-image test expects pod creation to fail. With an EROFS snapshot that has a disk-backed rwlayer, runtime-rs can still reject that pod with the existing unsupported mount-count error. With default_size=0, there is no rwlayer mount. The same negative test can instead reach the bind rootfs shape produced for the empty active snapshot, which runtime-rs rejects as an unsupported rootfs mount. Accept both messages so the test covers the expected failure for both EROFS rwlayer modes. Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-27 17:12:20 +00:00
Manuel Huber	6a715cf4f7	tests: nvidia: No policy for runtime-rs path The current if condition causes agent security policies to be attached to the non-TEE NVIDIA runtime-rs runtime class. While this is good to see that it works, this is not intended. Thus, replacing the condition with is_confidential_gpu_hypervisor. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-25 16:00:49 -07:00
Fabiano Fidêncio	f763e9cca9	tests: Add NUMA topology / GPU placement tests to the NV CIs Add k8s-nvidia-numa.bats with five tests that validate NUMA behaviour on hosts where NUMA is configured by default (qemu-nvidia-gpu, qemu-nvidia-gpu-snp, qemu-nvidia-gpu-tdx): 1. Multi-node sandbox (large workload spanning all host NUMA nodes): - Guest NUMA node count matches host - Guest vCPU distribution is balanced across nodes (max-min <= 1) - Guest memory is distributed across NUMA nodes - Host-side vCPU pinning is balanced across NUMA nodes 2. Right-sized single-node sandbox (small workload fitting one node): - Guest collapses to a single NUMA node - All host vCPU threads pinned to that one NUMA node 3. GPU passthrough with VFIO, multi-node: - Guest NUMA topology is balanced (same as test 1) - Guest GPU's NUMA node matches the host GPU's NUMA node (resolved via the vfio-pci,host=<BDF> from the QEMU command line and /sys/bus/pci/devices/<BDF>/numa_node) - QEMU command line contains pxb-pcie and policy=bind - Host vCPU pinning is balanced 4. GPU passthrough with VFIO, right-sized single-node: small workload plus GPU that fits in a single host NUMA node: - Guest collapses to a single NUMA node - The chosen node is the GPU's host NUMA node, not just any node that fits — verified by matching host-nodes= in the memory backend and pxb-pcie numa_node= against the GPU's host node - Guest GPU reports the same NUMA node as the host GPU 5. Explicit numa_mapping in the runtime TOML (QEMU-only): - Drops a config.d/ fragment that sets numa_mapping = ["1"], so the auto-derive + right-sizing path is bypassed entirely - Guest sees exactly 1 NUMA node - QEMU memory backend is bound to host node 1 (host-nodes=1, policy=bind), not host node 0 - Host-side vCPU threads land on host node 1 - Drop-in is removed on teardown so subsequent tests are unaffected Guest-side checks use a dedicated container image (quay.io/kata-containers/numa) that reads sysfs and prints results to stdout — no kubectl exec or CoCo policy overrides needed. Host-side checks (crictl, pgrep, taskset) run directly on the host via sudo; a standalone numa-pinning-check.sh script handles the vCPU thread affinity inspection. The config.d/ helpers used by test 5 are runtime-agnostic (probe Go vs runtime-rs layout on disk) but the test is gated to qemu-* shims since runtime-rs does not yet implement NUMA. Skips cleanly on single-NUMA hosts, unsupported hypervisors, or when no nvidia.com/pgpu resources are available (GPU tests only). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	20705470e9	docs: Add NUMA support guide for Kata Containers with QEMU Add a step-by-step how-to guide covering host inspection, Kata NUMA drop-in setup (via kata-deploy Helm and manual config.d/), pod deployment examples, and guest/host verification procedures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	cbcdd999e4	Merge pull request #12957 from Apokleos/fix-sb-api runtime-rs: Fix sandbox-api lifecycle and CRI status handling	2026-05-23 09:26:14 +02:00
Fabiano Fidêncio	5d3e1e6396	kata-deploy: verify kata-runtime label remains stable on rke2/k3s The retry loop added in `efd468df3f` still allows the install to declare success while inside the kubelet's post-restart re-register window. On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]` every 2 s and returns on the first `True` observation it sees. By default the kubelet only publishes node status every ~10 s, so that first `True` is almost always the stale value from before the restart — the kubelet hasn't actually finished restarting yet. `label_node_with_retry` then applies the label, sleeps 1 s, reads back "true" (still stale, kubelet still down), and returns Ok. Install completes, `/readyz` flips to 200, helm releases its `--wait`, and the bats test starts — and only then does the kubelet finish coming up, re-register the node, and clobber the label with its cached set. The lifecycle test sees an empty `katacontainers.io/kata-runtime` and fails: # Node label katacontainers.io/kata-runtime: not ok 1 Kata artifacts are present on host after install A single-shot verification can't distinguish "still stale true" from "truly stable true after kubelet re-register". Replace it with a stability window: after (re)applying the label, require it to remain at the expected value for STABILITY_CHECKS=6 consecutive observations spaced CHECK_INTERVAL=2 s apart (≈ 12 s — comfortably more than the kubelet's status-update period). If the value ever drifts inside the window, re-apply and restart the stability counter. Bounded by MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s to install. Also add a short polling loop to the test's own label assertion as belt-and-suspenders for any leftover transient race, matching the existing retry pattern used for the container-runtime version check. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-22 11:53:18 +02:00
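The stability window described in the commit above can be sketched in shell. This is not the kata-deploy implementation: `wait_for_stable_value`, its argument order, and the `apply_label`/`read_label` stand-ins are hypothetical, and a temporary file plays the role of the node label.

```shell
#!/usr/bin/env bash

# Succeed only if `observe` prints `expected` for `checks` consecutive reads
# spaced `interval` seconds apart. On drift, call `apply` again and restart
# the counter, up to `max_attempts` times.
wait_for_stable_value() {
    local expected="$1" checks="$2" interval="$3" max_attempts="$4"
    local observe="$5" apply="$6"
    local stable
    for _ in $(seq 1 "${max_attempts}"); do
        "${apply}"
        stable=0
        while [ "${stable}" -lt "${checks}" ]; do
            sleep "${interval}"
            if [ "$("${observe}")" != "${expected}" ]; then
                break   # drifted: re-apply and restart the window
            fi
            stable=$((stable + 1))
        done
        if [ "${stable}" -ge "${checks}" ]; then
            return 0
        fi
    done
    return 1
}

# Hypothetical stand-ins: the "label" lives in a temporary file.
state="$(mktemp)"
apply_label() { echo true > "${state}"; }
read_label() { cat "${state}"; }

wait_for_stable_value true 3 0 2 read_label apply_label && echo "stable"
```

With the values quoted in the commit, the call would pass 6 checks, a 2 s interval, and 12 attempts, which gives the roughly 12 s happy path and the bounded worst case the message describes.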
Alex Lyn	adf6d43e24	test: skip TestContainerMemoryUpdate for sandbox api Temporarily skip the `TestContainerMemoryUpdate` test case for sandbox api. This test case is currently skipped in other VMMs (e.g., QEMU, Cloud-Hypervisor) due to known issues and environmental stability concerns. To maintain consistency across the project, we are skipping it for sandbox as well. A follow-up PR will be dedicated to addressing these issues and properly enabling/refining this test case for all VMMs. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:46:44 +08:00
Alex Lyn	b5349f4d78	versions: bump containerd to 2.3 for sandbox API tests containerd 2.3 requires Go 1.26.3, but Kata still pins Go 1.25.10. Use Go 1.26.3 for the sandbox-api job so that make cri-integration can build containerd from source. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:46:16 +08:00
Alex Lyn	9f78dc687f	tests: exclude TestContainerRestart from the cri-containerd test list Creating a new container in the same sandbox VM after the previous container has exited and been removed has never been supported by kata-containers (neither with the go-based nor the rust-based runtime). When the last container is removed the kata VM shuts down, so any attempt to start a new container in the same sandbox fails. This test exercises a use-case kata does not currently support, and it has never been part of the passing list for good reason. Mark it explicitly excluded with a comment so it is clear this is a deliberate omission rather than an oversight. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:45:50 +08:00
Alex Lyn	a7739579d6	tests: Use podsandbox sandboxer for the runc sanity check The check_daemon_setup function verifies that containerd + runc are functional before the real kata tests run. Using the shim sandboxer for this runc check hits a known containerd bug where the OCI spec is not populated before NewBundle is called, so config.json is never written and containerd-shim-runc-v2 fails at startup. See containerd/containerd#11640 The sandboxer choice is irrelevant for this sanity check, so use podsandbox which works correctly with runc. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-22 10:44:38 +08:00
stevenhorsman	f47d1c0d69	tests/agent-ctl: Add debug The agent-ctl tests are failing in the CI, but there is no log reporting, so debugging is not possible. Add some debug to help. Assisted-by: IBM Bob Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-05-19 12:00:47 +01:00
Fabiano Fidêncio	2c1dec0c14	Merge pull request #13035 from stevenhorsman/docs-static-checks-cleanup ci: remove docs URL alive check workflow	2026-05-18 17:59:03 +02:00
Fabiano Fidêncio	05f836ea23	Merge pull request #13038 from stevenhorsman/move-k8s-measured-rootfs ci: Move measure-rootfs to run on TEE PRs	2026-05-18 17:29:25 +02:00


2158 Commits