Commit Graph

2117 Commits

Author SHA1 Message Date
Fabiano Fidêncio
cbcdd999e4 Merge pull request #12957 from Apokleos/fix-sb-api
runtime-rs: Fix sandbox-api lifecycle and CRI status handling
2026-05-23 09:26:14 +02:00
Fabiano Fidêncio
5d3e1e6396 kata-deploy: verify kata-runtime label remains stable on rke2/k3s
The retry loop added in efd468df3f still allows the install to declare
success while inside the kubelet's post-restart re-register window.

On rke2/k3s, `systemctl restart rke2-agent` restarts both containerd
and the kubelet, but `wait_till_node_is_ready` polls `.status.conditions[Ready]`
every 2 s and returns on the first `True` observation it sees. By default
the kubelet only publishes node status every ~10 s, so that first `True`
is almost always the stale value from before the restart — the kubelet
hasn't actually finished restarting yet. `label_node_with_retry` then
applies the label, sleeps 1 s, reads back "true" (still stale, kubelet
still down), and returns Ok. Install completes, `/readyz` flips to 200,
helm releases its `--wait`, and the bats test starts — and only then
does the kubelet finish coming up, re-register the node, and clobber
the label with its cached set. The lifecycle test sees an empty
`katacontainers.io/kata-runtime` and fails:

  # Node label katacontainers.io/kata-runtime:
  not ok 1 Kata artifacts are present on host after install

A single-shot verification can't distinguish "still stale true" from
"truly stable true after kubelet re-register". Replace it with a
stability window: after (re)applying the label, require it to remain
at the expected value for STABILITY_CHECKS=6 consecutive observations
spaced CHECK_INTERVAL=2 s apart (≈ 12 s — comfortably more than the
kubelet's status-update period). If the value ever drifts inside the
window, re-apply and restart the stability counter. Bounded by
MAX_APPLY_ATTEMPTS=12, so worst case is ~3 min; happy path adds ~12 s
to install.

Also add a short polling loop to the test's own label assertion as
belt-and-suspenders for any leftover transient race, matching the
existing retry pattern used for the container-runtime version check.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-22 11:53:18 +02:00
Alex Lyn
adf6d43e24 test: skip TestContainerMemoryUpdate for sandbox api
Temporarily skip the `TestContainerMemoryUpdate` test case
for sandbox api.

This test case is currently skipped in other VMMs (e.g.,
QEMU, Cloud-Hypervisor) due to known issues and environmental
stability concerns.
To maintain consistency across the project, we are skipping it
for sandbox as well.

A follow-up PR will be dedicated to addressing these issues and
properly enabling/refining this test case for all VMMs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:46:44 +08:00
Alex Lyn
b5349f4d78 versions: bump containerd to 2.3 for sandbox API tests
containerd 2.3 requires Go 1.26.3, but Kata still pins Go 1.25.10.
Use Go 1.26.3 for the sandbox-api job so that make cri-integration
can build containerd from source.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:46:16 +08:00
Alex Lyn
9f78dc687f tests: exclude TestContainerRestart from the cri-containerd test list
Creating a new container in the same sandbox VM after the previous
container has exited and been removed has never been supported by
kata-containers (neither with the go-based nor the rust-based runtime).
When the last container is removed the kata VM shuts down, so any
attempt to start a new container in the same sandbox fails.

This test exercises a use-case kata does not currently support, and it
has never been part of the passing list for good reason.  Mark it
explicitly excluded with a comment so it is clear this is a deliberate
omission rather than an oversight.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:45:50 +08:00
Alex Lyn
a7739579d6 tests: Use podsandbox sandboxer for the runc sanity check
The check_daemon_setup function verifies that containerd + runc are
functional before the real kata tests run. Using the shim sandboxer
for this runc check hits a known containerd bug where the OCI spec
is not populated before NewBundle is called, so config.json is never
written and containerd-shim-runc-v2 fails at startup.

See containerd/containerd#11640

The sandboxer choice is irrelevant for this sanity check, so use
podsandbox which works correctly with runc.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-22 10:44:38 +08:00
stevenhorsman
f47d1c0d69 tests/agent-ctl: Add debug
The agent-ctl tests are failing in the CI, but there is no log reporting,
so debugging is not possible. Add some debug to help.

Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-19 12:00:47 +01:00
Fabiano Fidêncio
2c1dec0c14 Merge pull request #13035 from stevenhorsman/docs-static-checks-cleanup
ci: remove docs URL alive check workflow
2026-05-18 17:59:03 +02:00
Fabiano Fidêncio
05f836ea23 Merge pull request #13038 from stevenhorsman/move-k8s-measured-rootfs
ci: Move measure-rootfs to run on TEE PRs
2026-05-18 17:29:25 +02:00
Hyounggyu Choi
f6fce19e01 Merge pull request #13062 from BbolroC/skip-coco-test-with-no-reference-values-ibm-sel
test: skip CDH resource test for qemu-se without reference values
2026-05-18 14:47:50 +02:00
Hyounggyu Choi
540986bc8f test: skip CDH resource test for qemu-se without reference values
Since gc and trustee were bumped (#13046), the test
"Cannot get CDH resource when affirming policy is set without reference values"
has started failing for IBM SEL.

The attestation policy for IBM SEL returns an "affirming"
result whenever the claim can be parsed successfully,
meaning the evidence verification succeeds. As a result,
the negative test above always produces a positive result.

Skip this negative test for IBM SEL environments
(e.g. qemu-se*).

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-05-18 08:40:16 +02:00
Steve Horsman
59b27c4645 Merge pull request #13057 from microsoft/danmihai1/deploy-check-hypervisor-name
gha: k8s: reject unsupported KATA_HYPERVISOR values
2026-05-17 18:43:49 +01:00
Dan Mihai
ddc36060d2 gha: k8s: reject unsupported KATA_HYPERVISOR values
Exit early with an error message instead of starting kata-deploy if
the value of KATA_HYPERVISOR is not expected during CI.

For example: "cloud-hypervisor" was renamed recently to
"clh-runtime-rs" and user scripts depending on the old name were
getting tangled in kata-deploy instead of just rejecting the old
value quickly.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-16 01:04:31 +00:00
Dan Mihai
b85fc8ed13 tests: export target_branch="${branch}"
Avoid running "git remote show origin" repeatedly when common.bash
gets sourced multiple times and target_branch was not specified by
the caller.

Repeated "git remote show origin" calls inflicted the additional
overhead of authenticating and communicating with the remote git
repository.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-16 00:59:44 +00:00
Dan Mihai
0f3df5d1e4 Merge pull request #13025 from manuelh-dev/mahuber/img-pull-policy
tests: generate guest-pull image pull agent security policies
2026-05-15 14:09:00 -07:00
Fabiano Fidêncio
c19bdbf23b tests: nvidia-nim: use trusted storage templates for runtime-rs
Now that runtime-rs supports block-encrypted emptyDir volumes, remove
the no-trusted-storage workaround templates and the is_runtime_rs
branching in the NIM test. Runtime-rs now uses the same TEE templates
as the Go runtime with emptyDir + PVC at 48Gi memory, instead of the
128Gi workaround that compensated for lacking trusted storage.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-14 22:56:11 +02:00
Fabiano Fidêncio
54aaa1ea2a tests: enable trusted ephemeral storage for runtime-rs
Remove the runtime-rs skip from the trusted ephemeral data storage
test now that runtime-rs implements block-encrypted emptyDir volumes.

Also remove the genpolicy drop-in that disabled encrypted_emptydir
for runtime-rs and the corresponding copy logic in tests_common.sh.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-14 22:56:11 +02:00
Manuel Huber
ed4233bf91 rootfs: cdh: Update CDH to new version
Update CDH to a newer version and:
- adjust the NVIDIA root filesystem build to reflect the change from
  using libcryptsetup to using the cryptsetup binary.
- adjust image-pull test cases to conduct parallel write operations
  on the /dev/trusted_store backed guest image pull location since
  issue #12721 has been solved on CDH side.

Fixes #12721

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-13 20:20:45 +02:00
stevenhorsman
5c55726d11 tests/k8s: Update measured-rootfs image
Try and switch the docker nginx image to our versions.yaml
one so we avoid rate limit issues

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-13 17:17:48 +01:00
stevenhorsman
2870f7c2dd ci: Move measure-rootfs to run on TEE PRs
k8s-measured-rootfs only runs on confidential runtime,
so we should move it into the subset on tests that run on TEEs

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-13 17:01:50 +01:00
stevenhorsman
9398ab3433 ci: remove docs URL alive check workflow
The docs URL alive check workflow has been disabled for months
and never passed since we moved to GHA., so removes the workflow
and all associated code.

Note: Although the static-checks.sh --doc code was wider scope than
URL check, it wasn't being called anywhere else, so it was removed too.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
2026-05-13 09:12:00 +01:00
Greg Kurz
d2dc0a923c Merge pull request #13030 from stevenhorsman/go-1.25.10-bump
Go 1.25.10 bump
2026-05-13 08:09:51 +02:00
Dan Mihai
3799473041 Merge pull request #13010 from microsoft/danmihai1/label-references
genpolicy: support env variable values sourced from metadata.labels values
2026-05-12 15:41:11 -07:00
Manuel Huber
da4307efb7 tests: generate policies for guest-pull images
Replace guest-pull image allow-all placeholders with explicit
auto-generated policies for each generated pod manifest.

Generate policy after the final YAML edits so initdata and image
pull secrets are represented in the policy inputs.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-12 15:03:15 -07:00
Manuel Huber
6a274b5110 tests: seed auto-generated policy from initdata
Teach auto_generate_policy to reuse a cc_init_data annotation by
decoding it into the temporary default-initdata.toml file.

This lets tests preserve CDH initdata while genpolicy appends the
generated agent security policy for the workload.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-12 15:03:14 -07:00
Manuel Huber
e774b13c95 tests: share genpolicy registry auth setup helper
Move the Docker auth setup into common.bash so tests beyond the
NVIDIA runner can provide credentials for genpolicy image pulls.

Make the registry, username, password and output directory explicit
while preserving the nvcr.io setup used by the NIM tests.

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-12 15:03:14 -07:00
Greg Kurz
733ccb3254 Merge pull request #12996 from stevenhorsman/swap-agent-ctl-to-skopeo&umoci
agent-ctl: Swap rootfs bundle pull implementation
2026-05-12 19:12:27 +02:00
stevenhorsman
4a65aca9cf versions: bump golang to 1.25.10
Bump the go version to resolve CVEs:
- GO-2026-4918
- GO-2026-4971
- GO-2026-4976
- GO-2026-4977
- GO-2026-4980
- GO-2026-4981
- GO-2026-4982
- GO-2026-4986

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
2026-05-12 11:56:13 +01:00
Manuel Huber
c265e4905f tests: nvidia: avoid NIM journal dumps on success
BATS_TEST_COMPLETED is per-test and remains empty in teardown_file.
Track file-level state so successful NIM runs skip the journal dump
while setup or test failures still include node diagnostics.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 09:10:01 -07:00
Manuel Huber
1c081ff434 tests: nvidia: place NIM service into namespace
Place the NIM service into our test namespace. We are still observing
various situations where for some reasons, the NIM service appears in
the default namespace in our CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-10 07:36:23 +00:00
Fabiano Fidêncio
f7be57efe2 Merge pull request #13007 from manuelh-dev/mahuber/dbg-nim-svc
tests: nvidia: Wait for NIM operator pod and print
2026-05-08 20:58:51 +02:00
Manuel Huber
714adec3f8 tests: nvidia: Wait for NIM operator pod and print
Wait for the NIM operator pod to run before deploying NIM services.
Add a temporary debug function to print resource placement into the
different namespaces. Remove this function again when the NIM tests
are stabilized.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-08 06:27:48 +00:00
Ubuntu
b95be5332a genpolicy: env variables from metadata.labels
Add basic genpolicy support for container environment variables sourced
from metadata.labels.

In this implementation, the relevant labels must be available as input
to the policy tool. This is slightly different from the way variables
sourced from metadata.annotations are treated by the tool: when the
relevant annotation is not available as input, the generated Policy
allows any value. Depending on metadata.labels use cases that we might
encounter maybe the labels will be handled the same way as the
annotations in the future.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-07 23:35:56 +00:00
Dan Mihai
39b9c318e2 tests: k8s: merge two policy-pod test cases
One of these test cases was a subset of the other, so remove that
redundancy.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-07 22:39:23 +00:00
stevenhorsman
e92d954b51 agent-ctl: Swap rootfs bundle pull implementation
Switch the rootfs bundle pull implementatio from using image-rs to
use skopeo and umoci to remove the really long crate dependency
tail that image-rs brings.

Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-05-07 21:11:27 +01:00
Fabiano Fidêncio
8dde5f39b7 tests: dump kata-deploy pod describe+logs on install timeout
When kubectl wait times out the pod never reached Ready, so the
existing log collection (which runs after wait succeeds) produces
"-- No entries --" with zero useful information.

Capture kubectl describe and kubectl logs (including previous
container) immediately on timeout so the next CI run shows exactly
why the pod is stuck (ImagePullBackOff, OOMKilled, probe failures,
containerd restart hang, etc.).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
0f3160276b ci: k8s: skip no-op Helm uninstall on free runners
In cleanup_kata_deploy, bail out early when no kata-deploy Helm release
exists so baremetal-* pre-deploy cleanup on fresh clusters does not
block on helm uninstall --wait (up to 10m).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-07 13:40:55 +02:00
Fabiano Fidêncio
19c194aa94 ci: Add runtime-rs GPU shims to NVIDIA GPU CI workflow
Add qemu-nvidia-gpu-runtime-rs and qemu-nvidia-gpu-snp-runtime-rs to
the NVIDIA GPU test matrix so CI covers the new runtime-rs shims.

Introduce a `coco` boolean field in each matrix entry and use it for
all CoCo-related conditionals (KBS, snapshotter, KBS deploy/cleanup
steps). This replaces fragile name-string comparisons that were already
broken for the runtime-rs variants: `nvidia-gpu (runtime-rs)` was
incorrectly getting KBS steps, and `nvidia-gpu-snp (runtime-rs)` was
not getting the right env vars.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-07 10:33:26 +02:00
Fabiano Fidêncio
acfb9f9762 Merge pull request #12954 from zvonkok/modular-makefile
build: remove gha-adjust-to-use-prebuilt-components.sh
2026-05-07 10:32:32 +02:00
manuelh-dev
8473144ee5 Merge pull request #12989 from microsoft/danmihai1/ignore-unnecessary-fields
genpolicy: ignore additional irrelevant fields
2026-05-06 23:54:39 -07:00
Greg Kurz
16bc6db59e static-checks: Drop vendor checks
The repo doesn't track vendor code anymore. Also, I could not find any
evidence that this code is actually called. The reference to URL

```
https://github.com/kata-containers/community/blob/main/VENDORING.md
```

that was recently removed by

https://github.com/kata-containers/community/pull/442

is another indication that this flow is outdated.

Drop it.

Signed-off-by: Greg Kurz <groug@kaod.org>
2026-05-06 09:49:53 +02:00
Greg Kurz
af54cd8a27 tests: Remove vendor directory
Now shipped in the vendored code tarball.

Signed-off-by: Greg Kurz <groug@kaod.org>
2026-05-06 09:32:05 +02:00
Dan Mihai
fcee4864e7 genpolicy: ignore additional PodAffinity fields
1. Ignore PodAffinity's preferredDuringSchedulingIgnoredDuringExecution.
2. Ignore additional PodAffinityTerm fields.
3. Add basic tests for the new fields.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-06 01:38:02 +00:00
Dan Mihai
b6349f50ab genpolicy: ignore preemptionPolicy
Ignore the pod preemptionPolicy field from input YAML - irrelevant
for building the Policy.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-06 00:35:27 +00:00
Dan Mihai
9f4a7a9d55 Merge pull request #12978 from microsoft/danmihai1/empty-env-var
genpolicy: support empty environment variables
2026-05-05 14:10:35 -07:00
Dan Mihai
99dd897814 genpolicy: support empty environment variables
K8s supports them, so genpolicy should support them too.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-05 18:53:25 +00:00
Fabiano Fidêncio
29e63c21a1 tests: k8s-cron-job: set runtimeClassName to kata
The cron-job test workload was missing `runtimeClassName: kata`, which
meant the cron job was not actually being executed under the Kata
runtime, defeating the purpose of the test.

Set it explicitly, consistent with the sibling `job.yaml` workload.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-05 11:21:05 +02:00
Fabiano Fidêncio
27c3dfbb8c Merge pull request #12943 from fidencio/topic/kata-deploy-add-http-health-probes
kata-deploy: add HTTP health probes (healthz/readyz)
2026-05-05 09:30:17 +02:00
Dan Mihai
0a6dc2fae0 ci: mariner: use OCI version 1.2.1
Mariner moved from version 1.2.0 to version 1.2.1.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-05-05 02:23:30 +00:00
Fabiano Fidêncio
9e3bd6b576 tests: fix kata-deploy lifecycle test reliability
Fix two issues in kata-deploy-lifecycle.bats that caused failures on
k3s, k0s and rke2:

  run_on_host():
  - `kubectl run --rm -i` causes k3s/rke2 to inject session-recording
    banners into stdout, polluting command output and breaking string
    assertions. Replace with a create/wait/logs/delete sequence so only
    the container's actual stdout is captured.

  "Artifacts are fully cleaned up after uninstall":
  - After a CRI restart the kubelet may briefly report "Unknown" for the
    container runtime version. Retry for up to 60s before asserting.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-03 22:09:08 +02:00