Commit Graph

1872 Commits

Author SHA1 Message Date
Fabiano Fidêncio
aa3a795cd4 tests: k8s: coco: rely more on free runners
Run all CoCo non-TEE variants in a single job on the free runner with an
explicit environment matrix (vmm, snapshotter, pull_type, kbs,
containerd_version).

Here we're testing CoCo only with the "active" version of containerd.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-19 21:21:57 +01:00
Fabiano Fidêncio
50fab83173 tests: k8s: rely more on free runners
We were running most of the k8s integration tests on AKS. The ones that
don't actually depend on AKS's environment now run on normal
ubuntu-24.04 GitHub runners instead: we bring up a kubeadm cluster
there, test with both containerd lts and active, and skip attestation
tests since those runtimes don't need them. AKS is left only for the
jobs that do depend on it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-19 20:44:41 +01:00
Dan Mihai
ea53779b90 ci: k8s: temporarily disable mariner host
Disable mariner host testing in CI, and auto-generated policy testing
for the temporary replacements of these hosts (based on ubuntu), to work
around missing:

1. cloud-hypervisor/cloud-hypervisor@0a5e79a, that will allow Kata
   in the future to disable the nested property of guest VPs. Nested
   is enabled by default and doesn't work yet with mariner's MSHV.
2. cloud-hypervisor/cloud-hypervisor@bf6f0f8, exposed by the large
   ttrpc replies intentionally produced by the Kata CI Policy tests.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-02-19 20:42:50 +01:00
Dan Mihai
3e2153bbae ci: k8s: easier to modify az aks create command
Make `az aks create` command easier to change when needed, by moving the
arguments specific to mariner nodes onto a separate line of this script.
This change also removes the need for `shellcheck disable=SC2046` here.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-02-19 20:42:50 +01:00
Steve Horsman
6a67250397 Merge commit from fork
runtime-go/rs: Disable virtio-pmem for Cloud Hypervisor
2026-02-19 09:00:56 +00:00
Chiranjeevi Uddanti
88203cbf8d tests: Add regression test for sandbox_cgroup_only=false
Add unit test for get_ch_vcpu_tids() and integration test that creates
a pod with sandbox_cgroup_only=false to verify it starts successfully.

Signed-off-by: Chiranjeevi Uddanti <244287281+chiranjeevi-max@users.noreply.github.com>
Co-authored-by: Antigravity <antigravityagent@google.com>
2026-02-18 20:20:14 +01:00
Aurélien Bombo
8ff9cd1f12 Merge pull request #12455 from ajaypvictor/secret-cm-without-sharedfs
ci: Add integration tests for secret & configmap propagation
2026-02-18 12:06:48 -06:00
Aurélien Bombo
336b922d4f tests/cbl-mariner: Stop disabling NVDIMM explicitly
This is not needed anymore since now disable_image_nvdimm=true for
Cloud Hypervisor.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-02-18 11:52:51 -06:00
Dan Mihai
eee25095b5 tests: mariner annotations for k8s-openvpn
This test uses YAML files from a different directory than the other
k8s CI tests, so annotations have to be added into these separate
files.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-02-18 07:17:04 +01:00
Fabiano Fidêncio
80a175d09b kata-deploy: Add TEE nodeSelectors for TEE shims when NFD is detected
When NFD is detected (deployed by the chart or existing in the cluster),
apply shim-specific nodeSelectors only for TEE runtime classes (snp,
tdx, and se).

Non-TEE shims keep existing behavior (e.g. runtimeClass.nodeSelector for
nvidia GPU from f3bba0885 is unchanged).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-16 12:07:51 +01:00
Ajay Victor
83935e005c ci: Add integration tests for secret & configmap propagation
Enhance k8s-configmap.bats and k8s-credentials-secrets.bats to test that ConfigMap and Secret updates propagate to volume-mounted pods.

- Enhanced k8s-configmap.bats to test ConfigMap propagation
  * Added volume mount test for ConfigMap consumption
  * Added verification that ConfigMap updates propagate to volume-mounted pods

- Enhanced k8s-credentials-secrets.bats to test Secret propagation
  * Added verification that Secret updates propagate to volume-mounted pods

Fixes #8015

Signed-off-by: Ajay Victor <ajvictor@in.ibm.com>
2026-02-14 08:56:21 +05:30
Fabiano Fidêncio
2930c68c0b ci: tdx: properly skip k8s-sandbox-vcpus-allocation.bats
This is a follow-up for 25962e9325

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-13 20:56:08 +01:00
Fabiano Fidêncio
8cb7d0be9d tests: nvidia: Fix genpolicy error when pulling nvcr.io images
genpolicy pulls image manifests from nvcr.io to generate policy and was
failing with 'UnauthorizedError' because it had no registry credentials.

Genpolicy (src/tools/genpolicy) uses docker_credential::get_credential()
in registry.rs, which reads from DOCKER_CONFIG/config.json. Add
setup_genpolicy_registry_auth() to create a Docker config with nvcr.io
auth (NGC_API_KEY) and set DOCKER_CONFIG before running genpolicy so it
can authenticate when pulling manifests.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-13 13:12:55 +01:00
Fabiano Fidêncio
6a3bbb1856 tests: Retry k8s deployment
We've seen a lot of spurious issues when deploying the infra needed for
the tests.  Let's give it a few tries before actually failing.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-02-12 20:13:59 +01:00
Mikko Ylinen
25962e9325 tests/coco: disable k8s-sandbox-vcpus-allocation.bats for TDX
After the move to Linux 6.17 and QEMU 10.2 from Kata,
k8s-sandbox-vcpus-allocation.bats started failing on TDX.

2026-02-10T16:39:39.1305813Z # pod/vcpus-less-than-one-with-no-limits created
2026-02-10T16:39:39.1306474Z # pod/vcpus-less-than-one-with-limits created
2026-02-10T16:39:39.1307090Z # pod/vcpus-more-than-one-with-limits created
2026-02-10T16:39:39.1307672Z # pod/vcpus-less-than-one-with-limits condition met
2026-02-10T16:39:39.1308373Z # timed out waiting for the condition on pods/vcpus-less-than-one-with-no-limits
2026-02-10T16:39:39.1309132Z # timed out waiting for the condition on pods/vcpus-more-than-one-with-limits
2026-02-10T16:39:39.1310370Z # Error from server (BadRequest): container "vcpus-less-than-one-with-no-limits" in pod "vcpus-less-than-one-with-no-limits" is waiting to start: ContainerCreating

A manual test without agent policies added it seems to work OK but disable
the test for now to get CI stable.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-02-11 22:02:59 +01:00
Hyounggyu Choi
c84e37f6ac Merge pull request #12486 from BbolroC/cpu-hotplug-s390x-runtime-rs
runtime-rs: Skip sockets and threads for hotplug_vcpus on Z/P
2026-02-11 09:40:21 +01:00
Hyounggyu Choi
67f54bdcb5 tests: Remove skip condition for runtime-rs on s390x in k8s-cpu-ns
This commit removes the skip condition for qemu-runtime-rs on s390x
in k8s-cpu-ns.bats.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-02-11 05:52:13 +01:00
stevenhorsman
15d6a681ed doc: Fix spelling issues
Put things in backticks

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-02-10 21:58:28 +01:00
Fabiano Fidêncio
5c0269881e tests: Make editorconfig-checker happy
- Trim trailing whitespace and ensure final newline in non-vendor files
- Add .editorconfig-checker.json excluding vendor dirs, *.patch, *.img,
  *.dtb, *.drawio, *.svg, and pkg/cloud-hypervisor/client so CI only
  checks project code
- Leave generated and binary assets unchanged (excluded from checker)

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-02-10 21:58:28 +01:00
Fabiano Fidêncio
cb652e0da1 tests: Update NVRC trace to use drop-in config mechanism
Update the enable_nvrc_trace() function to use the new drop-in
configuration mechanism instead of directly modifying the base
configuration file. The function now creates a 90-nvrc-trace.toml
drop-in file that properly combines existing kernel parameters
with the nvrc.log=trace setting.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-10 18:12:17 +01:00
Manuel Huber
a6ca5c6628 ci: add editorconfig checker
This adds a basic configuration for editorconfig checker. The
supplied configuration checks against trailing whitespaces and
issues with newlines.
Example:
| tools/packaging/kernel/configs/fragments/x86_64/numa.conf:
|       Wrong line endings or no final newline
| tools/packaging/release/generate_vendor.sh:
|       44: Trailing whitespace

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-02-09 15:03:26 -08:00
Manuel Huber
525192832f tests: Clean up superfluous GPU annotation
This annotation was required for GPU cold-plug before using a
newer device plugin and before querying the pod resources API.
As this annotation is no longer required, cleaning it up.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-02-09 11:28:24 -08:00
Alex Lyn
3fda59e27d tests: rename pod_exec_with_retries to pod_exec and update callers
It will do following works in this commit:

(1) Rename pod_exec_with_retries() to pod_exec().
(2) Update implementation to call container_exec().
(3) Replace all usages of pod_exec_with_retries across tests
    with pod_exec.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-02-09 15:56:13 +01:00
Alex Lyn
861d39305c tests: drop kubectl exec retries in container_exec
This commit aims to drop retries when kubectl exec a container:

(1) Rename container_exec_with_retries() to container_exec().
(2) Remove the retry loop and sleep backoff around kubectl exec.

Keep the same logging and container-selection logic and return
kubectl exec exit status directly.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-02-09 15:56:13 +01:00
stevenhorsman
b29312289f versions: Bump go to 1.24.13
Bump go to 1.24.13 to fix CVE GO-2026-4337

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-02-09 14:49:31 +01:00
Manuel Huber
cf7f340b39 tests: Read and overwrite kernel_verity_parameters
Read the kernel_verity_paramers from the shim config and adjust
the root hash for the negative test.
Further, improve some of the test logic by using shared
functions. This especially ensures we don't read the full
journalctl logs on a node but only the portion of the logs we are
actually supposed to look at.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-02-05 23:04:35 +01:00
Manuel Huber
e120dd4cc6 tests: cc: Remove quotes from kernel command line
With dm-mod.create parameters using quotes, we remove the
backslashes used to escape these quotes from the output we
retrieve. This will enable attestation tests to work with the
kernelinit dm-verity mode.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-02-05 23:04:35 +01:00
Manuel Huber
282014000f tests: cc: support initrd, image for attestation
Allow using an image instead of an initrd. For confidential
guests using images, the assumption is that the guest kernel uses
dm-verity protection, implicitly measuring the rootfs image via
the kernel command line's dm-verity information.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-02-05 23:04:35 +01:00
Fabiano Fidêncio
dda1b30c34 tests: nvidia-nim: Use sealed secrets for NGC_API_KEY
Convert the NGC_API_KEY from a regular Kubernetes secret to a sealed
secret for the CC GPU tests. This ensures the API key is only accessible
within the confidential enclave after successful attestation.

The sealed secret uses the "vault" type which points to a resource stored
in the Key Broker Service (KBS). The Confidential Data Hub (CDH) inside
the guest will unseal this secret by fetching it from KBS after
attestation.

The initdata file is created AFTER create_tmp_policy_settings_dir()
copies the empty default file, and BEFORE auto_generate_policy() runs.
This allows genpolicy to add the generated policy.rego to our custom
CDH configuration.

The sealed secret format follows the CoCo specification:
sealed.<JWS header>.<JWS payload>.<signature>

Where the payload contains:
- version: "0.1.0"
- type: "vault" (pointer to KBS resource)
- provider: "kbs"
- resource_uri: KBS path to the actual secret

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-04 12:34:44 +01:00
Fabiano Fidêncio
c9061f9e36 tests: kata-deploy: Increase post-deployment wait time
Increase the sleep time after kata-deploy deployment from 10s to 60s
to give more time for runtimes to be configured. This helps avoid
race conditions on slower K8s distributions like k3s where the
RuntimeClass may not be immediately available after the DaemonSet
rollout completes.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-04 12:13:53 +01:00
Fabiano Fidêncio
0fb2c500fd tests: kata-deploy: Merge E2E tests to avoid timing issues
Merge the two E2E tests ("Custom RuntimeClass exists with correct
properties" and "Custom runtime can run a pod") into a single test, as
those 2 are very much dependent of each other.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-04 12:13:53 +01:00
Fabiano Fidêncio
fef93f1e08 tests: kata-deploy: Use die() instead of fail() for error handling
Replace fail() calls with die() which is already provided by
common.bash. The fail() function doesn't exist in the test
infrastructure, causing "command not found" errors when tests fail.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-02-04 12:13:53 +01:00
Dan Mihai
d7ff54769c tests: policy: remove the need for using sudo
Modify the copy of root user's settings file, instead of modifying the
original file.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-02-01 20:09:50 +01:00
Dan Mihai
4d860dcaf5 tests: policy: avoid redundant debug output
Avoid redundant and confusing teardown_common() debug output for
k8s-policy-pod.bats and k8s-policy-pvc.bats.

The Policy tests skip the Message field when printing information about
their pods, because unfortunately that field might contain a truncated
Policy log - for the test cases that intentiocally cause Policy
failures. The non-truncated Policy log is already available from other
"kubectl describe" fields.

So, avoid the redundant pod information from teardown_common(), that
also included the confusing Message field.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2026-02-01 20:09:50 +01:00
Steve Horsman
4d1095e653 Merge pull request #12350 from manuelh-dev/mahuber/term-grace-period
tests: Remove terminationGracePeriod in manifests
2026-01-29 15:17:17 +00:00
Fabiano Fidêncio
500146bfee versions: Bump Go to 1.24.12
Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:

- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-29 00:23:26 +01:00
Dan Mihai
20ca4d2d79 runtime: DEFDISABLEBLOCK := true
1. Add disable_block_device_use to CLH settings file, for parity with
   the already existing QEMU settings.

2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
   this change, Kata Guests will use by default virtio-fs to access
   container rootfs directories from their Hosts. Hosts that were
   designed to use Host block devices attached to the Guests can
   re-enable these rootfs block devices by changing the value of
   disable_block_device_use back to false in their settings files.

3. Add test using container image without any rootfs layers. Depending
   on the container runtime and image snapshotter being used, the empty
   container rootfs image might get stored on a host block device that
   cannot be safely hotplugged to a guest VM, because the host is using
   the same block device.

4. Add block device hotplug safety warning into the Kata Shim
   configuration files.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
2026-01-28 19:47:49 +01:00
Fabiano Fidêncio
d0fe60e784 tests: Fix empty string handling for helm
Fix empty string handling in format conversion

When HELM_ALLOWED_HYPERVISOR_ANNOTATIONS, HELM_AGENT_HTTPS_PROXY, or
HELM_AGENT_NO_PROXY are empty, the pattern matching condition
`!= *:*` or `!= *=*` evaluates to true, causing the conversion loop
to create invalid entries like "qemu-tdx: qemu-snp:".

Add -n checks to ensure conversion only runs when variables are
non-empty.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-26 20:50:01 +01:00
Fabiano Fidêncio
4b2d4e96ae tests: Add qemu-{tdx,snp}-runtime-rs to the list of tee shims
We missed doing this as part of
b5a986eacf.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-26 20:50:01 +01:00
Fabiano Fidêncio
26c534d610 tests: Use shims.disableAll in test helpers
Update the CI and functional test helpers to use the new
shims.disableAll option instead of iterating over every shim
to disable them individually.

Also adds helm repo for node-feature-discovery before building
dependencies to fix CI failures on some distributions.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-26 20:50:01 +01:00
Fabiano Fidêncio
d8a3272f85 kata-deploy: Add tests for custom runtimes Helm templates
Add Bats tests to verify the custom runtimes Helm template rendering,
and that the we can start a pod with the custom runtime.

Tests were written with Cursor's help.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-26 20:50:01 +01:00
Manuel Huber
6438fe7f2d tests: Remove terminationGracePeriod in manifests
Do not kill containers immediately, instead use Kubernetes'
default termination grace period.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-01-23 16:18:44 -08:00
Manuel Huber
0d35b36652 Revert "ci: Ensure the KBS resources are created"
This reverts commit c0d7222194.

Soon, guest components will switch to using a DB instead of
storing resources in the filesystem. Further, I don't see any
more indicators why kbs-client would struggle to set simple
resources.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-01-23 16:18:10 -08:00
Fabiano Fidêncio
5b82b160e2 runtime-rs: Add arm64 QEMU support
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.

Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
  proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported

These changes align runtime-rs with the Go runtime's arm64 QEMU support.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
2026-01-23 19:48:31 +01:00
Fabiano Fidêncio
ec18dd79ba tests: Simplify kata-deploy test to use helm directly
The kata-deploy test was using helm_helper which made it hard to debug
failures (die() calls would cause "Executed 0 tests" errors) and added
unnecessary complexity.

The test now calls helm directly like a user would, making it simpler
and more representative of real-world usage. The verification job status
is explicitly checked with proper failure detection instead of relying
on helm --wait.

Timeouts are configurable via environment variables to account for
different network speeds and image sizes:
- KATA_DEPLOY_TIMEOUT (default: 600s)
- KATA_DEPLOY_DAEMONSET_TIMEOUT (default: 300s)
- KATA_DEPLOY_VERIFICATION_TIMEOUT (default: 120s)

Documentation has been added to explain what each timeout controls and
how to customize them.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-21 20:14:33 +01:00
Fabiano Fidêncio
2369cf585d tests: Fix retry loop bugs in helm_helper
The retry loop in helm_helper had two bugs:
1. Counter initialized to 10 instead of 0, causing immediate failure
2. Exit condition used -eq instead of -ge, incorrect for loop logic

These bugs would cause helm_helper to fail immediately on the first
retry attempt instead of properly retrying up to max_tries times.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-21 20:14:33 +01:00
Fabiano Fidêncio
e0158869b1 tests: Add common bats test runner function
Add run_bats_tests() function to common.bash that provides consistent
test execution and reporting across all test suites (k8s, nvidia,
kata-deploy).

This removes duplicated test runner code from run_kubernetes_tests.sh,
run_kubernetes_nv_tests.sh, and run-kata-deploy-tests.sh.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-20 12:31:55 +01:00
Fabiano Fidêncio
b5a986eacf kata-deploy: Add runtime-rs TDX / SNP runtimeclasses
https://github.com/kata-containers/kata-containers/pull/11534 has been
merged and it added all the needed bits to deploy the QEMU SNP / TDX
runtime-rs variants, apart from the kata-deploy additions, which is done
by this PR.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-19 22:41:50 +01:00
Fabiano Fidêncio
c7570427d2 tests: Add report generation to NVIDIA tests
The NVIDIA GPU test runner script was not generating test reports,
causing the report_tests() function in gha-run.sh to have nothing
to display. This aligns the script with run_kubernetes_tests.sh by:

- Adding set -o pipefail for proper pipeline error handling
- Creating a reports directory with timestamped subdirectory
- Capturing test output to files with ok-/not_ok- prefixes
- Adding --timing flag to bats for timing information

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-19 18:21:43 +01:00
Fabiano Fidêncio
96e1fb4ca6 tools: Remove runk
The runk tool hasn't been supported for a few years, with no maintainers
since ManaSugi stopped being involved in the project and the CI was
disabled in 2024.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-01-19 14:43:53 +01:00