After the move to Linux 6.17 and QEMU 10.2 from Kata,
k8s-sandbox-vcpus-allocation.bats started failing on TDX.
2026-02-10T16:39:39.1305813Z # pod/vcpus-less-than-one-with-no-limits created
2026-02-10T16:39:39.1306474Z # pod/vcpus-less-than-one-with-limits created
2026-02-10T16:39:39.1307090Z # pod/vcpus-more-than-one-with-limits created
2026-02-10T16:39:39.1307672Z # pod/vcpus-less-than-one-with-limits condition met
2026-02-10T16:39:39.1308373Z # timed out waiting for the condition on pods/vcpus-less-than-one-with-no-limits
2026-02-10T16:39:39.1309132Z # timed out waiting for the condition on pods/vcpus-more-than-one-with-limits
2026-02-10T16:39:39.1310370Z # Error from server (BadRequest): container "vcpus-less-than-one-with-no-limits" in pod "vcpus-less-than-one-with-no-limits" is waiting to start: ContainerCreating
A manual test without agent policies added it seems to work OK but disable
the test for now to get CI stable.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
- Trim trailing whitespace and ensure final newline in non-vendor files
- Add .editorconfig-checker.json excluding vendor dirs, *.patch, *.img,
*.dtb, *.drawio, *.svg, and pkg/cloud-hypervisor/client so CI only
checks project code
- Leave generated and binary assets unchanged (excluded from checker)
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Update the enable_nvrc_trace() function to use the new drop-in
configuration mechanism instead of directly modifying the base
configuration file. The function now creates a 90-nvrc-trace.toml
drop-in file that properly combines existing kernel parameters
with the nvrc.log=trace setting.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This adds a basic configuration for editorconfig checker. The
supplied configuration checks against trailing whitespaces and
issues with newlines.
Example:
| tools/packaging/kernel/configs/fragments/x86_64/numa.conf:
| Wrong line endings or no final newline
| tools/packaging/release/generate_vendor.sh:
| 44: Trailing whitespace
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This annotation was required for GPU cold-plug before using a
newer device plugin and before querying the pod resources API.
As this annotation is no longer required, cleaning it up.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
It will do following works in this commit:
(1) Rename pod_exec_with_retries() to pod_exec().
(2) Update implementation to call container_exec().
(3) Replace all usages of pod_exec_with_retries across tests
with pod_exec.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit aims to drop retries when kubectl exec a container:
(1) Rename container_exec_with_retries() to container_exec().
(2) Remove the retry loop and sleep backoff around kubectl exec.
Keep the same logging and container-selection logic and return
kubectl exec exit status directly.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Read the kernel_verity_paramers from the shim config and adjust
the root hash for the negative test.
Further, improve some of the test logic by using shared
functions. This especially ensures we don't read the full
journalctl logs on a node but only the portion of the logs we are
actually supposed to look at.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With dm-mod.create parameters using quotes, we remove the
backslashes used to escape these quotes from the output we
retrieve. This will enable attestation tests to work with the
kernelinit dm-verity mode.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Allow using an image instead of an initrd. For confidential
guests using images, the assumption is that the guest kernel uses
dm-verity protection, implicitly measuring the rootfs image via
the kernel command line's dm-verity information.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Convert the NGC_API_KEY from a regular Kubernetes secret to a sealed
secret for the CC GPU tests. This ensures the API key is only accessible
within the confidential enclave after successful attestation.
The sealed secret uses the "vault" type which points to a resource stored
in the Key Broker Service (KBS). The Confidential Data Hub (CDH) inside
the guest will unseal this secret by fetching it from KBS after
attestation.
The initdata file is created AFTER create_tmp_policy_settings_dir()
copies the empty default file, and BEFORE auto_generate_policy() runs.
This allows genpolicy to add the generated policy.rego to our custom
CDH configuration.
The sealed secret format follows the CoCo specification:
sealed.<JWS header>.<JWS payload>.<signature>
Where the payload contains:
- version: "0.1.0"
- type: "vault" (pointer to KBS resource)
- provider: "kbs"
- resource_uri: KBS path to the actual secret
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Increase the sleep time after kata-deploy deployment from 10s to 60s
to give more time for runtimes to be configured. This helps avoid
race conditions on slower K8s distributions like k3s where the
RuntimeClass may not be immediately available after the DaemonSet
rollout completes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Merge the two E2E tests ("Custom RuntimeClass exists with correct
properties" and "Custom runtime can run a pod") into a single test, as
those 2 are very much dependent of each other.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Replace fail() calls with die() which is already provided by
common.bash. The fail() function doesn't exist in the test
infrastructure, causing "command not found" errors when tests fail.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Avoid redundant and confusing teardown_common() debug output for
k8s-policy-pod.bats and k8s-policy-pvc.bats.
The Policy tests skip the Message field when printing information about
their pods, because unfortunately that field might contain a truncated
Policy log - for the test cases that intentiocally cause Policy
failures. The non-truncated Policy log is already available from other
"kubectl describe" fields.
So, avoid the redundant pod information from teardown_common(), that
also included the confusing Message field.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:
- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
1. Add disable_block_device_use to CLH settings file, for parity with
the already existing QEMU settings.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will use by default virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files.
3. Add test using container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add block device hotplug safety warning into the Kata Shim
configuration files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
Fix empty string handling in format conversion
When HELM_ALLOWED_HYPERVISOR_ANNOTATIONS, HELM_AGENT_HTTPS_PROXY, or
HELM_AGENT_NO_PROXY are empty, the pattern matching condition
`!= *:*` or `!= *=*` evaluates to true, causing the conversion loop
to create invalid entries like "qemu-tdx: qemu-snp:".
Add -n checks to ensure conversion only runs when variables are
non-empty.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the CI and functional test helpers to use the new
shims.disableAll option instead of iterating over every shim
to disable them individually.
Also adds helm repo for node-feature-discovery before building
dependencies to fix CI failures on some distributions.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add Bats tests to verify the custom runtimes Helm template rendering,
and that the we can start a pod with the custom runtime.
Tests were written with Cursor's help.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This reverts commit c0d7222194.
Soon, guest components will switch to using a DB instead of
storing resources in the filesystem. Further, I don't see any
more indicators why kbs-client would struggle to set simple
resources.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.
Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported
These changes align runtime-rs with the Go runtime's arm64 QEMU support.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
The kata-deploy test was using helm_helper which made it hard to debug
failures (die() calls would cause "Executed 0 tests" errors) and added
unnecessary complexity.
The test now calls helm directly like a user would, making it simpler
and more representative of real-world usage. The verification job status
is explicitly checked with proper failure detection instead of relying
on helm --wait.
Timeouts are configurable via environment variables to account for
different network speeds and image sizes:
- KATA_DEPLOY_TIMEOUT (default: 600s)
- KATA_DEPLOY_DAEMONSET_TIMEOUT (default: 300s)
- KATA_DEPLOY_VERIFICATION_TIMEOUT (default: 120s)
Documentation has been added to explain what each timeout controls and
how to customize them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The retry loop in helm_helper had two bugs:
1. Counter initialized to 10 instead of 0, causing immediate failure
2. Exit condition used -eq instead of -ge, incorrect for loop logic
These bugs would cause helm_helper to fail immediately on the first
retry attempt instead of properly retrying up to max_tries times.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add run_bats_tests() function to common.bash that provides consistent
test execution and reporting across all test suites (k8s, nvidia,
kata-deploy).
This removes duplicated test runner code from run_kubernetes_tests.sh,
run_kubernetes_nv_tests.sh, and run-kata-deploy-tests.sh.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The NVIDIA GPU test runner script was not generating test reports,
causing the report_tests() function in gha-run.sh to have nothing
to display. This aligns the script with run_kubernetes_tests.sh by:
- Adding set -o pipefail for proper pipeline error handling
- Creating a reports directory with timestamped subdirectory
- Capturing test output to files with ok-/not_ok- prefixes
- Adding --timing flag to bats for timing information
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The runk tool hasn't been supported for a few years, with no maintainers
since ManaSugi stopped being involved in the project and the CI was
disabled in 2024.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enable post-install verification in kata-deploy CI tests. When
HELM_VERIFY_DEPLOYMENT is set, a simple verification pod is created
that runs with the Kata runtime to confirm deployment succeeded.
The verification pod prints kernel info and exits - success indicates
the Kata runtime is properly configured and functional.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
set_container_command() previously appended command arguments
one-by-one with
'.command += [...]'. This makes the helper non-idempotent and can
lead to unexpected command arrays when invoked multiple times.
Update the helper to set the full command array in a single yq v4
expression and print the target YAML path plus the command being
applied to simplify debugging when tests fail.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The pod config file created by new_pod_config() was generated via
mktemp using the template "pod-config.yaml.in.XXX", which produces
filenames that do not end with ".yaml" (e.g. pod-config.yaml.in.ABC).
If the random combination of special suffix with ".Csv" or ".Xml", etc.
the following operations with yq will fail.
Some helpers and tooling assume the config path ends with ".yaml".
Switch the mktemp template to place the random suffix before the
extension so the returned path always ends with ".yaml".
Fixes: #12268, #12319
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change the secure_storage_integrity option's default value to true.
With this, integrity protection for encrypted block device contents
will be requested from the confidential data hub by default, see the
agent's cdh_handler_trusted_storage function in rpc.rs.
This behavior can be disabled by explicitly setting the
agent.secure_storage_integrity parameter to 0 or false via kernel
command line parameters.
This will affect the trusted storage implementation for the guest-pull
mechanism, and it will affect future implementations using this code
path, such as implementations for ephemeral secure storage.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Create a new page for a reference implementation for Kubernetes
using QEMU, the go shim and an NVIDIA rootfs. The new page
contains information on:
- components involved in the NVIDIA (TEE) GPU scenario
- orchestration flow for GPU passthrough scenarios
- deployment guidance
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Till k8s 1.34 we could grep by "Started containerd". From k8s 1.35
onwards the event message changed and we should, instead, grep by
"Container started".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Since #12204 was merged, the following error has been observed:
```
bats warning: Executed 1 instead of expected 2 tests
[run_kubernetes_tests.sh:162] ERROR: Tests FAILED from suites: k8s-empty-dirs.bats
```
The cause is that `pod_logs_file` is re-declared as a local variable
in the second test before skipping, which makes it inaccessible
in `teardown()` and leads to an error.
This commit removes the re-declaration of the variable.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Changes in NIM/RAG samples:
- update image references
- update memory requirements, timeouts, model name
- sanitize some of the probes and print-out
Further refinements can be made in the future.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add two attestation tests. The first one sets a resource policy that
requires CPU0 to have an affirming trust level. This is a negative test
which can run on any platform. Setting this policy without setting any
reference values should result in an attestation failure.
Next, a second test will set the same policy, but this time it will use
the journal log to find the QEMU command line from the previous test and
calculate the expected reference values. Currently this is only
supported on SNP using the sev-snp-measure tool, but the same flow
should work on other platforms.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
It will address the issue:
"# bats warning: Executed 0 instead of expected 1 tests"
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>