Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:
- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
1. Add disable_block_device_use to CLH settings file, for parity with
the already existing QEMU settings.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will use by default virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files.
3. Add test using container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add block device hotplug safety warning into the Kata Shim
configuration files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
Fix empty string handling in format conversion
When HELM_ALLOWED_HYPERVISOR_ANNOTATIONS, HELM_AGENT_HTTPS_PROXY, or
HELM_AGENT_NO_PROXY are empty, the pattern matching condition
`!= *:*` or `!= *=*` evaluates to true, causing the conversion loop
to create invalid entries like "qemu-tdx: qemu-snp:".
Add -n checks to ensure conversion only runs when variables are
non-empty.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the CI and functional test helpers to use the new
shims.disableAll option instead of iterating over every shim
to disable them individually.
Also adds helm repo for node-feature-discovery before building
dependencies to fix CI failures on some distributions.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add Bats tests to verify the custom runtimes Helm template rendering,
and that the we can start a pod with the custom runtime.
Tests were written with Cursor's help.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This reverts commit c0d7222194.
Soon, guest components will switch to using a DB instead of
storing resources in the filesystem. Further, I don't see any
more indicators why kbs-client would struggle to set simple
resources.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.
Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported
These changes align runtime-rs with the Go runtime's arm64 QEMU support.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
The kata-deploy test was using helm_helper which made it hard to debug
failures (die() calls would cause "Executed 0 tests" errors) and added
unnecessary complexity.
The test now calls helm directly like a user would, making it simpler
and more representative of real-world usage. The verification job status
is explicitly checked with proper failure detection instead of relying
on helm --wait.
Timeouts are configurable via environment variables to account for
different network speeds and image sizes:
- KATA_DEPLOY_TIMEOUT (default: 600s)
- KATA_DEPLOY_DAEMONSET_TIMEOUT (default: 300s)
- KATA_DEPLOY_VERIFICATION_TIMEOUT (default: 120s)
Documentation has been added to explain what each timeout controls and
how to customize them.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The retry loop in helm_helper had two bugs:
1. Counter initialized to 10 instead of 0, causing immediate failure
2. Exit condition used -eq instead of -ge, incorrect for loop logic
These bugs would cause helm_helper to fail immediately on the first
retry attempt instead of properly retrying up to max_tries times.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add run_bats_tests() function to common.bash that provides consistent
test execution and reporting across all test suites (k8s, nvidia,
kata-deploy).
This removes duplicated test runner code from run_kubernetes_tests.sh,
run_kubernetes_nv_tests.sh, and run-kata-deploy-tests.sh.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The NVIDIA GPU test runner script was not generating test reports,
causing the report_tests() function in gha-run.sh to have nothing
to display. This aligns the script with run_kubernetes_tests.sh by:
- Adding set -o pipefail for proper pipeline error handling
- Creating a reports directory with timestamped subdirectory
- Capturing test output to files with ok-/not_ok- prefixes
- Adding --timing flag to bats for timing information
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The runk tool hasn't been supported for a few years, with no maintainers
since ManaSugi stopped being involved in the project and the CI was
disabled in 2024.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enable post-install verification in kata-deploy CI tests. When
HELM_VERIFY_DEPLOYMENT is set, a simple verification pod is created
that runs with the Kata runtime to confirm deployment succeeded.
The verification pod prints kernel info and exits - success indicates
the Kata runtime is properly configured and functional.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
set_container_command() previously appended command arguments
one-by-one with
'.command += [...]'. This makes the helper non-idempotent and can
lead to unexpected command arrays when invoked multiple times.
Update the helper to set the full command array in a single yq v4
expression and print the target YAML path plus the command being
applied to simplify debugging when tests fail.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The pod config file created by new_pod_config() was generated via
mktemp using the template "pod-config.yaml.in.XXX", which produces
filenames that do not end with ".yaml" (e.g. pod-config.yaml.in.ABC).
If the random combination of special suffix with ".Csv" or ".Xml", etc.
the following operations with yq will fail.
Some helpers and tooling assume the config path ends with ".yaml".
Switch the mktemp template to place the random suffix before the
extension so the returned path always ends with ".yaml".
Fixes: #12268, #12319
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change the secure_storage_integrity option's default value to true.
With this, integrity protection for encrypted block device contents
will be requested from the confidential data hub by default, see the
agent's cdh_handler_trusted_storage function in rpc.rs.
This behavior can be disabled by explicitly setting the
agent.secure_storage_integrity parameter to 0 or false via kernel
command line parameters.
This will affect the trusted storage implementation for the guest-pull
mechanism, and it will affect future implementations using this code
path, such as implementations for ephemeral secure storage.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Create a new page for a reference implementation for Kubernetes
using QEMU, the go shim and an NVIDIA rootfs. The new page
contains information on:
- components involved in the NVIDIA (TEE) GPU scenario
- orchestration flow for GPU passthrough scenarios
- deployment guidance
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Till k8s 1.34 we could grep by "Started containerd". From k8s 1.35
onwards the event message changed and we should, instead, grep by
"Container started".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Since #12204 was merged, the following error has been observed:
```
bats warning: Executed 1 instead of expected 2 tests
[run_kubernetes_tests.sh:162] ERROR: Tests FAILED from suites: k8s-empty-dirs.bats
```
The cause is that `pod_logs_file` is re-declared as a local variable
in the second test before skipping, which makes it inaccessible
in `teardown()` and leads to an error.
This commit removes the re-declaration of the variable.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Changes in NIM/RAG samples:
- update image references
- update memory requirements, timeouts, model name
- sanitize some of the probes and print-out
Further refinements can be made in the future.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add two attestation tests. The first one sets a resource policy that
requires CPU0 to have an affirming trust level. This is a negative test
which can run on any platform. Setting this policy without setting any
reference values should result in an attestation failure.
Next, a second test will set the same policy, but this time it will use
the journal log to find the QEMU command line from the previous test and
calculate the expected reference values. Currently this is only
supported on SNP using the sev-snp-measure tool, but the same flow
should work on other platforms.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
It will address the issue:
"# bats warning: Executed 0 instead of expected 1 tests"
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As each case need such preparation of get_pod_config_dir,
a better method is directly move it into the setup_common method.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To measure the duration for journal, we need clearly print the journal
start time and end time for each case which helps to ensure the journal
log is for the specified period for the case.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Currently policy_settings_dir is created only when
BATS_TEST_NUMBER == "1",
but delete_tmp_policy_settings_dir "${policy_settings_dir}" is
called in teardown() for every test. This means that for tests
after the first one teardown() may attempt to delete a directory
that was already removed by a previous test, or rely on a value
that does not belong to the current test execution.
Adjust teardown logic so that policy_settings_dir is only deleted
for the first test case (BATS_TEST_NUMBER == "1") and ignored for
subsequent tests. This keeps the original optimization of running
genpolicy only once, while avoiding unnecessary or confusing cleanup
attempts in later test cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
the previous pod_name is set as local which can not be captured
within the teardown() function, causing failure.
This commit just remove the `local pod_name` to make it a global
variable.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Align with other test logic - declare the KATA_HYPERVISOR in the
run bash script, then declare the RUNTIME_CLASS_NAME variable in
the bats files.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Now that we have a more restrictive resource policy for KBS, let
us start adopting it across all NVIDIA test cases. This policy was
previously introduced by the NVIDIA attestation test.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With these changes, we create pod security policies when running
against NVIDIA TEE GPU handlers where AUTO_GENERATE_POLICY is set.
For the non-TEE GPU tests, the added functions bail out by design.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Following existing patterns, we adapt the common policy settings
for NVIDIA GPU CI platforms. For instance, for our CI runners, we
use containerd 2.x.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Enable auto-generate policy for qemu-nvidia-gpu-* if the user
didn't specify an AUTO_GENERATE_POLICY value.
Setting this in run_kubernetes_nv_tests.sh is too late as
gha-run.sh calls into run_tests, setup.sh, and then into
create_common_genpolicy_settings() where the rules.rego and
genpolicy-settings file are being copied to the right locations.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Set the attestation policy for GPU0 to affirming. This requires
the GPU, for instance, to have production properties, such as
properly signed VBIOS firmware.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>