Introduce a new function to install additional packages into the
devkit flavor. With modprobe, we avoid errors on pod startup
related to loading nvidia kernel modules in the NVRC phase.
Note, the production flavor gets modprobe from busybox, see its
configuration file containing CONFIG_MODPROBE=y.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Remove the initramfs folder, its build steps, and use the kernel
based dm-verity enforcement for the handlers which used the
initramfs mode. Also, remove the initramfs verity mode
capability from the shims and their configs.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Read the kernel_verity_params from the shim config and adjust
the root hash for the negative test.
Further, improve some of the test logic by using shared
functions. This especially ensures we don't read the full
journalctl logs on a node but only the portion of the logs we are
actually supposed to look at.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Similar to the kernel_params annotation, add a
kernel_verity_params annotation and add logic to make these
parameters overwritable. For instance, this can be used in test
logic to provide bogus dm-verity hashes for negative tests.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This change introduces the kernel_verity_parameters knob to the
rust based shim, picking up dm-verity information in a new config
field (the corresponding build variable is already produced by
the shim build). The change extends the shim to parse dm-verity
information from this parameter and to construct the kernel command
line appropriately, based on the indicated initramfs or kernelinit
build variant.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This change introduces the kernel_verity_parameters knob to the
Go based shim, picking up dm-verity information in a new config
field (the corresponding build variable is already produced by
the shim build). The change extends the shim to parse dm-verity
information from this parameter and to construct the kernel command
line appropriately, based on the indicated initramfs or kernelinit
build variant.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With dm-mod.create parameters using quotes, we remove the
backslashes used to escape these quotes from the output we
retrieve. This will enable attestation tests to work with the
kernelinit dm-verity mode.
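A minimal sketch of that unescaping step, in Python for brevity. The
helper name and the sample string are hypothetical; the actual test
code may perform the substitution differently.

```python
def strip_quote_escapes(text: str) -> str:
    """Replace every backslash-escaped double quote with a plain one."""
    return text.replace('\\"', '"')


# Hypothetical escaped fragment, not real dm-verity parameters.
sample = 'dm-mod.create=\\"example\\"'
unescaped = strip_quote_escapes(sample)
```

Text without escaped quotes passes through unchanged, so the helper is
safe to apply to any retrieved command line.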
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Measured rootfs mode and CDH secure storage feature require the
cryptsetup-bin and e2fsprogs components in the guest.
This change makes this more explicit: confidential guests are
users of the CDH secure container image layer storage feature.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This change introduces the kernelinit dm-verity mode, allowing
initramfs-less dm-verity enforcement against the rootfs image.
For this, the change introduces a new variable with dm-verity
information. This variable will be picked up by shim
configurations in subsequent commits.
This will allow the shims to build the kernel command line
with dm-verity information based on the existing
kernel_parameters configuration knob and a new
kernel_verity_params configuration knob. The latter
specifically provides the relevant dm-verity information.
This new configuration knob avoids merging the verity
parameters into the kernel_params field. As a result, no
cumbersome escape logic is required: we do not need to pass the
dm-mod.create="..." parameter directly in the kernel_parameters,
but only the relevant dm-verity parameters in a semi-structured
manner (see above). The only place where the final command line is
assembled is in the shims. Further, this is a line that is easy for
developers (or CI tasks) to comment out in order to disable
dm-verity enforcement.
This change produces the new kernelinit dm-verity parameters for
the NVIDIA runtime handlers, and modifies the format of how
these parameters are prepared for all handlers. With this, the
parameters are currently no longer provided to the
kernel_params configuration knob for any runtime handler.
This change alone should thus not be used as dm-verity
information will no longer be picked up by the shims.
systemd-analyze on the coco-dev handler shows that using the
kernelinit mode on a local machine, less time is spent in the
kernel phase, slightly speeding up pod start-up. On that machine,
the average of 172.5ms was reduced to 141ms (4 measurements, each
with a basic pod manifest), i.e., the kernel phase duration is
improved by about 18 percent.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This reverts commit 923f97bc66 in
order to reinstate the logic from commit
e4a13b9a4a.
The latter commit was previously reverted due to the NVIDIA GPU TEE
handler using an initrd, not an image.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Shift NVIDIA shim configurations to use an image instead of an initrd,
and remove trailing whitespaces from the configs.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Allow using an image instead of an initrd. For confidential
guests using images, the assumption is that the guest kernel uses
dm-verity protection, implicitly measuring the rootfs image via
the kernel command line's dm-verity information.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Convert the NGC_API_KEY from a regular Kubernetes secret to a sealed
secret for the CC GPU tests. This ensures the API key is only accessible
within the confidential enclave after successful attestation.
The sealed secret uses the "vault" type which points to a resource stored
in the Key Broker Service (KBS). The Confidential Data Hub (CDH) inside
the guest will unseal this secret by fetching it from KBS after
attestation.
The initdata file is created AFTER create_tmp_policy_settings_dir()
copies the empty default file, and BEFORE auto_generate_policy() runs.
This allows genpolicy to add the generated policy.rego to our custom
CDH configuration.
The sealed secret format follows the CoCo specification:
sealed.<JWS header>.<JWS payload>.<signature>
Where the payload contains:
- version: "0.1.0"
- type: "vault" (pointer to KBS resource)
- provider: "kbs"
- resource_uri: KBS path to the actual secret
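As a rough illustration of the format above, the following Python
sketch assembles such a string. It is an assumption-laden example, not
the test code: the header and signature are unsigned placeholders, the
helper names and the sample URI are hypothetical, and a real payload
may need more fields than the four listed above.

```python
import base64
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, the encoding JWS uses."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_sealed_secret(resource_uri: str) -> str:
    """Build sealed.<header>.<payload>.<signature> with placeholder parts."""
    payload = {
        "version": "0.1.0",
        "type": "vault",       # pointer to a KBS resource
        "provider": "kbs",
        "resource_uri": resource_uri,
    }
    header = b64url(b"{}")            # placeholder, nothing is signed here
    body = b64url(json.dumps(payload).encode("utf-8"))
    signature = "fakesignature"       # placeholder, assumption
    return f"sealed.{header}.{body}.{signature}"


# Hypothetical KBS path, for illustration only.
secret = make_sealed_secret("kbs:///default/example/key")
```

Base64url output never contains a dot, so splitting the result on "."
always yields exactly four fields.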
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Increase the sleep time after kata-deploy deployment from 10s to 60s
to give more time for runtimes to be configured. This helps avoid
race conditions on slower K8s distributions like k3s where the
RuntimeClass may not be immediately available after the DaemonSet
rollout completes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Merge the two E2E tests ("Custom RuntimeClass exists with correct
properties" and "Custom runtime can run a pod") into a single test, as
those two depend heavily on each other.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Replace fail() calls with die() which is already provided by
common.bash. The fail() function doesn't exist in the test
infrastructure, causing "command not found" errors when tests fail.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We cannot overwrite a binary that's currently in use, which is why
elsewhere we first remove / unlink the binary (the running process
keeps its file descriptor, so doing that is safe) and only then
copy the new binary. However, we missed doing this for the
nydus-snapshotter deployment.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Clean up trailing whitespaces, making life easier for those who
have configured their IDE to clean these up.
Please avoid adding new code with trailing whitespaces.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add support for CRI-O annotations when fetching pod identifiers for
device cold plug. The code now checks containerd CRI annotations first,
then falls back to CRI-O annotations if they are empty.
This enables device cold plug to work with both containerd and CRI-O
container runtimes.
Annotations supported:
- containerd: io.kubernetes.cri.sandbox-name, io.kubernetes.cri.sandbox-namespace
- CRI-O: io.kubernetes.cri-o.KubeName, io.kubernetes.cri-o.Namespace
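The lookup order can be sketched as follows, in Python for brevity.
The function name is hypothetical, and falling back per field (rather
than for both fields at once) is an assumption of this sketch, not
something the change above spells out.

```python
def pod_identifiers(annotations: dict) -> tuple:
    """Return (pod name, namespace), preferring containerd annotations."""
    name = annotations.get("io.kubernetes.cri.sandbox-name", "")
    namespace = annotations.get("io.kubernetes.cri.sandbox-namespace", "")
    # Fall back to the CRI-O annotations when the containerd ones are empty.
    if not name:
        name = annotations.get("io.kubernetes.cri-o.KubeName", "")
    if not namespace:
        namespace = annotations.get("io.kubernetes.cri-o.Namespace", "")
    return name, namespace
```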
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Clean up existing nydus-snapshotter state to ensure fresh start with new
version.
This is safe across all K8s distributions (k3s, rke2, k0s, microk8s,
etc.) because we only touch the nydus data directory, not containerd's
internals.
When containerd tries to use non-existent snapshots, it will
re-pull/re-unpack.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
As we have moved to use QEMU (and OVMF already earlier) from
kata-deploy, the custom tdx configurations and distro checks
are no longer needed.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Currently, a working TDX setup expects users to install special
TDX support builds from Canonical/CentOS virt-sig for TDX to
work. kata-deploy configured TDX runtime handler to use QEMU
from the distro's paths.
With TDX support now being available in upstream Linux and
Ubuntu 24.04 having an install candidate (linux-image-generic-6.17)
for a new enough kernel, move TDX configuration to use QEMU from
kata-deploy.
While this is the new default, going back to the original
setup is possible by making manual changes to TDX runtime handlers.
Note: runtime-rs is already using QEMUPATH for TDX.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Allow the updateStrategy to be configured for the kata-deploy helm
chart, enabling administrators to control the aggressiveness of
updates. For a less aggressive approach, the strategy can be set to
`OnDelete`. Alternatively, the update process can be made more
aggressive by adjusting the `maxUnavailable` parameter.
Signed-off-by: Nikolaj Lindberg Lerche <nlle@ambu.com>
Avoid redundant and confusing teardown_common() debug output for
k8s-policy-pod.bats and k8s-policy-pvc.bats.
The Policy tests skip the Message field when printing information about
their pods, because unfortunately that field might contain a truncated
Policy log - for the test cases that intentionally cause Policy
failures. The non-truncated Policy log is already available from other
"kubectl describe" fields.
So, avoid the redundant pod information from teardown_common(), that
also included the confusing Message field.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Delete the pause_bundle directory before running the umoci unpack
operation. This will make builds idempotent and not fail with
errors like "create runtime bundle: config.json already exists in
.../build/pause-image/destdir/pause_bundle". This will make life
better when building locally.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Update Go from 1.24.11 to 1.24.12 to address security vulnerabilities
in the standard library:
- GO-2026-4342: Excessive CPU consumption in archive/zip
- GO-2026-4341: Memory exhaustion in net/url query parsing
- GO-2026-4340: TLS handshake encryption level issue in crypto/tls
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
1. Add disable_block_device_use to CLH settings file, for parity with
the already existing QEMU settings.
2. Set DEFDISABLEBLOCK := true by default for both QEMU and CLH. After
this change, Kata Guests will use by default virtio-fs to access
container rootfs directories from their Hosts. Hosts that were
designed to use Host block devices attached to the Guests can
re-enable these rootfs block devices by changing the value of
disable_block_device_use back to false in their settings files.
3. Add test using container image without any rootfs layers. Depending
on the container runtime and image snapshotter being used, the empty
container rootfs image might get stored on a host block device that
cannot be safely hotplugged to a guest VM, because the host is using
the same block device.
4. Add block device hotplug safety warning into the Kata Shim
configuration files.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Cameron McDermott <cameron@northflank.com>
Remove the initrd function and add the image function to align
with the actually existing functions in this file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Confidential guests cannot use traditional IOMMU Group based VFIO.
Instead, they need to use IOMMUFD. This is mainly because the group
abstraction is incompatible with a confidential device model.
If traditional VFIO is specified for a confidential guest, detect
the error and bail out early.
Fixes: #12393
Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
In CI we are testing the latest kata-deploy, which requires the latest
helm chart. The previous query doesn't work anymore, but these days we
should be able to rely on the "0.0.0-dev" tag and on helm to print the
to-be-installed version into console.
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
I keep struggling to find the debug images, so let's include them in
the peer-pods-azure.sh script so people can find them more easily.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
This comment was first introduced in e111093 together with
secure_join(), but we forgot to remove it when we switched to the
safe-path lib in c0ceaf6.
Signed-off-by: Qingyuan Hou <lenohou@gmail.com>
We want to enable local and remote CUDA repository builds.
Move the CUDA and tools repos to versions.yaml, with a
unified build for both types.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Fix empty string handling in format conversion
When HELM_ALLOWED_HYPERVISOR_ANNOTATIONS, HELM_AGENT_HTTPS_PROXY, or
HELM_AGENT_NO_PROXY are empty, the pattern matching condition
`!= *:*` or `!= *=*` evaluates to true, causing the conversion loop
to create invalid entries like "qemu-tdx: qemu-snp:".
Add -n checks to ensure conversion only runs when variables are
non-empty.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the CI and functional test helpers to use the new
shims.disableAll option instead of iterating over every shim
to disable them individually.
Also add the helm repo for node-feature-discovery before building
dependencies to fix CI failures on some distributions.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Update the Helm chart README to document the new shims.disableAll
option and simplify the examples that previously required listing
every shim to disable.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Simplify the example values files by using the new shims.disableAll
option instead of listing every shim to disable.
Before (try-kata-nvidia-gpu.values.yaml):
shims:
clh:
enabled: false
cloud-hypervisor:
enabled: false
# ... 15 more lines ...
After:
shims:
disableAll: true
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add a new `shims.disableAll` option that disables all standard shims
at once. This is useful when:
- Enabling only specific shims without listing every other shim
- Using custom runtimes only mode (no standard Kata shims)
Usage:
shims:
disableAll: true
qemu:
enabled: true # Only qemu is enabled
All helper templates are updated to check for this flag before
iterating over shims.
One thing that's super important to note here is that helm recursively
merges user values with chart defaults, making a simple
`disableAll` flag problematic: if defaults have `enabled: true`, user's
`disableAll: true` gets merged with those defaults, resulting in all
shims still being enabled.
The workaround found is to use null (`~`) as the default for `enabled`
field. The template logic interprets null differently based on
disableAll:
| enabled value | disableAll: false | disableAll: true |
|---------------|-------------------|------------------|
| ~ (null) | Enabled | Disabled |
| true | Enabled | Enabled |
| false | Disabled | Disabled |
This is backward compatible:
- Default behavior unchanged: all shims enabled when disableAll: false
- Users can set `disableAll: true` to disable all, then explicitly
enable specific shims with `enabled: true`
- Explicit `enabled: false` always disables, regardless of disableAll
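The table above boils down to a small decision function. The Python
sketch below is only an illustration of that logic (the chart itself
implements it in Helm templates, and the function name is
hypothetical); None stands in for the YAML null (`~`).

```python
from typing import Optional


def shim_enabled(enabled: Optional[bool], disable_all: bool) -> bool:
    """Resolve a shim's effective state from `enabled` and `disableAll`."""
    if enabled is None:
        # Null default: follow disableAll.
        return not disable_all
    # An explicit true/false always wins over disableAll.
    return enabled
```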
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add Bats tests to verify the custom runtimes Helm template rendering,
and that we can start a pod with the custom runtime.
Tests were written with Cursor's help.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add functions to install and remove custom runtime configuration files.
Each custom runtime gets an isolated directory structure:
custom-runtimes/{handler}/
configuration-{baseConfig}.toml # Copied from base config
config.d/
50-overrides.toml # User's drop-in overrides
The base config is copied AFTER kata-deploy has applied its modifications
(debug settings, proxy configuration, annotations), so custom runtimes
inherit these settings.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add functions to configure custom runtimes in containerd and CRI-O.
Custom runtimes use an isolated config directory under:
custom-runtimes/{handler}/
Custom runtimes automatically derive the shim binary path from the
baseConfig field using the existing is_rust_shim() logic.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add support for parsing custom runtime configurations from a mounted
ConfigMap. This allows users to define their own RuntimeClasses with
custom Kata configurations.
The ConfigMap format uses a custom-runtimes.list file with entries:
handler:baseConfig:containerd_snapshotter:crio_pulltype
Drop-in files are read from dropin-{handler}.toml, if present.
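A minimal parsing sketch for this list format, in Python for brevity.
It assumes exactly four colon-separated fields per entry and that
blank lines are skipped; the names and the sample entry in the usage
note are hypothetical, and the real parser may be more lenient.

```python
from typing import List, NamedTuple


class CustomRuntime(NamedTuple):
    handler: str
    base_config: str
    containerd_snapshotter: str
    crio_pulltype: str


def parse_custom_runtimes(text: str) -> List[CustomRuntime]:
    """Parse handler:baseConfig:containerd_snapshotter:crio_pulltype lines."""
    runtimes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split(":")
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        runtimes.append(CustomRuntime(*fields))
    return runtimes


def dropin_filename(handler: str) -> str:
    """Name of the optional drop-in file for a handler."""
    return f"dropin-{handler}.toml"
```

For example, a made-up entry `my-kata:qemu:nydus:guest-pull` would
yield handler `my-kata` and look for `dropin-my-kata.toml`.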
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's extract the common logic from configure_containerd_runtime and
configure_crio_runtime into reusable helper functions. This reduces
code duplication and prepares for adding custom runtime support.
For containerd:
- Add ContainerdRuntimeParams struct to encapsulate common parameters
- Add get_containerd_pluginid() to extract version detection logic
- Add get_containerd_output_path() to extract file path resolution
- Add write_containerd_runtime_config() to write common TOML values
For CRI-O:
- Add CrioRuntimeParams struct to encapsulate common parameters
- Add write_crio_runtime_config() to write common configuration
While here, let's also simplify pod_annotations to always use
"[\"io.katacontainers.*\"]" for all runtimes, as the NVIDIA specific
case has been removed from the shell script, but we forgot to do so
here.
No functional changes intended.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add -info flag handling to containerd-shim-kata-v2 (Rust version).
This outputs RuntimeInfo protobuf (name, version, revision) to stdout,
providing compatibility with containerd v2.0+ which queries runtime
information via this flag.
This is the runtime-rs counterpart to the Go implementation.
Fixes: #12133
Signed-off-by: tak-ka3 <takumi.hiraoka@acompany-ac.com>
Make QMP initialization more robust by retrying the QMP handshake with
a global deadline, so that slow QEMU bring-up is handled.
Qmp::new() used DEFAULT_QMP_READ_TIMEOUT as the effective deadline
for the QMP handshake read. When QEMU initialization is slow (e.g.
heavy host load, large memory/device init, slow storage, confidential
guests, etc.), the QMP greeting may not become readable within a small
per-read timeout (e.g. 250ms). This caused QMP init to fail with
"Resource temporarily unavailable (os error 11)" and spam
"couldn't initialise QMP", while subsequent retries might eventually
succeed once QEMU became ready.
To address this issue, keep a short per-read timeout to avoid
indefinite blocking, but add a global "wait for QMP ready" deadline
that retries the handshake with a small backoff. This improves startup
reliability under load and avoids unnecessary reconnect failures.
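The retry pattern can be sketched as follows, in Python for brevity.
This is not the runtime-rs code: the function name, the use of
TimeoutError to model an expired per-read timeout, and the backoff
value are all assumptions made for illustration.

```python
import time


def wait_until_ready(attempt, deadline_s, backoff_s=0.05,
                     clock=time.monotonic, sleep=time.sleep):
    """Retry attempt() until it succeeds or the global deadline passes.

    attempt() stands for one handshake try with its own short timeout;
    it returns a value on success and raises TimeoutError otherwise.
    """
    deadline = clock() + deadline_s
    while True:
        try:
            return attempt()
        except TimeoutError:
            if clock() >= deadline:
                raise  # global deadline exceeded, give up
            sleep(backoff_s)  # small pause before the next try
```

Keeping the per-try timeout short bounds each blocking read, while the
outer deadline decides how long slow bring-up is tolerated overall.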
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
HashMap does not guarantee iteration order, so the generated command
line can change from one run to the next. Change the key-value store
of get_agent_kernel_params to a BTreeMap so that the command line
stays stable.
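The effect of an ordered map can be sketched in Python, where sorting
the items plays the role that BTreeMap's key-ordered iteration plays
in Rust. The function name and the sample keys are hypothetical.

```python
def build_kernel_params(params: dict) -> str:
    """Join key=value pairs in key order so the output is deterministic."""
    return " ".join(f"{key}={value}" for key, value in sorted(params.items()))


# The same pairs inserted in different orders give the same string.
first = build_kernel_params({"b": "2", "a": "1"})
second = build_kernel_params({"a": "1", "b": "2"})
```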
Fixes: #10977
Signed-off-by: Hui Zhu <teawater@antgroup.com>
Address the following issue:
"run_io_copy[Stdout]: failed to copy stream: Not a socket (os error 88)"
The `Not a socket (os error 88)` error was caused by incorrectly wrapping
a FIFO file descriptor in a `UnixStream`. The changes are as follows:
(1) Refactor `open_fifo_write` to return `tokio::fs::File` (or a generic
async reader/writer) instead of `AsyncUnixStream`.
(2) Ensure IO copying logic treats stdout/stderr streams as file-like
objects rather than sockets.
This fix eliminates the "failed to copy stream" errors in the IO loop
and ensures reliable log forwarding for legacy-io.
Fixes: #12387
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Move the private closure out and make it a public method that clears
O_NONBLOCK on an fd, switching it to blocking mode.
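For illustration, clearing O_NONBLOCK on a file descriptor looks
roughly like this in Python on a Unix-like system. The function name
is hypothetical and this is not the runtime-rs implementation.

```python
import fcntl
import os


def set_blocking_mode(fd: int) -> None:
    """Clear O_NONBLOCK on fd, leaving the other status flags untouched."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
```

Reading the flags first and masking out a single bit avoids
accidentally resetting unrelated flags such as O_APPEND.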
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This reverts commit c0d7222194.
Soon, guest components will switch to using a DB instead of
storing resources in the filesystem. Further, I don't see any
more indicators why kbs-client would struggle to set simple
resources.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add the necessary configuration and code changes to support QEMU
on arm64 architecture in runtime-rs.
Changes:
- Set MACHINETYPE to "virt" for arm64
- Add machine accelerators "usb=off,gic-version=host" required for
proper arm64 virtualization
- Add arm64-specific kernel parameter "iommu.passthrough=0"
- Guard vIOMMU (Intel IOMMU) to skip on arm64 since it's not supported
These changes align runtime-rs with the Go runtime's arm64 QEMU support.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
Add support for the -info flag that containerd v2.0+ passes to shims.
The flag outputs RuntimeInfo protobuf to stdout containing the shim
name and version information.
Fixes: #12133
Signed-off-by: tak-ka3 <takumi.hiraoka@acompany-ac.com>
The enable_debug parameter was explicitly set to false rather than
being commented out (e.g., # enable_debug = true). As the previous
enabling method failed to account for this explicit setting, it was
rendered invalid. This commit updates the matching logic to correctly
handle and toggle the explicit false value.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
It was observed that some kata-deploy cleanup steps could hang,
causing the workflow to never finish properly. In these cases,
a QEMU process was not cleaned up and kept printing debug logs
to the journal. Over time, this maxed out the runner’s disk
usage and caused the runner service to stop.
Set timeouts for the relevant cleanup steps to avoid this.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The verification job mounts a ConfigMap containing the pod spec for
the Kata runtime test. Previously, both the ConfigMap and the Job were
Helm hooks with different weights (-5 and 0 respectively).
On k3s, a race condition was observed where the Job pod would be
scheduled before the kubelet's informer cache had registered the
ConfigMap, causing a FailedMount error:
MountVolume.SetUp failed for volume "pod-spec": object
"kube-system"/"kata-deploy-verification-spec" not registered
This happened because k3s's lightweight architecture schedules pods
very quickly, and the hook weight difference only controls Helm's
ordering, not actual timing between resource creation and cache sync.
By making the ConfigMap a regular chart resource (removing hook
annotations), it is created during the main chart installation phase,
well before any post-install hooks run. This guarantees the ConfigMap
is fully propagated to all kubelets before the verification Job starts.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job needs to list nodes to check for the
katacontainers.io/kata-runtime label and list events to detect
FailedCreatePodSandBox errors during pod creation.
This was discovered when testing with k0s, where the service account
lacked the required cluster-scope permissions to list nodes.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove k0s-worker and k0s-controller from
RUNTIMES_WITHOUT_CONTAINERD_DROP_IN_SUPPORT and always return true for
k0s in is_containerd_capable_of_using_drop_in_files since k0s auto-loads
from containerd.d/ directory regardless of containerd version.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add microk8s case to get_containerd_paths() method and remove microk8s
from RUNTIMES_WITHOUT_CONTAINERD_DROP_IN_SUPPORT to enable dynamic
containerd version checking.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Introduce ContainerdPaths struct and get_containerd_paths() method to
centralize the complex logic for determining containerd configuration
file paths across different Kubernetes distributions.
The new ContainerdPaths struct includes:
- config_file: File to read containerd version from and write to
- backup_file: Backup file path before modification
- imports_file: File to add/remove drop-in imports from (Option<String>)
- drop_in_file: Path to the drop-in configuration file
- use_drop_in: Whether drop-in files can be used
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The JSONPath parser was incorrectly splitting on escaped dots (\.)
causing microk8s detection to fail. Labels like "microk8s.io/cluster"
were being split into ["microk8s\", "io/cluster"] instead of being
treated as a single key.
This adds a split_jsonpath() helper that properly handles escaped dots,
allowing the automatic microk8s detection via the node label to work
correctly.
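A minimal sketch of such a splitter, in Python for brevity. It assumes
that an escaped dot is kept as a literal "." with the backslash
dropped; the real split_jsonpath() helper may differ in that detail,
and the sample path in the usage note is hypothetical.

```python
def split_jsonpath(path: str) -> list:
    """Split on unescaped dots; a backslash-dot becomes a literal dot."""
    parts = []
    current = []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "\\" and i + 1 < len(path) and path[i + 1] == ".":
            current.append(".")  # escaped dot stays inside the key
            i += 2
        elif ch == ".":
            parts.append("".join(current))  # unescaped dot ends a segment
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts
```

With this, a path such as `metadata.labels.microk8s\.io/cluster`
splits into three segments, the last being `microk8s.io/cluster`.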
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The kata-deploy test was using helm_helper which made it hard to debug
failures (die() calls would cause "Executed 0 tests" errors) and added
unnecessary complexity.
The test now calls helm directly like a user would, making it simpler
and more representative of real-world usage. The verification job status
is explicitly checked with proper failure detection instead of relying
on helm --wait.
Timeouts are configurable via environment variables to account for
different network speeds and image sizes:
- KATA_DEPLOY_TIMEOUT (default: 600s)
- KATA_DEPLOY_DAEMONSET_TIMEOUT (default: 300s)
- KATA_DEPLOY_VERIFICATION_TIMEOUT (default: 120s)
Documentation has been added to explain what each timeout controls and
how to customize them.
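Resolving these variables can be sketched as follows, in Python for
brevity. The helper name is hypothetical, and treating an empty value
like an unset one is an assumption of this sketch.

```python
import os

# Defaults as listed above.
DEFAULT_TIMEOUTS = {
    "KATA_DEPLOY_TIMEOUT": "600s",
    "KATA_DEPLOY_DAEMONSET_TIMEOUT": "300s",
    "KATA_DEPLOY_VERIFICATION_TIMEOUT": "120s",
}


def resolve_timeout(name: str, env=None) -> str:
    """Return the override from the environment, else the default."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value else DEFAULT_TIMEOUTS[name]
```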
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The verification job now supports configurable timeouts to accommodate
different environments and network conditions. The daemonset timeout
defaults to 1200 seconds (20 minutes) to allow for large image downloads,
while the verification pod timeout defaults to 180 seconds.
The job now waits for the DaemonSet to exist, pods to be scheduled,
rollout to complete, and nodes to be labeled before creating the
verification pod. A 15-second delay is added after node labeling to
allow kubelet time to refresh runtime information.
Retry logic with 3 attempts and a 10-second delay handles transient
FailedCreatePodSandBox errors that can occur during runtime
initialization. The job only fails on pod errors after a 30-second
grace period to avoid false positives from timing issues.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The retry loop in helm_helper had two bugs:
1. Counter initialized to 10 instead of 0, causing immediate failure
2. Exit condition used -eq instead of -ge, incorrect for loop logic
These bugs would cause helm_helper to fail immediately on the first
retry attempt instead of properly retrying up to max_tries times.
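The corrected loop shape, rendered in Python for illustration (the
actual helper is a shell function; the names here are hypothetical):

```python
def run_with_retries(action, max_tries: int) -> bool:
    """Call action() until it returns True or max_tries attempts are used."""
    tries = 0  # start from zero, not from the limit
    while True:
        if action():
            return True
        tries += 1
        if tries >= max_tries:  # ">=" also stops if the counter overshoots
            return False
```

Starting the counter at the limit would make the very first failure
hit the exit condition, which is the symptom described above.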
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
While looking further into the stale bot for issues, I realised that
our existing stale job needs permissions to work. Unfortunately,
without these permissions the action only logs the problem and still
finishes as successful, which made the issue hard to spot.
Add the required permissions to get this working again and improve
the message.
Also add a concurrency rule to make zizmor happy.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We've had a couple of occasions that Cargo.lock has been out of sync
with Cargo.toml, so try and extend our rust check to pick this up in the CI.
There is probably a more elegant way than doing `cargo check` and
checking for changes, but I'll start with this approach
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Downstream builders at Red Hat complain that `Cargo.lock` doesn't match
`Cargo.toml`.
Run `cargo check` to refresh `Cargo.lock`.
`git bisect` shows that 7cfb97d41b is the first commit where
`cargo check` has an effect in `src/agent`.
Signed-off-by: Greg Kurz <groug@kaod.org>
Add run_bats_tests() function to common.bash that provides consistent
test execution and reporting across all test suites (k8s, nvidia,
kata-deploy).
This removes duplicated test runner code from run_kubernetes_tests.sh,
run_kubernetes_nv_tests.sh, and run-kata-deploy-tests.sh.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The NVIDIA GPU test runner script was not generating test reports,
causing the report_tests() function in gha-run.sh to have nothing
to display. This aligns the script with run_kubernetes_tests.sh by:
- Adding set -o pipefail for proper pipeline error handling
- Creating a reports directory with timestamped subdirectory
- Capturing test output to files with ok-/not_ok- prefixes
- Adding --timing flag to bats for timing information
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's just point to the official documentation rather than explaining
exactly how to deploy (and the current text was very outdated).
Removing fluentd / minikube examples is out of context of this commit.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The runk tool hasn't been supported for a few years, with no maintainers
since ManaSugi stopped being involved in the project and the CI was
disabled in 2024.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This reverts commit 6130d7330f, as we're
officially switching to the rust version of kata-deploy.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
a2534e7bc8 introduced the logic to also
release a kata-tools tarball, but it missed allowing
KATA_TOOLS_STATIC_TARBALL env var to be passed to the release script,
leading to the following error during the release process:
```
ERROR: Invalid environment variable "KATA_TOOLS_STATIC_TARBALL"
```
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
In startVM(), for VMMs without hotplug support (e.g., Firecracker or
QEMU microvm), the runtime runs prestart hooks but misses rescanning
the network namespace. This causes VMs to boot with uninitialized
network configs, as updates from CNI plugins are not captured.
This patch adds a network rescan via AddEndpoints after prestart hooks
for the non-hotplug path, ensuring correct network info is passed to
the VMM configuration before the VM starts.
Fixes: #11500
Signed-off-by: XanderC <xanderc@qq.com>
virtio-9p has not been supported for a long time and, specifically
within runtime-rs, there is no plan to support it. Remove the
related items.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As the Memory Agent feature is not used in CoCo (TDX/SNP) scenarios,
remove the related sections.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce the related Makefile items to enable AMD SEV-SNP settings
in the configuration when running make build, and make it possible
to generate the rendered qemu-snp-runtime-rs configuration from the
*.in template.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To make qemu-runtime-rs work well with CoCo on SEV-SNP platforms,
introduce a dedicated SEV-SNP configuration that helps prepare the
related CVM resources.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Enable measured rootfs in the configuration during make build, and add
some other important items to make the configuration work well.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce the related items in the Makefile to enable Intel TDX
settings in the configuration during make build, and make it possible
to generate the rendered qemu-tdx-runtime-rs configuration from the
*.in template.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To make qemu-runtime-rs with CoCo work well on TDX platforms,
introduce a dedicated TDX configuration that helps prepare the
related CVM resources.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Systemd-managed cgroups use the slice:prefix:name format, which is
not a filesystem path. Calling MoveTo() on such paths fails with
"invalid group path" and can abort cleanup before Delete() runs.
In some cases, this causes pod teardown delays.
Skip MoveTo for systemd-formatted sandbox/overhead cgroup paths when
sandbox_cgroup_only is true; systemd moves tasks on unit deletion.
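The decision described above can be sketched roughly as follows. This
is an illustrative Rust sketch, not the runtime's Go code; the helper
names is_systemd_cgroup_path and needs_manual_move are hypothetical,
and the "exactly three colon-separated parts, no leading slash" check
is an assumed heuristic for the slice:prefix:name format.

```rust
/// Returns true if `path` looks like a systemd cgroup spec of the form
/// `slice:prefix:name` rather than a filesystem path.
/// Assumed heuristic: exactly three colon-separated parts, no leading '/'.
fn is_systemd_cgroup_path(path: &str) -> bool {
    !path.starts_with('/') && path.split(':').count() == 3
}

/// Returns true if tasks have to be moved out explicitly before the
/// cgroup is deleted. For systemd-formatted paths with
/// sandbox_cgroup_only set, the move is skipped because systemd moves
/// the tasks itself when the unit is deleted.
fn needs_manual_move(path: &str, sandbox_cgroup_only: bool) -> bool {
    !(sandbox_cgroup_only && is_systemd_cgroup_path(path))
}
```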
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
With the update of NVRC to v0.1.1, cold-plug becomes by design the
only supported mode, so resolve the remaining references to hot-plug.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Enable post-install verification in kata-deploy CI tests. When
HELM_VERIFY_DEPLOYMENT is set, a simple verification pod is created
that runs with the Kata runtime to confirm deployment succeeded.
The verification pod prints kernel info and exits - success indicates
the Kata runtime is properly configured and functional.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add optional verification that runs after kata-deploy installation.
When a pod spec is provided via --set-file verification.pod=<file>,
a verification job runs after install/upgrade to validate deployment.
The user is fully responsible for the verification pod content:
- Pod name, runtimeClassName, annotations, and verification logic
- Pod must exit 0 on success, non-zero on failure
The verification job simply:
1. Waits for kata-deploy DaemonSet to be ready
2. Applies the user-provided pod spec
3. Waits for the pod to complete
4. Shows logs and cleans up
Usage:
helm install kata-deploy ... \
--set-file verification.pod=/path/to/your-pod.yaml
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
To unlock the release, move the job to publish kata payload after push
to an alternate runner (IBM owned) for ppc64le.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
The new NVRC version works for CC and non-CC use cases,
no --feature confidential needed anymore.
Bump versions.yaml and adjust deployment instructions.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Disable NVDIMM. When using GPU passthrough, using NVDIMM would create
a r/o file-backed memory region. When using a GPU, QEMU tries to DMA-
map guest memory for the device, resulting in a mapping error:
memory listener initialization failed: Region mem0:
vfio_container_dma_map ... -22 (Invalid argument).
For the CC configs, NVDIMM is disabled by default in qemu_amd64.go
with a warning, but we also explicitly disable the setting in the
shim configuration file.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
We don't need to store the kernel headers anymore. We do need to store
the kernel modules, instead.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We've done some bad file based driver determination,
now with versions.yaml there is a single source of truth.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need to package the built modules for the rootfs
to be able to consume it. We package the whole
/lib/modules/$(uname -r) directory strip=2.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We want to have deterministic behaviour and only
one valid driver version acceptable via versions.yaml
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We never actually installed yq in the kernel build;
there are some paths that use yq, but they were never hit.
For the GPU use-case we need to read values from versions.yaml
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In preparation for coco v0.18.0, bump the version of image-rs we use in
agent-ctl to match what we have in versions.yaml.
Drop the snapshotter-overlayfs feature. This was dropped from image-rs
when we removed enclave-cc support.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
Before cutting the Kata release that will be used with CoCo v0.18.0,
let's bump the versions of Trustee and guest-components to latest.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
This is needed as the 580 driver doesn't build against 6.18.x, and the
590 driver is not yet fully working for our case, thus we stick to the
previous version that worked before.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Bump both the kernel and kernel-confidential versions from v6.12.x and
v6.16.x to v6.18.4, aligning with the new LTS release.
Kernel 6.18 introduced several configuration changes that required
updates to our kernel config fragments:
* CRYPTO_FIPS dependencies changed:
- In 6.12: depended on !CRYPTO_MANAGER_DISABLE_TESTS
- In 6.18: now depends on CRYPTO_SELFTESTS (which requires EXPERT)
Added CONFIG_EXPERT=y and CONFIG_CRYPTO_SELFTESTS=y to crypto.conf
to satisfy the new dependency chain.
* CONFIG_EXPERT is a naughty one, as it disables / enables a bunch
of things behind one's back, probably just to prove a point that
it is for experts ;-) ... regardless, a reasonable amount of
options had to be re-added in order to make sure nothing ends
up broken.
* Legacy iptables support:
Kernel 6.18 requires explicit legacy xtables/iptables configs for
IP_NF_* options. Added CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, and CONFIG_IP6_NF_IPTABLES_LEGACY
to netfilter.conf.
* Module signing dependencies:
Added CONFIG_MODULES=y and other required dependencies to
module_signing.conf to ensure MODULE_SIG can be properly enabled.
* Whitelist updates:
- Added CONFIG_NF_CT_PROTO_DCCP (removed in 6.18+)
- Added CONFIG_CRYPTO_SELFTESTS, CONFIG_NETFILTER_XTABLES_LEGACY,
CONFIG_IP_NF_IPTABLES_LEGACY, CONFIG_IP6_NF_IPTABLES_LEGACY
(added in 6.18+, not present in older kernels like 6.12)
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
A few minor changes to the Zensical config that make navigation easier. Also
fixed a couple of bugs with local serving and added some quality of life
features to Zensical.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
This commit adds a Github workflow for building a Github Pages site for the markdown
files in the docs/ directory. Zensical is a new markdown-based static site generation
framework built by the creators of Material for Mkdocs. https://zensical.org/
This commit does not clean the doc structure, so site navigation is initially going to
be messy.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Remove the agent hotplug timeout parameter from the kernel
command line. Having shifted to VFIO cold-plug, this parameter is
no longer needed.
Remove the no longer required parameter for TDX and thus align the
SNP and TDX configurations.
Add a parameter to prevent the kernel from mounting the /dev tmpfs. NVRC
and later on kata-agent attempt this. While kata-agent does not
panic when mounting /dev fails, NVRC makes mounting /dev a hard
requirement.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
set_container_command() previously appended command arguments
one-by-one with
'.command += [...]'. This makes the helper non-idempotent and can
lead to unexpected command arrays when invoked multiple times.
Update the helper to set the full command array in a single yq v4
expression and print the target YAML path plus the command being
applied to simplify debugging when tests fail.
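The difference between appending and setting can be illustrated with a
small Rust sketch. This is not the bats/yq helper itself; the function
names append_command and set_command are hypothetical and only model
the command array as a vector of strings.

```rust
/// Old behaviour: every call appends, so repeated calls grow the array.
fn append_command(command: &mut Vec<String>, args: &[&str]) {
    command.extend(args.iter().map(|a| a.to_string()));
}

/// New behaviour: every call replaces the whole array, so calling it
/// twice with the same arguments yields the same result (idempotent).
fn set_command(command: &mut Vec<String>, args: &[&str]) {
    *command = args.iter().map(|a| a.to_string()).collect();
}
```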
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The pod config file created by new_pod_config() was generated via
mktemp using the template "pod-config.yaml.in.XXX", which produces
filenames that do not end with ".yaml" (e.g. pod-config.yaml.in.ABC).
If the random suffix happens to look like another file extension
(e.g. ".Csv" or ".Xml"), subsequent yq operations on the file fail.
Some helpers and tooling assume the config path ends with ".yaml".
Switch the mktemp template to place the random suffix before the
extension so the returned path always ends with ".yaml".
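As an illustration of the naming change, the Rust sketch below builds
both forms of the file name from a given random part. The function
names and the exact new template ("pod-config.<random>.yaml") are
assumptions for illustration; the real helper uses mktemp in shell.

```rust
/// Old form: the random part comes last, so the name does not end in ".yaml".
fn old_pod_config_name(random: &str) -> String {
    format!("pod-config.yaml.in.{random}")
}

/// New form (assumed template): the random part sits before the
/// extension, so the name always ends in ".yaml".
fn new_pod_config_name(random: &str) -> String {
    format!("pod-config.{random}.yaml")
}
```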
Fixes: #12268, #12319
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This is a suggestion from Choi, so we can easily test with a specific
kubectl version and also easily understand which kubectl version is
being used in case of failure.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This image will be used by our helm charts to verify that a
kata-containers deployment is correct.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enhance the wait_for_migration implementation to reliably wait for
QEMU migration completion and avoid the previous `sleep(280ms)`
delay.
(1) Add an initial fast-path query to return immediately if
migration is already completed/failed/cancelled.
(2) Use a hard deadline to enforce timeouts deterministically.
(3) Implement adaptive polling with backoff and a maximum interval
to reduce QMP load while keeping responsiveness.
(4) Unify migration status handling and return clear errors on
failed/cancelled states.
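The four points above can be sketched as follows. This is a simplified,
synchronous illustration rather than the actual implementation: the
MigrationStatus enum, the `query` closure standing in for the QMP
query, and the concrete intervals are all assumptions.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Simplified migration states (hypothetical; the real type comes from QMP).
#[derive(Debug, Clone, Copy, PartialEq)]
enum MigrationStatus {
    Active,
    Completed,
    Failed,
    Cancelled,
}

/// Poll `query` until migration finishes or `timeout` elapses.
fn wait_for_migration<F>(mut query: F, timeout: Duration) -> Result<(), String>
where
    F: FnMut() -> MigrationStatus,
{
    // (2) Hard deadline computed once up front.
    let deadline = Instant::now() + timeout;
    // (3) Start with a short interval and back off up to a maximum.
    let mut interval = Duration::from_millis(10);
    let max_interval = Duration::from_millis(200);

    loop {
        // (1) The first iteration doubles as the fast path: no sleep
        // happens before the initial query.
        // (4) All terminal states are handled in one place.
        match query() {
            MigrationStatus::Completed => return Ok(()),
            MigrationStatus::Failed => return Err("migration failed".to_string()),
            MigrationStatus::Cancelled => return Err("migration cancelled".to_string()),
            MigrationStatus::Active => {}
        }

        let now = Instant::now();
        if now >= deadline {
            return Err("timed out waiting for migration".to_string());
        }
        // Never sleep past the deadline.
        sleep(interval.min(deadline - now));
        interval = (interval * 2).min(max_interval);
    }
}
```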
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Return information about the current migration process. The command
and its return type are as follows:
{ 'command': 'query-migrate', 'returns': 'MigrationInfo' }
Note that this QEMU API is only available in qapi-rs v0.15+.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The updated versions are as follows:
```
qapi = { version = "0.15", features = ["qmp", "async-tokio-all"] }
qapi-spec = "0.3.2"
qapi-qmp = "0.15.0"
```
This also corrects some corresponding structures.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change the secure_storage_integrity option's default value to true.
With this, integrity protection for encrypted block device contents
will be requested from the confidential data hub by default, see the
agent's cdh_handler_trusted_storage function in rpc.rs.
This behavior can be disabled by explicitly setting the
agent.secure_storage_integrity parameter to 0 or false via kernel
command line parameters.
This will affect the trusted storage implementation for the guest-pull
mechanism, and it will affect future implementations using this code
path, such as implementations for ephemeral secure storage.
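The default-on behaviour with an explicit opt-out could look roughly
like the sketch below. It is not the agent's actual parsing code: the
function name is hypothetical, and treating every value other than
"0" or "false" as enabled is an assumption.

```rust
/// Returns whether integrity protection should be requested, based on
/// the kernel command line. Defaults to true when the parameter is absent.
fn secure_storage_integrity(cmdline: &str) -> bool {
    const KEY: &str = "agent.secure_storage_integrity=";
    for param in cmdline.split_whitespace() {
        if let Some(value) = param.strip_prefix(KEY) {
            // Assumption: only "0" and "false" disable the feature.
            return !matches!(value, "0" | "false");
        }
    }
    true
}
```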
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
In some builds we are seeing:
```
error: could not create temp file /opt/rustup/tmp/r2xu46kwuyc7k2kr_file: Permission denied (os error 13)
```
in the agent-ctl build, so try and port a fix from #12313 to the tools build
to try and resolve this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fixes deploying kata-containers using k3s. The deploy script fails with:
/opt/kata-artifacts/scripts/kata-deploy.sh: line 397: [: too many arguments
Signed-off-by: Federico A. Corazza <git@facorazza.com>
yamllint complains that there is only one space before the comment,
so add a second to prevent this annoying message showing up.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Create a new page for a reference implementation for Kubernetes
using QEMU, the go shim and an NVIDIA rootfs. The new page
contains information on:
- components involved in the NVIDIA (TEE) GPU scenario
- orchestration flow for GPU passthrough scenarios
- deployment guidance
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
- Apply a few structural/grouping changes and improve flow
- Group build sections together
- Move usage examples to last section
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
The following error was observed during virtiofsd static build:
```
error: could not create temp file /opt/rustup/tmp/p44enysfaxwdbvw4_file:
Permission denied (os error 13)
```
This occurs because RUSTUP_HOME and CARGO_HOME were initialized by the
root user during `docker build`, but `cargo build` is executed as a
non-root user via 'docker run --user'.
Ensure these directories are writable by adjusting the permission after
the toolchain installation is complete.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
OVMF build for Intel TDX (aka "TDVF") was disabled in favor of Ubuntu/
CentOS pre-upstream releases of Intel TDX.
See 4292c4c3b1.
It's time to re-enable the build and move runtime configurations to
use it (the latter will be done in a later commit).
This is a partial revert of 4292c4c3b with the following changes:
- Stop calling OVMF for Intel TDX "TDVF" and follow the naming distros
use for TDX enabled build: OVMF.inteltdx.fd.
- Single binary OVMF.inteltdx.fd is supported using -bios QEMU param.
- Secure Boot infrastructure is disabled since Kata does not support it.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
This method is in fact called; just add the `#[allow(dead_code)]`
attribute so that the unit tests pass. The warning looks like:
warning: method `send_message_with_payload` is never used
|
224 | impl<R: Req> Endpoint<R> {
| ------------------------ method in this implementation
...
522 | pub fn send_message_with_payload<T: Sized, P: Sized>(
| ^^^^^^^^^^^^^^^^^^^^^^^^^
|
= note: `#[warn(dead_code)]` on by default
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
warning: unused `std::result::Result` that must be used
-->
src/dragonball/dbs_virtio_devices/src/vhost/vhost_user/net.rs:679:9
|
679 | / VirtioDevice::<Arc<GuestMemoryMmap<()>>, QueueSync,
GuestRegionMmap>::write_config(
680 | | &mut dev, 0, &config,
681 | | );
| |_________^
|
= note: this `Result` may be an `Err` variant, which should be
handled
= note: `#[warn(unused_must_use)]` on by default
help: use `let _ = ...` to ignore the resulting value
|
679 | let _ = VirtioDevice::<Arc<GuestMemoryMmap<()>>,
QueueSync, GuestRegionMmap>::write_config(
| +++++++
warning: unused `std::result::Result` that must be used
-->
src/dragonball/dbs_virtio_devices/src/vhost/vhost_user/net.rs:683:9
|
683 | / VirtioDevice::<Arc<GuestMemoryMmap<()>>, QueueSync,
GuestRegionMmap>::read_config(
684 | | &mut dev, 0, &mut data,
685 | | );
| |_________^
|
= note: this `Result` may be an `Err` variant, which should be
handled
help: use `let _ = ...` to ignore the resulting value
|
683 | let _ = VirtioDevice::<Arc<GuestMemoryMmap<()>>,
QueueSync, GuestRegionMmap>::read_config(
| +++++++
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The warning looks like:
...
warning: variable does not need to be mutable
--> src/dragonball/dbs_virtio_devices/src/vsock/csm/txbuf.rs:217:13
|
217 | let mut tmp: Vec<u8> = vec![0; TxBuf::SIZE - 2];
| ----^^^
| |
| help: remove this `mut`
|
= note: `#[warn(unused_mut)]` on by default
...
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Till k8s 1.34 we could grep by "Started containerd". From k8s 1.35
onwards the event message changed and we should, instead, grep by
"Container started".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
QEMU v10.2.0 was released on December 24th, 2025.
The experimental GPU SNP / TDX are also pointing to v10.2.0 release with
their gpu-{snp,tdx}-20260107 branch.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
sha2 0.9.3 includes the use of cpuid-bool, which was renamed to cpufeatures
around 5 years ago. Try moving to a workspace dependency of sha2
and bumping to the latest version to remediate RUSTSEC-2021-0064
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
While the use-case of Intel QuickAssist (QAT) accelerated crypto
and/or compression with k8s and Kata Containers is still valid,
the setup instructions are outdated:
Starting with Intel Xeon Gen4 (Sapphire Rapids), QAT driver
stack moved to in-tree drivers without a separate SR-IOV VF
driver.
Drop all the setup instructions but keep the use-cases doc
for reference. Users wanting to enable the use-case, should consult
with Intel QAT Device plugins or Intel QAT DRA driver authors.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The nontee job (run-k8s-tests-coco-nontee) for qemu-coco-dev-runtime-rs
is running well and it's time to make it required when the CI runs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Using the built in size_of_val is easier to read and less error-prone
than doing this calculation manually
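For illustration, the two approaches compare as in the sketch below.
The function names and the u32 element type are hypothetical and not
taken from the changed code.

```rust
use std::mem::{size_of, size_of_val};

/// Manual calculation: element count times element size.
fn manual_size(buf: &[u32]) -> usize {
    buf.len() * size_of::<u32>()
}

/// Built-in: the size in bytes of the value the reference points to.
fn builtin_size(buf: &[u32]) -> usize {
    size_of_val(buf)
}
```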
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
#[cfg(feature = "cargo-clippy")] has been deprecated for years,
so should be replaced with `#[cfg(clippy)]`
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
There are many, many null pointer dereferences in the bindgen code
when moving between rust 1.85.1 and 1.86 and no docs of the source
that it was generated from, so try and skip
these tests from running until an SME can look at them @lifupan
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
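A minimal before/after of the change clippy suggests (the function
names and message are made up for illustration):

```rust
/// Before: positional arguments after the format string.
fn describe_old(name: &str, count: usize) -> String {
    format!("{} has {} items", name, count)
}

/// After: the variables are inlined into the format string.
fn describe_new(name: &str, count: usize) -> String {
    format!("{name} has {count} items")
}
```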
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
runtime-rs crates are pulled into kata-ctl and some of these have
bumped recently, so update these in kata-ctl as well
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so ensure our docs include this
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Clippy is recommending that format args are inlined for
better clarity, so update our code to remove these warnings
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
In #12151 the version was bumped in Cargo.toml, but the update was not
done, so run `cargo update -p container-device-interface` to apply it
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Since #12204 was merged, the following error has been observed:
```
bats warning: Executed 1 instead of expected 2 tests
[run_kubernetes_tests.sh:162] ERROR: Tests FAILED from suites: k8s-empty-dirs.bats
```
The cause is that `pod_logs_file` is re-declared as a local variable
in the second test before skipping, which makes it inaccessible
in `teardown()` and leads to an error.
This commit removes the re-declaration of the variable.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The Rust kata-deploy binary calls list_runtimeclasses() during NFD
setup, but the ClusterRole only granted get and patch permissions.
Add the list verb to the runtimeclasses resource permissions to fix
the RBAC error:
runtimeclasses.node.k8s.io is forbidden: User
\"system:serviceaccount:kube-system:kata-deploy-sa\" cannot list
resource \"runtimeclasses\" in API group \"node.k8s.io\" at the
cluster scope
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
KVM is not available on our ARM runners, so skip those tests there,
while keeping the test cases running on machines where KVM is present
and the KVM device is accessible.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Some test cases require interaction with the KVM device; introduce the
skip_if_kvm_unaccessable macro to skip them when it is not accessible.
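One way such a macro could be written is sketched below. Only the macro
name comes from this commit; the body, the kvm_accessible helper, and
the check of opening /dev/kvm read-write are assumptions.

```rust
use std::fs::OpenOptions;

/// Returns true if the device node at `path` can be opened read-write.
fn kvm_accessible(path: &str) -> bool {
    OpenOptions::new().read(true).write(true).open(path).is_ok()
}

/// Return early from the enclosing test function when /dev/kvm is not
/// accessible (hypothetical body).
macro_rules! skip_if_kvm_unaccessable {
    () => {
        if !kvm_accessible("/dev/kvm") {
            eprintln!("skipping test: /dev/kvm is not accessible");
            return;
        }
    };
}

/// Example of a test body guarded by the macro.
fn kvm_dependent_test() {
    skip_if_kvm_unaccessable!();
    // ... the part of the test that talks to KVM would go here ...
}
```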
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Changes in NIM/RAG samples:
- update image references
- update memory requirements, timeouts, model name
- sanitize some of the probes and print-out
Further refinements can be made in the future.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
cargo test was trying to evaluate the documentation comment and failing,
so try and make the comment explicitly text to avoid this
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
A few structs in genpolicy are never constructed, so add
`#[allow(dead_code)]` to prevent this clippy warning
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
In Unicode you can have multi-byte characters, so it's better to
use char_indices than to enumerate the bytes
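A small sketch of why this matters (the function is hypothetical and
not the changed code): char_indices yields byte offsets that always
fall on character boundaries, so they are safe to slice with even when
the string contains multi-byte characters.

```rust
/// Split `s` just before its first uppercase character.
/// char_indices yields (byte offset, char) pairs, so the offset is
/// always a valid split point.
fn split_at_first_upper(s: &str) -> (&str, &str) {
    for (i, c) in s.char_indices() {
        if c.is_uppercase() {
            return s.split_at(i);
        }
    }
    (s, "")
}
```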
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
VirtioBlkCcwDeviceHandler and VirtioBlkCcwHandler
are only constructed on s390x, so add #[cfg(target_arch = "s390x")]
to all the code
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We can use the new Error::other function rather than
Error::new(ErrorKind::Other, ...) and drop our own macro that did this mapping
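For reference, the two forms side by side using std::io::Error (the
wrapper function names are hypothetical):

```rust
use std::io::{Error, ErrorKind};

/// Older spelling: construct the error with an explicit kind.
fn old_style(msg: &str) -> Error {
    Error::new(ErrorKind::Other, msg)
}

/// Newer spelling: Error::other implies ErrorKind::Other.
fn new_style(msg: &str) -> Error {
    Error::other(msg)
}
```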
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fix the following warning:
```
warning: hiding a lifetime that's elided elsewhere is confusing
--> /root/go/src/github.com/kata-containers/kata-containers/src/libs/kata-types/src/utils/u32_set.rs:50:17
|
50 | pub fn iter(&self) -> Iter<u32> {
| ^^^^^ --------- the same lifetime is hidden here
| |
| the lifetime is elided here
|
= help: the same lifetime is referred to in inconsistent ways, making the signature confusing
= note: `#[warn(mismatched_lifetime_syntaxes)]` on by default
help: use `'_` for type paths
|
50 | pub fn iter(&self) -> Iter<'_, u32> {
| +++
```
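The fix suggested by the compiler can be reproduced on a small
stand-in type. The Numbers struct below is hypothetical and is not the
type from u32_set.rs.

```rust
use std::slice::Iter;

/// Hypothetical wrapper around a list of u32 values.
struct Numbers {
    data: Vec<u32>,
}

impl Numbers {
    /// `'_` makes it visible that the returned iterator borrows from `self`.
    fn iter(&self) -> Iter<'_, u32> {
        self.data.iter()
    }
}
```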
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update virtiofsd to its latest release.
Here we also need to update the alpine version used by the builder as we
need a version of musl-dev new enough to have wrappers for pread2 and
pwrite2. Since we are bumping anyway, bump to the latest.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add two attestation tests. The first one sets a resource policy that
requires CPU0 to have an affirming trust level. This is a negative test
which can run on any platform. Setting this policy without setting any
reference values should result in an attestation failure.
Next, a second test will set the same policy, but this time it will use
the journal log to find the QEMU command line from the previous test and
calculate the expected reference values. Currently this is only
supported on SNP using the sev-snp-measure tool, but the same flow
should work on other platforms.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
The five tests all use the same vhost socket path, which could lead
to races between them. Use unique names to avoid this.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Let's bump experimental {tdx,snp} QEMU to the tags created today in the
Confidential Containers repo, which match with QEMU 10.2.0-rc3.
This bump is mostly for early testing what will become 10.2.0, which
will be bumped everywhere then.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
It will address the issue:
"# bats warning: Executed 0 instead of expected 1 tests"
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As each case needs the get_pod_config_dir preparation, it is better
to move it directly into the setup_common method.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To measure the journal duration, clearly print the journal start time
and end time for each case, which helps ensure the journal log covers
the expected period for that case.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
For failure cases in CI, we need to dump the kata log to help
diagnose issues, but currently large log messages mean we only see
part of the log.
Remove the initdata log output and raise the log level to reduce
the amount of log output.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Currently policy_settings_dir is created only when
BATS_TEST_NUMBER == "1",
but delete_tmp_policy_settings_dir "${policy_settings_dir}" is
called in teardown() for every test. This means that for tests
after the first one teardown() may attempt to delete a directory
that was already removed by a previous test, or rely on a value
that does not belong to the current test execution.
Adjust teardown logic so that policy_settings_dir is only deleted
for the first test case (BATS_TEST_NUMBER == "1") and ignored for
subsequent tests. This keeps the original optimization of running
genpolicy only once, while avoiding unnecessary or confusing cleanup
attempts in later test cases.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Previously pod_name was declared as local, so it could not be accessed
within the teardown() function, causing failures.
This commit removes the `local pod_name` declaration to make it a
global variable.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Otherwise we may hit a `no space left on device` when building the rust
kata-deploy binary.
This happens mostly because of the multi-stage build used to generate a
distroless final container.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
There we ensure labels are added to better deal with ownership of the
runtimeclasses. It's not strictly needed here as helm does take care of
the ownership, but also doesn't hurt to follow what seems to be a common
practice.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's shamelessly duplicate the nightly job to have at least nightly
runs using the rust implementation of kata-deploy.
The reason for doing that is to be pragmatic, as pragmatic as possible,
and avoid switching away from the scripts before 3.24.0 release, while
still testing both ways till the switch happens.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Unlike the scripts, which are called as `bash -c ...`, the
kata-deploy rust binary must be invoked directly, as we do not even
have a shell in its container.
For now, the image carrying the rust version has the "-rust"
suffix, which will help us to have both ways being used / tested for a
little while.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
kata-deploy shell script is not THAT bad and, to be honest, it's quite
handy for quick hacks and quick changes. However, it's been
increasingly becoming harder to maintain as it's grown its scope from a
testing tool to the proper project's front door, lacking unit tests, and
with an abundance of complex regular expressions and bashisms to be able
to properly parse the environment variables it consumes.
Moreover, the fact it is a Frankenstein's monster glued together using
python packages, golang binaries, and a distro dependent container makes
the situation VERY HARD to use it from a distroless container (thus,
avoiding security issues), preventing further integration with
components that require a higher standard of security than we've been
requiring.
With everything said, with the help of Cursor (mostly on generating the
test cases), here comes the oxidized version of the script, which runs
from a distroless container image.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The ORAS cache helper needs PUSH_TO_REGISTRY to be set to 'yes' to
push new artifacts to the cache. However, this environment variable
was not being passed to the Docker container during agent, tools, and
busybox builds.
Moreover, for ghcr.io authentication, add support for using GH_TOKEN and
GITHUB_ACTOR as fallbacks when explicit credentials
(ARTEFACT_REGISTRY_USERNAME/PASSWORD) are not provided.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The GPG key used for gperf was incorrectly set to the busybox
maintainer's key (Denis Vlasenko) instead of the gperf maintainer's
key (Marcel Schaible).
Wrong key (busybox): C9E9416F76E610DBD09D040F47B70C55ACC9965B
Denis Vlasenko <vda.linux@googlemail.com>
Correct key (gperf): EDEB87A500CC0A211677FBFD93C08C88471097CD
Marcel Schaible <marcel.schaible@studium.fernuni-hagen.de>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
kata-remote is a runtime class that cloud-api-adaptor relies on to work.
kata-remote by itself does nothing, and that's the reason it's disabled
by default. We're only adding it here so cloud-api-adaptor charts can
simply do something like `--set shims.remote.enabled=true`.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When updating ephemeral storages, MS_REMOUNT is explicitly passed as,
for instance, `/dev/shm` should be remounted after memory is hotplugged.
Till now Kata Containers has been explicitly ignoring such updates,
leading to the containers' `/dev/shm` having the size of "half of the
memory allocated, during the startup time", which goes against the
expected behaviour.
Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>
We're only releasing those for amd64 as that's the only architecture
we've been building the packages for.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's ensure we can create a specific "tools" tarball, which will help
those who only need to pull those either for testing or production
usage.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
After the runtime-rs workspace was merged into the root workspace, the
features passed when building runtime-rs need to be refactored so that
they are correctly propagated. Taking dragonball as an example,
runtime-rs requires runtimes to depend on the virt_containers feature,
and virt_containers needs to handle hypervisor features specifically.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
After the workspace integration of runtime-rs, now the output of
runtime-rs is under the repo root, instead of src/runtime-rs. Change the
TARGET_PATH accordingly to tell the Makefile where to look up the output.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Some cases in the dragonball crates require interaction with the KVM
module to complete, which requires root privilege. Skip those tests
when running as a non-root user.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
MMIODeviceInfo inside the test module of dbs_boot on aarch64 is used for
testing purpose, but `pub` attribute requires it to have documentation.
Since this is used only for testing purpose, let's allow missing_docs
for it.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
The test set of the dbs_utils tap module is missing the test attribute,
which makes dev-dependencies unusable. Mark the tap tests as a test module.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
This is a follow-up of 3fbe693.
Remove runtime-rs from the exclude list, and make it a member of the
root workspace.
Specify shim and shim-ctl as the binaries of the runtime-rs package,
and bring runtime-rs and all its members into the root workspace.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Make runtime-rs a package that produces shim and shim-ctl as its binary
products, which enables the Makefile to work after it is incorporated
into the root workspace.
Signed-off-by: Ruoqing He <heruoqing@iscas.ac.cn>
Initialize the storage options with the original rootfs options.
Additionally, for XFS, append nouuid to the mount options if it is not
already present.
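The XFS handling can be sketched as follows. This is an illustration
only; the function name, the "xfs" string comparison, and representing
the options as a Vec<String> are assumptions.

```rust
/// For XFS, make sure "nouuid" is among the mount options, without
/// adding it a second time.
fn ensure_nouuid(fstype: &str, options: &mut Vec<String>) {
    if fstype == "xfs" && !options.iter().any(|o| o == "nouuid") {
        options.push("nouuid".to_string());
    }
}
```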
Signed-off-by: shezhang.lau <shezhang.lau@antgroup.com>
To protect against upstream download failures for gperf and busybox,
implement ORAS-based caching to GHCR.
This adds:
- download-with-oras-cache.sh: Core helper for downloading with cache
- populate-oras-tarball-cache.sh: Script to manually populate cache
- warn() function to lib.sh for consistency
Modified build scripts to:
- Try ORAS cache first (from ghcr.io/kata-containers/kata-containers)
- Fall back to upstream download on cache miss
- Automatically push to cache when PUSH_TO_REGISTRY=yes
The cache is automatically populated during CI builds, and parallel
architecture builds check for existing versions before pushing to avoid
race conditions.
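The cache-first flow described above can be modelled roughly as below.
The real logic lives in shell scripts; this Rust sketch uses closures
as stand-ins for the ORAS pull, the upstream download, and the ORAS
push, and all names are hypothetical. Treating a failed push as an
error is also an assumption.

```rust
/// Try the cache first, fall back to upstream on a miss, and optionally
/// push the freshly downloaded artifact back to the cache.
fn fetch_tarball<C, U, P>(
    mut from_cache: C,
    mut from_upstream: U,
    mut push_to_cache: P,
    push_to_registry: bool,
) -> Result<Vec<u8>, String>
where
    C: FnMut() -> Option<Vec<u8>>,
    U: FnMut() -> Result<Vec<u8>, String>,
    P: FnMut(&[u8]) -> Result<(), String>,
{
    // Cache hit: upstream is never contacted.
    if let Some(data) = from_cache() {
        return Ok(data);
    }
    // Cache miss: download from upstream.
    let data = from_upstream()?;
    // Only populate the cache when explicitly requested.
    if push_to_registry {
        push_to_cache(&data)?;
    }
    Ok(data)
}
```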
Forks benefit from upstream cache but can override with their own:
ARTEFACT_REPOSITORY=myorg/kata make agent-tarball
Generated-By: Cursor IDE with Claude
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The runtime handles the fsGroup field of the pod security context by
adding a mount option to the generated storage object [1]. This commit
changes genpolicy to expect this option.
Instead of passing another side input to
yaml::get_container_mounts_and_storages, we pass the entire PodSpec.
This reduces the necessary changes in the pod-generating resources and
allows for possible future use of other PodSpec fields.
[1]: https://github.com/kata-containers/kata-containers/blob/0c6fcde1/src/runtime/virtcontainers/kata_agent.go#L1620-L1625
Fixes: #11934
Signed-off-by: Markus Rudy <mr@edgeless.systems>
I've seen this happening with the GPU SNP CI every now and then, but I
don't really understand how this was not caught by the TDX / SNP CI
themselves before.
In any case, the error seen is:
```
Error from server (Forbidden): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"nfd.k8s-sigs.io/v1alpha1\",\"kind\":\"NodeFeatureRule\",\"metadata\":{\"annotations\":{},\"name\":\"amd64-tee-keys\"},\"spec\":{\"rules\":[{\"extendedResources\":{\"sev-snp.amd.com/esids\":\"@cpu.security.sev.encrypted_state_ids\"},\"labels\":{\"amd.feature.node.kubernetes.io/snp\":\"true\"},\"matchFeatures\":[{\"feature\":\"cpu.security\",\"matchExpressions\":{\"sev.snp.enabled\":{\"op\":\"Exists\"}}}],\"name\":\"amd.sev-snp\"},{\"extendedResources\":{\"tdx.intel.com/keys\":\"@cpu.security.tdx.total_keys\"},\"labels\":{\"intel.feature.node.kubernetes.io/tdx\":\"true\"},\"matchFeatures\":[{\"feature\":\"cpu.security\",\"matchExpressions\":{\"tdx.enabled\":{\"op\":\"Exists\"}}}],\"name\":\"intel.tdx\"}]}}\n"}}}
to:
Resource: "nfd.k8s-sigs.io/v1alpha1, Resource=nodefeaturerules", GroupVersionKind: "nfd.k8s-sigs.io/v1alpha1, Kind=NodeFeatureRule"
Name: "amd64-tee-keys", Namespace: ""
for: "/opt/kata-artifacts/node-feature-rules/x86_64-tee-keys.yaml": error when patching "/opt/kata-artifacts/node-feature-rules/x86_64-tee-keys.yaml": nodefeaturerules.nfd.k8s-sigs.io "amd64-tee-keys" is forbidden: User "system:serviceaccount:kube-system:kata-deploy-sa" cannot patch resource "nodefeaturerules" in API group "nfd.k8s-sigs.io" at the cluster scope
```
And the fix is as simple as allowing patching and updating a
nodefeaturerule in our service account RBAC.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Since the CI issue for s390x was resolved on Dec 5th,
the nightly test result has gone green for 10 consecutive days.
This commit puts the e2e tests for s390x again into the required job list.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Let's remove the deprecated features that were marked for removal
after Kata Containers 3.23.0:
kata-deploy.sh:
- Remove non-arch-specific variable fallbacks (SHIMS, DEFAULT_SHIM,
SNAPSHOTTER_HANDLER_MAPPING, ALLOWED_HYPERVISOR_ANNOTATIONS,
PULL_TYPE_MAPPING, EXPERIMENTAL_FORCE_GUEST_PULL). Each arch now
has its own default value.
- Remove CREATE_RUNTIMECLASSES and CREATE_DEFAULT_RUNTIMECLASS
variables and associated functions (create_runtimeclasses,
delete_runtimeclasses, adjust_shim_for_nfd). RuntimeClasses are
now managed by Helm chart, not the daemonset script.
- Unsupported architectures now fail with an error instead of
falling back to non-arch-specific defaults.
Helm chart:
- Remove all deprecated env values (createRuntimeClasses,
createDefaultRuntimeClass, debug, shims, shims_*, defaultShim,
defaultShim_*, allowedHypervisorAnnotations, snapshotterHandlerMapping,
snapshotterHandlerMapping_*, agentHttpsProxy, agentNoProxy,
pullTypeMapping, pullTypeMapping_*, _experimentalSetupSnapshotter,
_experimentalForceGuestPull, _experimentalForceGuestPull_*).
- Remove backward compatibility code from _helpers.tpl that checked
for legacy env values.
- Remove legacy env.shims check from runtimeclasses.yaml.
- Remove CREATE_RUNTIMECLASSES and CREATE_DEFAULT_RUNTIMECLASS env
vars from kata-deploy.yaml and post-delete-job.yaml.
- Update RBAC to only include runtimeclasses get/patch permissions
(needed for NFD patching), removing create/delete/list/update/watch.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
- Replace generic errors in sandbox operations with typed SandboxError variants (InvalidContainerId, InitProcessNotFound, InvalidExecId).
- This enables the kata shim to handle specific failure cases differently.
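A minimal sketch of what such typed variants could look like. The variant
names come from this commit message; the `String` payloads, the messages,
and the `is_invalid_id` helper are assumptions made for illustration:

```rust
use std::error::Error;
use std::fmt;

/// Typed sandbox errors. The payloads are hypothetical.
#[derive(Debug, PartialEq, Eq)]
pub enum SandboxError {
    InvalidContainerId(String),
    InitProcessNotFound(String),
    InvalidExecId(String),
}

impl fmt::Display for SandboxError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SandboxError::InvalidContainerId(id) => write!(f, "invalid container id: {}", id),
            SandboxError::InitProcessNotFound(id) => {
                write!(f, "init process not found for container: {}", id)
            }
            SandboxError::InvalidExecId(id) => write!(f, "invalid exec id: {}", id),
        }
    }
}

impl Error for SandboxError {}

/// Hypothetical caller-side check: a caller can branch on the variant
/// instead of matching on an error string.
pub fn is_invalid_id(err: &SandboxError) -> bool {
    matches!(
        err,
        SandboxError::InvalidContainerId(_) | SandboxError::InvalidExecId(_)
    )
}
```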
Fixes: #12120
Signed-off-by: Adeet Phanse <adeet.phanse@mongodb.com>
Improve error handling in runtime-rs for the case where the sandbox itself is killed and recreated.
- Update the kill_process function to skip sending a signal when the process is stopped.
- Always set ProcessStatus::Stopped even when wait_process fails
- In state_process return synthetic state for sandbox container when using Sandbox API
Fixes: #12120
Signed-off-by: Adeet Phanse <adeet.phanse@mongodb.com>
Align with other test logic - declare the KATA_HYPERVISOR in the
run bash script, then declare the RUNTIME_CLASS_NAME variable in
the bats files.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Now that we have a more restrictive resource policy for KBS, let
us start adopting it across all NVIDIA test cases. This policy was
previously introduced by the NVIDIA attestation test.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
It aims to upgrade rtnetlink to mitigate netlink log noise.
This commit upgrades the `rtnetlink` dependency (and corresponding
libraries like `netlink-packet-route`) to address excessive and
unnecessary netlink-related logging during sandbox startup.
Problem:
The previously used `rtnetlink v0.16` (depending on `netlink-proto
v0.11.3`) generates a high volume of DEBUG/INFO level netlink messages
during sandbox initialization. This noise:
1. Overloads the logging system, often leading to warnings like
"slog-async: logger dropped messages due to channel overflow."
2. Interferes with effective troubleshooting by distracting developers
from legitimate Kata errors.
Solution:
We upgrade to `rtnetlink v0.19` (and `netlink-proto v0.12`), as testing
confirms that the latest versions have correctly elevated the verbosity
of these netlink internal events to the TRACE level.
This change significantly enhances the log analysis experience by
suppressing unnecessary network-related logs during startup.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
With these changes, we create pod security policies when running
against NVIDIA TEE GPU handlers where AUTO_GENERATE_POLICY is set.
For the non-TEE GPU tests, the added functions bail out by design.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Following existing patterns, we adapt the common policy settings
for NVIDIA GPU CI platforms. For instance, for our CI runners, we
use containerd 2.x.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Enable auto-generate policy for qemu-nvidia-gpu-* if the user
didn't specify an AUTO_GENERATE_POLICY value.
Setting this in run_kubernetes_nv_tests.sh is too late as
gha-run.sh calls into run_tests, setup.sh, and then into
create_common_genpolicy_settings() where the rules.rego and
genpolicy-settings file are being copied to the right locations.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add one valid test case with 2 GPUs with proper VFIO device
entries and CDI annotations.
Add seven test cases with invalid combinations of VFIO device
entries and CDI annotations.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add rules for vfio passthrough GPUs. When creating the security
policy document, parse GPU resource limits and derive CDI
annotation patterns and VFIO device entries.
With various values for CDI annotations and device paths being
runtime-dependent, use regular expressions.
For now, this enables passthrough of NVIDIA GPUs, but the changes
are designed to allow for other VFIO device types.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Document the block device specific annotations for num_queues and
queue_size, which are dedicated to runtime-rs, to help users set the
two parameters.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit introduces the capability to dynamically configure
`queue_size` and `num_queues` parameters via Pod annotations.
Currently, `kata-runtime` allows for static configuration of
`queue_size` and `num_queues` for block devices through its config
file. However, a critical issue arises when a Pod is allocated fewer
CPU cores than the statically configured `num_queues` value. In such
scenarios, the Pod fails to start, leading to operational instability
and limiting flexibility in resource allocation.
To address this, this feature enables users to override the default
queue_size and num_queues parameters by specifying them in Pod
annotations. This allows for fine-grained control and dynamic adjustment
of these parameters based on the specific resource allocation of a Pod.
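The override could be sketched as follows. The annotation keys are the ones
listed in the annotations document; `BlockQueueConfig` and
`apply_annotations` are hypothetical names, not the runtime-rs
implementation:

```rust
use std::collections::HashMap;

const NUM_QUEUES_KEY: &str = "io.katacontainers.config.hypervisor.block_device_num_queues";
const QUEUE_SIZE_KEY: &str = "io.katacontainers.config.hypervisor.block_device_queue_size";

/// Hypothetical holder for the two block device parameters.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlockQueueConfig {
    pub num_queues: usize,
    pub queue_size: u32,
}

/// Start from the statically configured defaults and let Pod annotations
/// override them. A value that fails to parse is reported as an error.
pub fn apply_annotations(
    defaults: BlockQueueConfig,
    annotations: &HashMap<String, String>,
) -> Result<BlockQueueConfig, String> {
    let mut cfg = defaults;
    if let Some(v) = annotations.get(NUM_QUEUES_KEY) {
        cfg.num_queues = v
            .parse::<usize>()
            .map_err(|e| format!("invalid {}: {}", NUM_QUEUES_KEY, e))?;
    }
    if let Some(v) = annotations.get(QUEUE_SIZE_KEY) {
        cfg.queue_size = v
            .parse::<u32>()
            .map_err(|e| format!("invalid {}: {}", QUEUE_SIZE_KEY, e))?;
    }
    Ok(cfg)
}
```

With this shape, a Pod that is allocated fewer CPU cores than the static
`num_queues` value can lower the value through its annotations.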
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The runner is down for a few weeks. I may end up bringing in my personal
runner, but I'm not confident I can easily do this before the holidays,
thus I'm skipping the tests for now.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Set the attestation policy for GPU0 to affirming. This requires
the GPU, for instance, to have production properties, such as
properly signed VBIOS firmware.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
As this CI job has been failing continuously, temporarily skip it on
the s390x platform. It will be re-enabled once the related issues
have been addressed.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As the default enable_annotations in runtime-rs differs from the one in
runtime-go, align it with the configuration in runtime-go.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit refactors the vCPU resource management within runtime's
`CpuResource` structure and related calculation logic to use
floating-point numbers (`f32`) instead of integers (`u32`).
This migration is necessary to fully support the fractional vCPU
allocation introduced in the `kata-types` library, ensuring better
precision in:
1. Allocation Tracking: `current_vcpu` now tracks the precise
fractional value (e.g., 1.5 vCPUs).
2. Resource Calculation: `calc_cpu_resources` now returns a precise
`f32` sum of container vCPU requests, including normalization logic
based on the maximum period, removing the previous integer rounding
steps in the calculation.
3. Hypervisor Interaction: The integer vCPU requirement for the
hypervisor remains, so `ceil()` is now explicitly applied only when
interacting with the hypervisor or agent APIs
(`do_update_cpu_resources`, `current_vcpu`, `online_cpu_mem`).
The key changes are:
1. `CpuResource::current_vcpu` updated from `u32` to `f32`.
2. `calc_cpu_resources` return type changed from `u32` to `f32`.
3. CPU hotplug logic now uses `f32` for the target vCPU count and applies
`ceil()` before calling `hypervisor.resize_vcpu()`.
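A simplified sketch of the fractional calculation, assuming each container
is reduced to a (quota, period) pair. `fractional_vcpus` and
`vcpus_for_hypervisor` are hypothetical names, and the normalization based
on the maximum period is left out here:

```rust
/// Sum the vCPU requests of all containers as a fractional value.
/// Each entry is a hypothetical (quota, period) pair; entries with a
/// non-positive quota (unlimited) or a zero period are skipped.
pub fn fractional_vcpus(containers: &[(i64, u64)]) -> f32 {
    containers
        .iter()
        .filter(|(quota, period)| *quota > 0 && *period > 0)
        .map(|(quota, period)| *quota as f32 / *period as f32)
        .sum()
}

/// Round up only at the boundary to the hypervisor, which still needs an
/// integer vCPU count.
pub fn vcpus_for_hypervisor(fractional: f32) -> u32 {
    fractional.ceil() as u32
}
```

For example, two containers requesting 0.5 and 1.0 vCPUs are tracked as 1.5
and only become 2 when the hypervisor is asked to resize.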
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactors `LinuxContainerCpuResources` and `LinuxSandboxCpuResources`
to track calculated vCPU allocation using `f64` (fractional float)
instead of `u64` (milliseconds).
This ensures more precise resource calculation (`quota / period`) and
aggregation by avoiding rounding errors inherent in millisecond-based
integer tracking.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit updates the non-TEE tests to disable two specific test
cases: `k8s-number-cpus.bats` and `k8s-sandbox-vcpus-allocation.bats`.
These tests are designed to cover CPU elasticity/dynamic scaling
capabilities. In the non-TEE scenario, we are enforcing the disabling of
this capability by setting the default configuration to
`static_sandbox_resource_mgmt=true`.
Although the tests currently pass, allowing them to run is logically
inconsistent with the intended non-TEE configuration. Therefore, we are
disabling them for all non-TEE runtimes, specifically targeting:
- `qemu-coco-dev`
- `qemu-coco-dev-runtime-rs`
This change ensures that our non-TEE CI accurately reflects the static
resource management policy and prevents misleading test results.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As runtime-rs doesn't support block device hotplug on the s390
architecture, skip the test when running on s390.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To support this feature, the corresponding item in the Makefile must be
enabled. It can be set at build time, like this:
`DEFSTATICRESOURCEMGMT_QEMU := false`
Users who don't want this feature can set the option to true in
configuration.toml.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Enable the cpu hotplug tests within the k8s-number-cpus.bats for both
cloud-hypervisor and qemu-runtime-rs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We now support CPU hotplug in dragonball and clh, so this commit
enables the corresponding test in the CI.
Fixes: #8660
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This case was skipped because of earlier failures. Now that CPU hotplug
has been fixed, it's time to re-enable it.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add test cases for the IOMMUFDID method that check what happens when
non-IOMMUFD paths are passed. The method should do the right
thing.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Logging the QMP commands gives us a lot of flexibility to
troubleshoot issues with what is being sent to QEMU.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
An import cycle was introduced because of a mutual need
for the constant that describes the prefix of IOMMUFD files.
We need to extract this out into a higher-level package.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
The QMP commands sent to QEMU did not properly set up
IOMMUFD objects in the codepath that handles VFIO device
hot-plugging. This is mainly relevant in the Kubernetes
use-case where the VFIO devices are not available when
QEMU is first launched.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
The function assumes that the runner is an Ubuntu machine, which so far
has been true as part of our CI.
However, the new ARM runner is running on Debian, and those mirror
additions would simply break.
With this in mind, for any distro that's not ubuntu, let's just make
sure to inform the owner of the system to have bats already installed as
part of the environment provided.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This reverts commit 5a81b010f2, as we now
have all the infrastructure properly set up as part of our CI node.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Remove the existing containerd guest pull stability tests workflow
as we're going to rebuild all the VMs used for testing and introduce
new, more focused stability tests for nydus-snapshotter.
The new tests will be added soon, as part of another PR.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Now that we've bumped to QEMU 10.2.0-rc1, we can take advantage of a fix
that's present there, which fixes the double memory allocation for the
cases where GPUs are being cold-plugged.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We've made the pods require a ridiculous amount of memory, just for the
sake of getting them running.
Now that those are running, tests are passing, CI is required, let's
work to lower the amount of memory needed as everything else is working
as expected.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Clean-up shellcheck warnings:
SC2030 (info): Modification of cmd_out is local (to subshell caused by (..) group).
SC2031 (info): cmd_out was modified in a subshell. That change might be lost.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Clean-up shellcheck warnings:
SC2250 (style): Prefer putting braces around variable references even
when not strictly required.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Let's add a simple backup and restore logic for the CDI configuration
file nvidia.com-pgpu.yaml in the k8s-nvidia-*.bats and
k8s-confidential-attestation.bats test files.
Although not optimal, this is a temporary workaround needed until
NVIDIA releases what's needed for the GPU Operator to properly deal with
cold plugged devices for the Confidential Containers cases, which is
work in progress right now.
After that's released, we can revert/drop this patch.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Let's bump experimental {tdx,snp} QEMU to the tags created today in the
Confidential Containers repo, which match QEMU 10.2.0-rc1.
This bump is especially beneficial for us, as we can get rid of QEMU's
double memory allocation when **cold plugging** a GPU.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
If the sandbox has cold-plugged an IOMMUFD device but the
device plugin sends us a /dev/vfio/<NUM> device, we need to
check whether the IOMMUFD device and the VFIO device are the same.
We have the sibling.BDF; we now need to extract the BDF of the
devPath, which is either /dev/vfio/<NUM> or /dev/vfio/devices/vfio<NUM>.
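To illustrate the two path shapes, here is a small sketch that only
classifies a devPath. It is written in Rust for illustration, the names
`VfioPath` and `classify_vfio_path` are hypothetical, and the actual BDF
lookup for the classified device is not shown:

```rust
/// The two devPath shapes mentioned above.
#[derive(Debug, PartialEq, Eq)]
pub enum VfioPath {
    /// /dev/vfio/<NUM>
    Group(u32),
    /// /dev/vfio/devices/vfio<NUM>
    IommufdCdev(u32),
}

/// Classify a devPath and extract <NUM>. Returns None for anything that
/// matches neither shape.
pub fn classify_vfio_path(dev_path: &str) -> Option<VfioPath> {
    // Check the longer prefix first, since it also starts with /dev/vfio/.
    if let Some(rest) = dev_path.strip_prefix("/dev/vfio/devices/vfio") {
        return rest.parse::<u32>().ok().map(VfioPath::IommufdCdev);
    }
    if let Some(rest) = dev_path.strip_prefix("/dev/vfio/") {
        return rest.parse::<u32>().ok().map(VfioPath::Group);
    }
    None
}
```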
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Bump the github.com/sirupsen/logrus version to 1.9.3
across our components where it is back-level to bring us
up-to-date and resolve high severity CVE-2025-65637
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add the attestation bats test case to the NVIDIA CI and provide a
second pod manifest for the attestation test with a GPU. This will
enable composite attestation in a subsequent step.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Bump to pull in a fix for composite attestation with GPUs. The new
commit ID corresponds to the fix (change for default GPU policy),
currently being the top commit of the main branch.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This brings two fixes:
- use the test_key variable to check against the aatest value.
- properly check the run command invocation (run w/o bash does not
seem to like the pipe, which leads to ALWAYS evaluating the
status result to 1). With this, the deny-all test would ALWAYS
succeed regardless of whether aatest was actually returned or not.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
When running these tests repeatedly locally, the default policy is not
being reset after the test completes, then subsequent runs fail.
Similar to k8s-sealed-secrets.bats, we set the default policy in an if
condition.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This allows setting a GPU0 resource policy, enabling GPU
attestation tests to not use the default resource policy.
For now, the policy requires attestation's ear status to
not be contraindicated. In a future change we will require
this to be affirming once our CI runners' vBIOS version is
properly configured.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This enables attestation tests to figure out whether composite
attestation with a GPU can be executed.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add the NVIDIA TEE hypervisors. With this, attestation tests can be run
against the NVIDIA handlers, for instance.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This reverts commit e4a13b9a4a, as it
caused some issues with the GPU workflows.
Reverting it is better, as it unblocks other PRs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
vfio-ap passthrough has been introduced for runtime-rs,
requiring that the existing test verify this new functionality.
This commit adds:
- containerd config specific to runtime-rs
- extensions to the existing test functions to cover vfio-ap
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The following changes have been made for the enablement:
1. Make `MediatedPci` and `MediatedAp` in `VfioDeviceType`
2. Make HostDevice without BDF for `MediatedAp`
3. Add `CCW` to VFioBusMode and set it to VfioConfig as `bus_type`
4. Return `vfio-ap` driver type for `CCW` bus type
5. Set `bus_mode` for `VfioDevice` based on `bus_type`
6. Set `vfio-ap` to the agent device's `field_type`
7. Prepare a different argument for `vfio-ap` for QMP command
8. Set None to all PCI relevant fields
Please keep in mind that `vfio-ap` does not belong to any
types of port topologies like PCI (e.g., root or switch)
because devices on s390x are controlled by CCW.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Until now, we relied on `VMROOTFSDRIVER` to determine
whether a system uses a native CCW bus.
However, this method is not canonical and can be error-prone
depending on the configuration.
This commit introduces a new function that checks
for the presence of CCW bus infrastructure in sysfs
and verifies that native mainframe drivers are available.
It replaces all previous uses of the old detection method.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add the small and normal variants of the qemu-runtime-rs
tests to the required-tests list now that they are stable.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
An oci-spec can be passed to the runtime without annotations
(e.g., `ctr run`). In this case, runtime panics with:
```
src/runtime-rs/crates/runtimes/src/manager.rs:391: called `Option::unwrap()` on a `None` value
```
This commit checks if the annotation is None, and instantiates
the hashmap as an empty map if it is missing. It also adds a None
check for `netns`.
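The pattern can be sketched like this. `Spec` is a hypothetical stand-in
for the OCI spec type, not the real one:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an OCI spec whose annotations are optional.
pub struct Spec {
    pub annotations: Option<HashMap<String, String>>,
}

/// Return the annotations, or an empty map when the spec carries none,
/// instead of calling `unwrap()` on a possibly missing value.
pub fn annotations_or_empty(spec: &Spec) -> HashMap<String, String> {
    spec.annotations.clone().unwrap_or_default()
}
```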
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Currently, the protection device configuration is constructed
automatically even if `confidential_guest` is not set.
This commit puts a condition to check the flag and allows the
construction accordingly.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Updates to the shim-v2 build and the binaries.sh script.
Making sure that both variants "confidential" AND
"nvidia-gpu-confidential" are handled.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Create an initial version of our toolchain policy as agreed in
Architecture Committee meetings and the PTG
Fixes: #9841
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
As tags are mutable and digests are not, let's pin our image
by digest to give our CI a better chance of stability.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Swap out the hard-coded nginx registry and versions for reading
the test image details from version.yaml,
which can also ensure that the quay.io mirror is used
rather than the docker hub versions, which can hit pull limits
- Try setting imagePullPolicy Always to fix issues with the arm CI
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Using make tarball targets for tools locally, binaries may exist
for both debug and release builds. In this case, cryptic errors
are shown as we try to install multiple binaries.
This change requires exactly one binary to be found and errors out
in other cases.
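The "exactly one" check can be sketched as follows. This is written in Rust
for illustration only, and `single_binary` is a hypothetical name:

```rust
use std::path::{Path, PathBuf};

/// Accept a candidate list only if it contains exactly one binary, and
/// report a clear error otherwise (for example when both a debug and a
/// release build exist).
pub fn single_binary(candidates: &[PathBuf]) -> Result<&Path, String> {
    match candidates {
        [only] => Ok(only.as_path()),
        [] => Err("no binary found".to_string()),
        many => Err(format!(
            "expected exactly one binary, found {}: {:?}",
            many.len(),
            many
        )),
    }
}
```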
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
When tests regress, the CI wait time can increase significantly
with the current kubectl_retry attempt logic. Thus, align with
other tests and remove kubectl_retry invocations. Instead, rely on
proper timeouts.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
SEV-SNP machine is failing due to nydus not being deployed in the
machine.
We cannot easily contact the maintainers due to the US Holidays, and I
think this should become a criterion for not adding a machine as
required again (different regions coverage).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
So far we've only been building the initrd for the nvidia rootfs.
However, we're also interested in having the image being used for a few
use-cases.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We hit a case that gatekeeper was failing due to thinking the WIP check
had failed, but since it ran the PR had been edited to remove that from
the title. We should listen to edits and unlabels of the PR to ensure that
gatekeeper doesn't get outdated in situations like this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When using the multiInstallSuffix we must be cautious about using the shim
name, as qemu-nvidia-gpu* doesn't actually have a matching QEMU itself,
but should rather be mapped to:
qemu-nvidia-gpu -> qemu
qemu-nvidia-gpu-snp -> qemu-snp-experimental
qemu-nvidia-gpu-tdx -> qemu-tdx-experimental
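The mapping above can be written as a small lookup. This is a sketch in
Rust for illustration, with a hypothetical function name; passing any other
shim name through unchanged is an assumption:

```rust
/// Map a shim name to the QEMU it should use. The three NVIDIA entries
/// come from the list above. Returning other names unchanged is an
/// assumption made for this sketch.
pub fn qemu_for_shim(shim: &str) -> &str {
    match shim {
        "qemu-nvidia-gpu" => "qemu",
        "qemu-nvidia-gpu-snp" => "qemu-snp-experimental",
        "qemu-nvidia-gpu-tdx" => "qemu-tdx-experimental",
        other => other,
    }
}
```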
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Fixes: #12123
`include` in #12069, introduced to choose a different runner
based on component, leads to another set of redundant jobs
where `matrix.command` is empty.
This commit gets back to the `runs-on` solution, but makes
the condition human-readable.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Containerd configuration syntax (`config.toml`) varies across versions,
requiring per-version logic for fields like `runtime`.
However, testing confirms that containerd LTS (1.7.x) and newer
versions fully support the v3 schema for the nydus remote snapshotter.
This commit changes the previous containerd v1 settings in `config.toml`.
Instead, it introduces a unified v3-style configuration for nydus, which
can be valid for LTS and active containerd versions.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In the CoCo tests jobs @wainersm created a report tests step
that summarises the jobs, so they are easier to understand and
get results for. This is very useful, so let's roll it out to all the bats
tests.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
QEMU netdev_add QMP command requires the 'mq' (multi-queue) argument
to be of boolean type (`true` / `false`). In runtime-rs the virtio-net
device hotplug logic currently passes a string value (e.g. "on"/"off"),
which causes QEMU to reject the command:
```
Invalid parameter type for 'mq', expected: boolean
```
This patch modifies `hotplug_network_device` to insert 'mq' as a proper
boolean value of `true`. This fixes sandbox startup failures when
multi-queue is enabled.
Fixes: #12136
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
| [`runk`](src/tools/runk) | utility | Standard OCI container runtime based on the agent. |
| [`ci`](.github/workflows) | CI | Continuous Integration configuration files and scripts. |
| [`ocp-ci`](ci/openshift-ci/README.md) | CI | Continuous Integration configuration for the OpenShift pipelines. |
| [`katacontainers.io`](https://github.com/kata-containers/www.katacontainers.io) | Source for the [`katacontainers.io`](https://www.katacontainers.io) site. |
pushd "${katacontainers_repo_dir}/tools/packaging/kata-deploy" || { echo "Failed to push to ${katacontainers_repo_dir}/tools/packaging/kata-deploy"; exit 125; }
As a community we want to strike a balance between having up-to-date toolchains, to receive the
latest security fixes and to be able to benefit from new features and packages, whilst not being
too bleeding edge and disrupting downstream and other consumers. As a result we have the following
guidelines (note, not hard rules) for our go and rust toolchains that we are attempting to try out:
## Go toolchain
Go is released [every six months](https://go.dev/wiki/Go-Release-Cycle) with support for the
[last two major release versions](https://go.dev/doc/devel/release#policy). We always want to
ensure that we are on a supported version so we receive security fixes. To try and make
things easier for some of our users, we aim to be using the older of the two supported major
versions, unless there is a compelling reason to adopt the newer version.
In practice this means that we bump our major version of the go toolchain every six months to
version (1.x-1) in response to a new version (1.x) coming out, which makes our current version
(1.x-2) no longer supported. We will bump the minor version whenever required to satisfy
dependency updates, or security fixes.
Our go toolchain version is recorded in [`versions.yaml`](../versions.yaml) under
`.languages.golang.version` and should match with the version in our `go.mod` files.
## Rust toolchain
Rust has a [six week](https://doc.rust-lang.org/book/appendix-05-editions.html#:~:text=The%20Rust%20language%20and%20compiler,these%20tiny%20changes%20add%20up.)
release cycle and they only support the latest stable release, so if we wanted to remain on a
supported release we would only ever build with the latest stable and bump every 6 weeks.
However feedback from our community has indicated that this is a challenge as downstream consumers
often want to get rust from their distro, or downstream fork and these struggle to keep up with
the six week release schedule. As a result the community has agreed to try out a policy of
"stable-2", where we aim to build with a rust version that is two versions behind the latest stable
version.
In practice this should mean that we bump our rust toolchain every six weeks, to version
1.x-2 when 1.x is released as stable and we should be picking up the latest point release
of that version, if there were any.
The rust-toolchain that we are using is recorded in [`rust-toolchain.toml`](../rust-toolchain.toml).
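
As a worked example of the two guidelines, the sketch below computes the
target version from the latest release, treating a version `1.x` as the
number `x`. The function names are hypothetical and the version numbers in
the usage note are made up:

```rust
/// Go guideline: aim for the older of the two supported versions,
/// i.e. 1.(x-1) when 1.x is the latest release.
pub fn go_target(latest: u32) -> Option<u32> {
    latest.checked_sub(1)
}

/// Rust guideline ("stable-2"): aim for two versions behind the latest
/// stable, i.e. 1.(x-2) when 1.x is the latest stable release.
pub fn rust_target(latest_stable: u32) -> Option<u32> {
    latest_stable.checked_sub(2)
}
```

For instance, if the latest Go release were 1.30, the guideline would point
at 1.29, and if the latest stable Rust were 1.100, it would point at 1.98.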
@@ -50,7 +50,7 @@ There are several kinds of Kata configurations and they are listed below.
| `io.katacontainers.config.hypervisor.default_max_vcpus` | uint32| the maximum number of vCPUs allocated for the VM by the hypervisor |
| `io.katacontainers.config.hypervisor.default_memory` | uint32| the memory assigned for a VM by the hypervisor in `MiB` |
| `io.katacontainers.config.hypervisor.default_vcpus` | float32| the default vCPUs assigned for a VM by the hypervisor |
| `io.katacontainers.config.hypervisor.disable_block_device_use` | `boolean` | disallow a block device from being used |
| `io.katacontainers.config.hypervisor.disable_block_device_use` | `boolean` | disable hotplugging host block devices to guest VMs for container rootfs |
| `io.katacontainers.config.hypervisor.disable_image_nvdimm` | `boolean` | specify if a `nvdimm` device should be used as rootfs for the guest (QEMU) |
| `io.katacontainers.config.hypervisor.disable_vhost_net` | `boolean` | specify if `vhost-net` is not available on the host |
| `io.katacontainers.config.hypervisor.enable_hugepages` | `boolean` | if the memory should be `pre-allocated` from huge pages |
@@ -97,6 +97,8 @@ There are several kinds of Kata configurations and they are listed below.
| `io.katacontainers.config.hypervisor.use_legacy_serial` | `boolean` | uses legacy serial device for guest's console (QEMU) |
| `io.katacontainers.config.hypervisor.default_gpus` | uint32 | the minimum number of GPUs required for the VM. Only used by remote hypervisor to help with instance selection |
| `io.katacontainers.config.hypervisor.default_gpu_model` | string | the GPU model required for the VM. Only used by remote hypervisor to help with instance selection |
| `io.katacontainers.config.hypervisor.block_device_num_queues` | `usize` | The number of queues to use for block devices (runtime-rs only) |
| `io.katacontainers.config.hypervisor.block_device_queue_size` | uint32 | The size of the queue to use for block devices (runtime-rs only) |
- [NVIDIA GPUs](NVIDIA-GPU-passthrough-and-Kata.md) and [Enabling NVIDIA GPU workloads using GPU passthrough with Kata Containers](NVIDIA-GPU-passthrough-and-Kata-QEMU.md)
- confidential computing guest components: the attestation agent,
confidential data hub and api-server-rest binaries
- CRI-O pause container (for the guest image-pull method)
- BusyBox utilities (provides a base set of libraries and binaries, and a
linker)
- some supporting files, such as file containing a list of supported GPU
device IDs which NVRC reads
#### UVM orchestration flow
When the Kata runtime asks QEMU to launch the VM, the UVM's Linux kernel
boots and mounts the root filesystem. After this, NVRC starts as the initial
process.
NVRC scans for NVIDIA GPUs on the PCI bus, loads the
NVIDIA kernel modules, waits for driver initialization, creates the device nodes,
and initializes the GPU hardware (using the `nvidia-smi` binary). NVRC also
creates the guest-side CDI specification file (using the
`nvidia-ctk cdi generate` command). This file specifies devices of
`kind: nvidia.com/gpu`, i.e., GPUs appearing to be physical GPUs on regular
bare metal systems. The guest CDI specification also contains `containerEdits`
for each device, specifying device nodes (e.g., `/dev/nvidia0`,
`/dev/nvidiactl`), library mounts, and environment variables to be mounted
into the container which receives the passthrough GPU.
Then, NVRC forks the Kata agent while continuing to run as the
init system. This allows NVRC to handle ongoing GPU management tasks
while kata-agent focuses on container lifecycle management. See the
[NVRC sources](https://github.com/NVIDIA/nvrc/blob/main/src/main.rs) for an
overview of the steps carried out by NVRC.
When the Kata runtime sends the create container request, the Kata agent
parses the inner runtime CDI annotation. For example, for the inner runtime
annotation `"cdi.k8s.io/vfio1": "nvidia.com/gpu=0"`, the agent looks up device
`0` in the guest CDI specification with `kind: nvidia.com/gpu`.
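The lookup described above amounts to splitting the annotation value at `=` into a CDI kind and a device name. The following is a minimal shell sketch of that split only; the helper name `parse_cdi_device` is hypothetical and is not part of the Kata agent, which implements its own logic in Rust.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Kata): split a CDI device reference such
# as "nvidia.com/gpu=0" into its kind and its device name.
parse_cdi_device() {
    local ref="$1"
    # A usable reference needs text on both sides of an '='.
    case "$ref" in
        ?*=?*) ;;
        *) echo "invalid CDI device reference: $ref" >&2; return 1 ;;
    esac
    local kind="${ref%%=*}"   # everything before the first '='
    local name="${ref#*=}"    # everything after the first '='
    printf 'kind=%s name=%s\n' "$kind" "$name"
}
```

For the annotation value from the example, `parse_cdi_device 'nvidia.com/gpu=0'` yields the kind `nvidia.com/gpu` and the device name `0`, which are the two keys used for the lookup in the guest CDI specification.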
The Kata agent also reads the guest CDI specification's `containerEdits`
section and injects relevant contents into the OCI spec of the respective
container. The Kata agent then creates and starts a `rustjail` container
based on the final OCI spec. The container now has relevant device nodes,
binaries and low-level libraries available, and can start a user application
linked against the CUDA runtime API (e.g., `libcudart.so` and other
libraries). When used, the CUDA runtime API in turn calls the CUDA driver
API and kernel drivers, interacting with the pass-through GPU device.
An additional step is exercised in our CI samples: when using images from an
authenticated registry, the guest-pull mechanism triggers attestation using
Trustee's Key Broker Service (KBS) for secure release of the NGC API
authentication key used to access the NVCR container registry. As part of
this, the attestation agent exercises composite attestation and transitions
the GPU into `Ready` state (without this, the GPU has to explicitly be
transitioned into `Ready` state by passing the `nvrc.smi.srs=1` kernel
parameter via the shim config, causing NVRC to transition the GPU into the
`Ready` state).
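When relying on the `nvrc.smi.srs=1` parameter, it can be useful to confirm that it actually reached the guest kernel command line. The sketch below assumes the command line is available as a string (for example the contents of `/proc/cmdline` read from within the guest, where a debug shell is available); the helper name `has_kernel_param` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if PARAM appears as a whole,
# whitespace-separated word in the kernel command line CMDLINE.
has_kernel_param() {
    local param="$1" cmdline="$2" word
    local -a words
    # Split on whitespace without triggering pathname expansion.
    read -r -a words <<< "$cmdline"
    for word in "${words[@]}"; do
        if [ "$word" = "$param" ]; then
            return 0
        fi
    done
    return 1
}
```

A possible use is `has_kernel_param nvrc.smi.srs=1 "$(cat /proc/cmdline)"`. The comparison is on whole words, so a value such as `nvrc.smi.srs=10` does not count as a match.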
## Deployment Guidance
This guidance assumes you use bare-metal machines with proper support for
Kata's non-TEE and TEE GPU workload deployment scenarios for your Kubernetes
nodes. We provide guidance based on the upstream Kata CI procedures for the
NVIDIA GPU CI validation jobs. Note that this setup:
- uses the guest image pull method to pull container image layers
- uses the genpolicy tool to attach Kata agent security policies to the pod
manifest
- has dedicated (composite) attestation tests, a CUDA vectorAdd test, and a
NIM/RA test sample with secure API key release
A similar deployment guide and scenario description can be found in NVIDIA resources
under
[Early Access: NVIDIA GPU Operator with Confidential Containers based on Kata](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/confidential-containers.html).
### Requirements
The requirements for the TEE scenario are:
- Ubuntu 25.10 as host OS
- CPU with AMD SEV-SNP support with proper BIOS/UEFI version and settings
- CC-capable Hopper/Blackwell GPU with proper VBIOS version.
BIOS and VBIOS configuration is out of scope for this guide. Other resources,
> **Note**: If you see a message similar to the above, the BAR space of the NVIDIA
> GPU has been successfully allocated.
## NVIDIA vGPU mode with Kata Containers
NVIDIA vGPU is a licensed product on all supported GPU boards. A software license
is required to enable all vGPU features within the guest VM. NVIDIA vGPU manager
needs to be installed on the host to configure GPUs in vGPU mode. See [NVIDIA Virtual GPU Software Documentation v14.0 through 14.1](https://docs.nvidia.com/grid/14.0/) for more details.
### NVIDIA vGPU time-sliced
In the time-sliced mode, the GPU is not partitioned and the workload uses the
whole GPU and shares access to the GPU engines. Processes are scheduled in
series. The best-effort scheduler is the default and can be replaced by
other scheduling policies; see the documentation above for how to do that.
If `MIG` was previously enabled on the GPU, disable it before using
`time-sliced` `vGPU`.
```sh
$ sudo nvidia-smi -mig 0
```
Enable the virtual functions for the physical GPU in the `sysfs` file system.
Create the GPU instances that correspond to the `vGPU` types of the `MIG-backed`
`vGPUs` that you will create; see [NVIDIA A100 PCIe 80GB Virtual GPU Types](https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#vgpu-types-nvidia-a100-pcie-80gb).
```sh
# MIG 1g.10gb --> vGPU A100D-1-10C
$ sudo nvidia-smi mig -cgi 19
```
List the GPU instances and get the GPU instance id to create the compute
instance.
```sh
$ sudo nvidia-smi mig -lgi # list the created GPU instances
$ sudo nvidia-smi mig -cci -gi 9 # each GPU instance can have several compute
# instances. Instance -> Workload
```
Verify that the compute instances were created within the GPU instance.
Repeat the steps after the [snippet](#list-all-available-vgpu-instances) listing
to create the corresponding `mdev` device and use the guest `OS` created in the
previous section with `time-sliced` `vGPUs`.
## Install NVIDIA Driver + Toolkit in Kata Containers Guest OS
### Build Guest OS with NVIDIA Driver and Toolkit
Consult the [Developer-Guide](https://github.com/kata-containers/kata-containers/blob/main/docs/Developer-Guide.md#create-a-rootfs-image) on how to create a
rootfs base image for a distribution of your choice. This is going to be used as
@@ -583,9 +308,12 @@ Enable the `guest_hook_path` in Kata's `configuration.toml`
guest_hook_path="/usr/share/oci/hooks"
```
As the last step one can remove the additional packages and files that were added
to the `$ROOTFS_DIR` to keep it as small as possible.
We have now built an NVIDIA rootfs and kernel, and can run any GPU container
without installing the drivers into the container. Check the NVIDIA device status
with `nvidia-smi`:
```sh
$ sudo ctr --debug run --runtime "io.containerd.kata.v2" --device /dev/vfio/192 --rm -t "docker.io/nvidia/cuda:11.6.0-base-ubuntu20.04" cuda nvidia-smi
for security (cryptography) and compression. These instructions cover the
steps for the latest [Ubuntu LTS release](https://ubuntu.com/download/desktop)
which already includes the QAT host driver. These instructions can be adapted to
any Linux distribution. These instructions guide the user on how to download
the kernel sources, compile kernel driver modules against those sources, and
load them onto the host as well as preparing a specially built Kata Containers
kernel and custom Kata Containers rootfs.
for security (cryptography) and compression. Kata Containers can enable
these acceleration functions for containers using QAT SR-IOV with the
support from [Intel QAT Device Plugin for Kubernetes](https://github.com/intel/intel-device-plugins-for-kubernetes)
or [Intel QAT DRA Resource Driver for Kubernetes](https://github.com/intel/intel-resource-drivers-for-kubernetes).
* Download kernel sources
* Compile Kata kernel
* Compile kernel driver modules against those sources
* Download rootfs
* Add driver modules to rootfs
* Build rootfs image
## Helpful Links before starting
## More Information
[Intel® QuickAssist Technology at `01.org`](https://www.intel.com/content/www/us/en/developer/topic-technology/open/quick-assist-technology/overview.html)
@@ -26,554 +22,6 @@ kernel and custom Kata Containers rootfs.
[Intel Device Plugin for Kubernetes](https://github.com/intel/intel-device-plugins-for-kubernetes)
[Intel DRA Resource Driver for Kubernetes](https://github.com/intel/intel-resource-drivers-for-kubernetes)
[Intel® QuickAssist Technology for Crypto Poll Mode Driver](https://dpdk-docs.readthedocs.io/en/latest/cryptodevs/qat.html)
## Steps to enable Intel® QAT in Kata Containers
There are some steps to complete only once, some steps to complete with every
reboot, and some steps to complete when the host kernel changes.
## Script variables
The following list of variables must be set before running through the
scripts. These variables refer to locations to store modules and configuration
files on the host and links to the drivers to use. Modify these as
needed to point to updated drivers or different install locations.
### Set environment variables (Every Reboot)
Make sure to check [`01.org`](https://www.intel.com/content/www/us/en/developer/topic-technology/open/quick-assist-technology/overview.html) for
$ sudo sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' /etc/default/grub
$ sudo update-grub
$ sudo reboot
```
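The `sed` expression above only matches when `GRUB_CMDLINE_LINUX_DEFAULT` is empty. For hosts where the variable already holds other parameters, the following is a minimal sketch of an idempotent alternative. The helper name `add_grub_param` is hypothetical, and the sketch assumes GNU `sed`, a double-quoted value on a single line, and a parameter without `|`, `&`, or backslash characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE
# unless it is already present. Assumes GNU sed and a value of the form
# GRUB_CMDLINE_LINUX_DEFAULT="..." on a single line.
add_grub_param() {
    local file="$1" param="$2" current updated
    # Bail out if the variable is not defined in the expected quoted form.
    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=".*"$' "$file" || return 1
    # Extract the current value between the double quotes.
    current="$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file" | head -n 1)"
    # Already present as a whole word: leave the file untouched.
    case " $current " in
        *" $param "*) return 0 ;;
    esac
    # Append, adding a separating space only if the value was non-empty.
    updated="${current:+$current }$param"
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\".*\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"$updated\"|" "$file"
}
```

Run as root, `add_grub_param /etc/default/grub intel_iommu=on` would take the place of the `sed` call, with `update-grub` and the reboot still required afterwards. Trying the function on a copy of the file first is a cheap way to check the result.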
### Download Intel® QAT drivers
This will download the [Intel® QAT drivers](https://www.intel.com/content/www/us/en/developer/topic-technology/open/quick-assist-technology/overview.html).
Make sure to check the website for the latest version.
```bash
$ mkdir -p "$QAT_SRC"
$ cd "$QAT_SRC"
$ curl -L "$QAT_DRIVER_URL" | tar zx
```
### Copy Intel® QAT configuration files and enable virtual functions
Modify the instructions below as necessary if using a different Intel® QAT hardware
platform. You can learn more about customizing configuration files at the
In addition, containerd expects the binary to be in `/usr/local/bin`, so add
these small wrapper scripts, which redirect to the Kata shim so that either
QEMU or Cloud Hypervisor can be used with Kata.
```bash
$ echo '#!/usr/bin/env bash' | sudo tee /usr/local/bin/containerd-shim-kata-qemu-v2
$ echo 'KATA_CONF_FILE=/opt/kata/share/defaults/kata-containers/configuration-qemu.toml /opt/kata/bin/containerd-shim-kata-v2 "$@"' | sudo tee -a /usr/local/bin/containerd-shim-kata-qemu-v2
$ echo '#!/usr/bin/env bash' | sudo tee /usr/local/bin/containerd-shim-kata-clh-v2
$ echo 'KATA_CONF_FILE=/opt/kata/share/defaults/kata-containers/configuration-clh.toml /opt/kata/bin/containerd-shim-kata-v2 "$@"' | sudo tee -a /usr/local/bin/containerd-shim-kata-clh-v2
Some files were not shown because too many files have changed in this diff