Commit Graph

18387 Commits

Author SHA1 Message Date
Fabiano Fidêncio
a12e0f1204 build: cache: Take NVRC & NVAT version into consideration
Without those, we'd end up pulling the same / old rootfs that's cached
without re-building it in case of a bump in any of those components.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-08 10:14:11 +02:00
RuoqingHe
a4fb9aef54 Merge pull request #12789 from kata-containers/pin-actions-rs-toolchain
gha: Pin action for cargo-deny workflow
2026-04-08 08:36:13 +08:00
Fabiano Fidêncio
995767330d Merge pull request #12782 from pavithiran34/pavi-ras-version-update
fix: updated image-rs to v0.18.0
2026-04-07 23:32:05 +02:00
Aurélien Bombo
8916f5f301 gha: Pin action for cargo-deny workflow
The cargo-deny workflow should be the last workflow to not use a pinned version.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-04-07 15:41:09 -05:00
pavithiran34
528fa80953 fix: updated image-rs to v0.18.0
- Updated image-rs from rev 026694d4 to tag v0.18.0
- This update brings rsa 0.9.10 which fixes CVE-2026-21895
- Resolves vulnerability in indirect dependencies

Signed-off-by: pavithiran34 <pavithiran.p@ibm.com>
2026-04-07 21:40:01 +02:00
Fabiano Fidêncio
b3ae6ef99c Merge pull request #12760 from fitzthum/bump-nvat
Bump trustee and guest-components to add nvswitch / ppcie support
2026-04-07 19:07:50 +02:00
Aurélien Bombo
79fab93041 Merge pull request #12779 from rophy/fix/strip-cr-from-tty-exec
tests: strip \r from kubectl exec output for TTY containers
2026-04-07 10:19:21 -05:00
Tobin Feldman-Fitzthum
e40abcf72d nvidia: add nvrc.smi.srs=1 to default nvidia kernel params
The attestation-agent no longer sets nvidia devices to ready
automatically. Instead, we should use nvrc for this. Since this is
required for all nvidia workloads, add it to the default nv kernel
params.

With bounce buffers, the timing of attesting a device versus setting it
to ready is not so important.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-07 14:28:50 +00:00
Manuel Huber
0fd4559f7e docs: Update NVIDIA GPU passthrough QEMU scenario
Updates for the NVIDIA GPU passthrough scenario for the
kata-containers release 3.29.0.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-07 14:58:40 +02:00
Tobin Feldman-Fitzthum
7385938c57 tests: fix default KBS Policy path
We recently moved the default policy in the Trustee repo. Now it's in
the same place as all the other policies. Update the test code to match.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-07 05:46:27 +00:00
Tobin Feldman-Fitzthum
38e04bb6d8 versions: bump guest-components for switch attestation
Pick up the new version of guest-components which uses NVAT bindings
instead of NVML bindings. This will allow us to attests guests with
nvswitches.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-07 05:46:27 +00:00
RuoqingHe
feaec78ad0 Merge pull request #12776 from fidencio/topic/kata-deploy-move-into-the-root-workspace
kata-deploy: Move into the root workspace
2026-04-07 12:45:26 +08:00
Fabiano Fidêncio
461907918d kata-deploy: pin nydus-snapshotter via versions.yaml
Resolve externals.nydus-snapshotter version and url in the Docker image build
with yq from the repo-root versions.yaml instead of Dockerfile ARG defaults.

Drop the redundant workflow that only enforced parity between those two sources.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-07 10:07:06 +08:00
Fabiano Fidêncio
9e1f595160 kata-deploy: add Rust binary to root workspace
Add tools/packaging/kata-deploy/binary as a workspace member, inherit shared
dependency versions from the root manifest, and refresh Cargo.lock.

Build the kata-deploy image from the repository root: copy the workspace
layout into the rust-builder stage, run cargo test/build with -p kata-deploy,
and adjust artifact and static asset COPY paths. Update the payload build
script to invoke docker buildx with -f .../Dockerfile from the repo root.

Add a repo-root .dockerignore to keep the Docker build context smaller.
Document running unit tests with cargo test -p kata-deploy from the root.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-07 10:07:06 +08:00
Rophy Tsai
f7d9024249 tests: strip \r from kubectl exec output for TTY containers
The busybox-pod.yaml test fixture sets tty: true on the second
container. When a container has a TTY, kubectl exec may return
\r\n line endings. The invisible \r causes string comparisons
to fail:

  container_name=$(kubectl exec ... -- env | grep CONTAINER_NAME)
  [ "$container_name" == "CONTAINER_NAME=second-test-container" ]

This comparison fails because $container_name contains a trailing
\r character.

Fix by piping through tr -d '\r' after grep. This is harmless
when \r is absent and fixes the mismatch when present.

Fixes: #9136

Signed-off-by: Rophy Tsai <rophy@users.noreply.github.com>
2026-04-07 01:35:10 +00:00
Alex Lyn
46a7b9e75d Merge pull request #12775 from RuoqingHe/put-libs-to-root-workspace
libs: Move libs into root workspace
2026-04-07 09:25:26 +08:00
Tobin Feldman-Fitzthum
3d60196735 versions: bump Trustee to pickup PPCIE support
Trustee is compatible with old guest components (using NVML bindings) or
new guest components (using NVAT). If we have the new version of gc, we
can attest PPCIE guests, which we need the new version of Trustee to
verify.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-06 17:51:12 +00:00
Tobin Feldman-Fitzthum
0444d70704 rootfs: add runtime support for NVAT
Update NVIDIA rootfs builder to include runtime dependencies for NVAT
Rust bindings.

The nvattest package does not include the .so file, so we need to build
from source.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-06 17:51:12 +00:00
Tobin Feldman-Fitzthum
78c61459f8 packaging: add built-time support for NVAT
The attestation agent will soon rely on the NVAT rust bindings, which
have some built-time dependencies.

There is currently no nvattest-dev package, so we need to build from
source to get the headers and .so file.

Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
2026-04-06 17:51:12 +00:00
Dan Mihai
9b770793ba Merge pull request #12728 from manuelh-dev/mahuber/empty-dir-fsgrou-policy
genpolicy: adjust GID after passwd GID handling and set fs_group for encrypted emptyDir volumes
2026-04-06 10:22:34 -07:00
Fabiano Fidêncio
47770daa3b helm: Align values.yaml with try-kata-nvidia-gpu.values.yaml
We've switched to nydus there, but never did for the values.yaml.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-06 18:51:54 +02:00
Fabiano Fidêncio
1300145f7a tests: add k3s/rke2 to OCI 1.3.0 drop-in overlay condition
k3s and rke2 ship containerd 2.2.2, which requires the OCI 1.3.0
drop-in overlay. Move them from the separate OCI 1.2.1 branch into
the OCI 1.3.0 condition alongside nvidia-gpu, qemu-snp, qemu-tdx,
and custom container engine versions.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-06 18:50:20 +02:00
Fabiano Fidêncio
0a739b3b55 Merge pull request #12755 from katexochen/runtime-rs-config-cleanup
runtime-rs: cleanup config
2026-04-06 13:14:58 +02:00
Ruoqing He
cb7c790dc7 libs: Specify crates explicitly in Makefile
--all option would trigger building and testing for everything within
our root workspace, which is not desired here. Let's specify the crates
of libs explicitly in our Makefile.

Signed-off-by: Ruoqing He <ruoqing.he@lingcage.com>
2026-04-06 11:03:38 +02:00
Ruoqing He
2a024f55d0 libs: Move libs into root workspace
Remove libs from exclude list, and move them explicitly into root
workspace to make sure our core components are in a consistent state.

This is a follow up of #12413.

Signed-off-by: Ruoqing He <ruoqing.he@lingcage.com>
2026-04-06 11:03:38 +02:00
Fabiano Fidêncio
9a2825a429 runtime: config: Use OVMF for the qemu-nvidia-gpu
2ba0cb0d4a7 did the ground work for using OVMF even for the
qemu-nvidia-gpu, but missed actually setting the OVMF path to be used,
which we'e fixing now.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-06 03:54:56 +02:00
Fabiano Fidêncio
e1fae11509 Merge pull request #12392 from Apokleos/enhance-tdx
runtime-rs: Enhance TDX in qemu
2026-04-05 20:54:43 +02:00
Alex Lyn
35cafe8715 runtime-rs: configure TDX machine options with kernel_irqchip=split
When TDX confidential guest support is enabled, set `kernel_irqchip=split`
for TDX CVM:
...
-machine \
   q35,accel=kvm,kernel_irqchip=split,confidential-guest-support=tdx \
...

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-04-05 10:18:47 +02:00
Fabiano Fidêncio
f074ceec6d Merge pull request #12682 from PiotrProkop/fix-direct-io-kata
runtime-rs: fix setting directio via config file
2026-04-03 16:11:57 +02:00
Fabiano Fidêncio
945aa5b43f Merge pull request #12774 from zvonkok/bump-nvrc
nvrc: Bump to the latest Release
2026-04-03 15:39:01 +02:00
Fabiano Fidêncio
ccfdf5e11b Merge pull request #12754 from llink5/fix/docker26-networking-9340
runtime: fix Docker 26+ networking by rescanning after Start
2026-04-03 13:15:38 +02:00
RuoqingHe
26bd5ad754 Merge pull request #12762 from YutingNie/fix-runtime-rs-shared-fs-typo
runtime-rs: Fix typo in share_fs error message
2026-04-03 15:24:33 +08:00
Yuting Nie
517882f93d runtime-rs: Fix typo in share_fs error message
There's a typo in the error message which gets prompted when an
unsupported share_fs was configured. Fixed shred -> shared.

Signed-off-by: Yuting Nie <yuting.nie@spacemit.com>
2026-04-03 05:23:46 +00:00
Alex Lyn
4a1c2b6620 Merge pull request #12309 from kata-containers/stale-issues-by-date
workflows: Create workflow to stale issues based on date
2026-04-03 09:31:34 +08:00
Zvonko Kaiser
3e23ee9998 nvrc: Bump to the latest Release
v0.1.4 has a bugfix for nvrc.log=trace which is now
optional.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-04-02 17:40:47 -04:00
llink5
f7878cc385 runtime: fix Docker 26+ networking by rescanning after Start
Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).

Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.

Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
  field index, fixing parsing with optional mount tags (shared:,
  master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
  args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
  to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start

Fixes: #9340

Signed-off-by: llink5 <llink5@users.noreply.github.com>
2026-04-02 21:23:16 +02:00
Fabiano Fidêncio
09194d71bb Merge pull request #12767 from nubificus/fix/fc-rs
runtime-rs: Fix FC API fields
2026-04-02 18:24:35 +02:00
Manuel Huber
dd868dee6d tests: nvidia: onboard NIM service test
Onboard a test case for deploying a NIM service using the NIM
operator. We install the operator helm chart on the fly as this is
a fast operation, spinning up a single operand. Once a NIM service
is scheduled, the operator creates a deployment with a single pod.

For now, the TEE-based flow uses an allow-all policy. In future
work, we strive to support generating pod security policies for the
scenario where NIM services are deployed and the pod manifest is
being generated on the fly.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-02 16:58:54 +02:00
Manuel Huber
57e42b10f1 tests: nvidia: Do not use elevated privileges
Do not run the NIM containers with elevated privileges. Note that,
using hostPath requires proper host folder permissions, and that
using emptyDir requires a proper fsGroup ID.
Once issue 11162 is resolved, we can further refine the securityContext
fields for the TEE manifests.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:23:26 -07:00
Manuel Huber
a762b136de tests: generate policy for pod-empty-dir-fsgroup
The logic in the k8s-empty-dirs.bats file missed to add a security
policy for the pod-empty-dir-fsgroup.yaml manifest. With this change,
we add the policy annotation.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:23:26 -07:00
Manuel Huber
43489f6d56 genpolicy: fs_group for encrypted emptyDir volumes
The shim uses Storage.fs_group on block/scsi encrypted emptyDir while
genpolicy used fsgid= in options and null fs_group, leading to
denying CreateContainerRequest when using block-encrypted emptyDir in
combination with fsGroup. Thus, emit fs_group in that scenario and keep
fsgid= for the existing shared-fs/local emptyDir behavior.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:23:26 -07:00
Manuel Huber
9923f251f5 genpolicy: adjust GID after passwd GID handling
After pod runAsUser triggers passwd-based GID resolution, genpolicy
clears AdditionalGids and inserts only the primary GID.
PodSecurityContext fsGroup and supplementalGroups get cleared, so
policy enforcement would deny CreateContainer when the runtime
includes those when specified.

This change applies fsGroup/supplementalGroups once in
get_container_process via apply_pod_fs_group_and_supplemental_groups.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:23:25 -07:00
Steve Horsman
58101a2166 Merge pull request #12656 from stevenhorsman/actions/checkout-bump
workflows: Update actions/checkout version
2026-04-01 17:34:39 +01:00
Fabiano Fidêncio
75df4c0bd3 Merge pull request #12766 from fidencio/topic/kata-deploy-avoid-kata-pods-to-crash-after-containerd-restart
kata-deploy: Fix kata-deploy pods crashing if containerd restarts
2026-04-01 18:28:16 +02:00
Steve Horsman
2830c4f080 Merge pull request #12746 from ldoktor/ci-helm2
ci.ocp: Use helm deployment for peer-pods
2026-04-01 17:13:21 +01:00
Lukáš Doktor
55a3772032 ci.ocp: Add note about external tests to README.md
to run all the tests that are running in CI we need to enable external
tests. This can be a bit tricky so add it into our documentation.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-04-01 16:59:33 +01:00
Lukáš Doktor
3bc460fd82 ci.ocp: Use helm deployment for peer-pods
replace the deprecated CAA deployment with helm one. Note that this also
installs the CAA mutating webhook, which wasn't installed before.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-04-01 16:59:33 +01:00
PiotrProkop
67af63a540 runtime-rs: fix setting directio via config file
This fix applies the config file value as a fallback when block_device_cache_direct annotation is not explicitly set on the pod.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-04-01 16:59:04 +02:00
Anastassios Nanos
02c82b174a runtime-rs: Fix FC API fields
A FC update caused bad requests for the runtime-rs runtime when
specifying the vcpu count and block rate limiter fields.

Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
2026-04-01 14:50:51 +00:00
Fabiano Fidêncio
2131147360 tests: add kata-deploy lifecycle tests for restart resilience and cleanup
Add functional tests that cover two previously untested kata-deploy
behaviors:

1. Restart resilience (regression test for #12761): deploys a
   long-running kata pod, triggers a kata-deploy DaemonSet restart via
   rollout restart, and verifies the kata pod survives with the same
   UID and zero additional container restarts.

2. Artifact cleanup: after helm uninstall, verifies that RuntimeClasses
   are removed, the kata-runtime node label is cleared, /opt/kata is
   gone from the host filesystem, and containerd remains healthy.

3. Artifact presence: after install, verifies /opt/kata and the shim
   binary exist on the host, RuntimeClasses are created, and the node
   is labeled.

Host filesystem checks use a short-lived privileged pod with a
hostPath mount to inspect the node directly.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 15:20:53 +02:00