Commit Graph

17320 Commits

Author SHA1 Message Date
Fabiano Fidêncio
c75a46d17f tests: Do not enable NFD on s390x
As we're failing on the uninstall, which seems related to a bug on NFD
itself, but I don't have access to a s390x machine to debug, let's skip
the enablement for now and re-enable it once we've experimented with it
more on s390x.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
67e38e0f92 tests: Do not enable NFD on cbl-mariner
As we're failing to install NFD on CBL Mariner, let's skip the
enablement there, and enable it once we've experimented with it more there.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
1bc873397b tests: Use NFD as part of the tests
As we have the ability to deploy NFD as a sub-chart of our chart, let's
make sure we test it during our CI.

We had to increase the timeout values, where we had timeouts set, to
deploy / undeploy kata, as now NFD is also deployed / undeployed.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
ebe15d154e kata-deploy: Add NFD as a dependency
Let's ensure that we add NFD as a weak dependency of the kata-deploy
helm chart.

What we're doing for now is leaving it up to the user / admin to enable
it, and if enabled then we do an explicit check for virtualization
support (x86_64 only for now).

In case NFD is already deployed, we fail the installation (in case it's
enabled on the kata-deploy helm chart) with a clear error message to the
user.

While I know that kata-remote **DOES NOT** require virtualization, I've
left this out (with a comment for when we add a peer-pods dependency on
kata-deploy) in order to simplify things for now, as kata-remote is not
a deployed shim by default.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
be05e1370c kata-deploy: Allow setting the default runtime class name
As Kata Containers can be consumed by other helm-charts, hard coding the
default runtime class name to `kata` is not optimal.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:14:53 +01:00
Fabiano Fidêncio
820e6d6351 kata-deploy: Add more per-arch options
All the options that take a specific shim as an argument MUST have
specific per arch settings, as not all the shims are available for all
the arches, leading to issues when setting up multi-arch deployments.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:14:53 +01:00
Zvonko Kaiser
94abe4fc00 osbuilder: nvrc: Consume NVRC release instead of building it
Let's ensure that we consume NVRC releases straight from GitHub instead
of building the binaries ourselves.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 12:10:20 +01:00
Zvonko Kaiser
69c76971f3 gpu: Handle VFIO and IOMMUFD
We have here either /dev/vfio/<num> or /dev/vfio/devices/vfio<num>.
For the IOMMUFD format, /dev/vfio/devices/vfio<num>, strip the "vfio" prefix:

/dev/vfio/123 - basename "123" - vfioNum = "123" - cdi.k8s.io/vfio123
/dev/vfio/devices/vfio123 - basename "vfio123" - strip - vfioNum = "123" - cdi.k8s.io/vfio123

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-31 09:46:07 +01:00
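
The path handling described in the commit above can be sketched as follows. This is a minimal illustration in Python, not the project's actual implementation: the function name `vfio_cdi_name` is hypothetical, and only the two path shapes and the resulting `cdi.k8s.io/vfio<num>` name come from the commit message.

```python
import posixpath


def vfio_cdi_name(dev_path: str) -> str:
    """Map a VFIO device node path to a cdi.k8s.io/vfio<num> name.

    Hypothetical helper: only the path formats and the resulting name
    are taken from the commit message; the rest is an assumption.
    """
    base = posixpath.basename(dev_path)
    # Legacy VFIO group node:  /dev/vfio/123             -> basename "123"
    # IOMMUFD cdev node:       /dev/vfio/devices/vfio123 -> basename "vfio123"
    if base.startswith("vfio"):
        vfio_num = base[len("vfio"):]  # strip the "vfio" prefix
    else:
        vfio_num = base
    return f"cdi.k8s.io/vfio{vfio_num}"


if __name__ == "__main__":
    for path in ("/dev/vfio/123", "/dev/vfio/devices/vfio123"):
        print(path, "->", vfio_cdi_name(path))
```

Both path shapes resolve to the same name, which is the point of stripping the prefix: callers do not need to know whether the host exposes the device through the legacy group node or the IOMMUFD character device.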
Saul Paredes
26396881cf webhook: allow privileged containers
This allows us to test privileged containers when using the webhook.
We can do this because kata-deploy sets privileged_without_host_devices = true for kata runtime by default.

Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
2025-10-30 14:59:26 -07:00
Fabiano Fidêncio
e30e2b5f45 tests: k8s: Remove tests running on GitHub provided runner
We have 2 tests running on GitHub provided runners:
* devmapper
* CRI-O

- devmapper situation

For devmapper, we're currently testing devmapper with s390x as part of
one of its jobs.

More than that, this test has been failing here due to a lack of space
in the machine for quite some time, and no action was taken to bring it
back either via GARM or some other way.

With that said, let's rely on the s390x CI to test devmapper and avoid
one extra failure on our CI by removing this one.

- cri-o situation

CRI-O is being tested with a fixed version of kubernetes that's already
reached its EOL, and a CRI-O version that matches that k8s version.

There have been attempts to raise issues, and also to provide a PR that
does at least part of the work ... leaving the debugging part for the
maintainers of the CI. However, there was no action on those from the
maintainers.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-30 11:46:59 +01:00
Alex Lyn
fa521220a9 Merge pull request #11816 from jiuyi123/rs-vm-template-kata-ctl-merge
kata-ctl: add factory subcommands for VM template management
2025-10-30 18:21:12 +08:00
ssc
551caad4b1 docs: add guide on VM templating usage in runtime-rs
- Explained the concept and benefits of VM templating
- Provided step-by-step instructions for enabling VM templating
- Detailed the setup for using snapshotter in place of VirtioFS for template-based VM creation
- Added performance test results comparing template-based and direct VM creation

Signed-off-by: ssc <741026400@qq.com>
2025-10-30 15:18:31 +08:00
ssc
5a586e13a1 kata-ctl: add factory subcommands for VM template management
- init: initialize the VM template factory
- status: check the current factory status
- destroy: clean up and remove factory resources
These commands provide basic lifecycle management for VM templates.

Signed-off-by: ssc <741026400@qq.com>
2025-10-30 10:27:17 +08:00
RuoqingHe
8878c46e8f Merge pull request #11867 from spectator333/update-rust-vmm-deps
dragonball: Bump kvm-ioctls to fix security issue
2025-10-30 00:17:29 +08:00
Siyu Tao
dd444d23b3 dragonball: Bump kvm-ioctls to fix security issue
Use `ioctl_with_mut_ref` instead of `ioctl_with_ref` in the
`create_device` method, as it needs to write to the `kvm_create_device`
struct passed to it. This fix was released in v0.12.1.

Signed-off-by: Siyu Tao <taosiyu2024@163.com>
2025-10-29 14:03:29 +00:00
Steve Horsman
0e19a2bf91 Merge pull request #11993 from zvonkok/vectorAdd
gpu: Add libs for CC
2025-10-29 13:42:34 +00:00
stevenhorsman
555926ea1a libs: Fix formatting issue
Fix the cargo fmt issues and then we can make the libs tests required
again to avoid this regression happening again.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-10-29 13:13:50 +01:00
Steve Horsman
dbdd1009af Merge pull request #11933 from kata-containers/topic/kata-deploy-nfd-dependency-part-I
kata-deploy: Automatically deploy NodeFeatureRules for TEEs
2025-10-29 09:50:38 +00:00
Fabiano Fidêncio
103f80c7f5 readme: install: Drop outdated documentation
kata-deploy helm chart is *THE* way to deploy kata-containers on
kubernetes environments, and kubernetes environments are basically the
only reliably tested deployments we have.

For now, let's just drop documentation that is outdated / incorrect, and
in the future let's ensure we update the linked docs, as we work on
update / upgrade for the helm chart.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-29 09:41:57 +01:00
Zvonko Kaiser
5ff218823c gpu: Remove unneeded libraries
The libs in question were added when moving to developer.nvidia.com
but after switching back to Ubuntu-only builds they are no longer needed.
Remove them to keep the rootfs as minimal as possible.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-29 08:03:36 +01:00
Zvonko Kaiser
6d9b4059f5 gpu: Add libs for CC
In the case of CC we need additional libraries in the rootfs.
Add them conditionally if type == confidential.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-29 08:03:36 +01:00
Xuewei Niu
55d181beb1 Merge pull request #11828 from jiuyi123/rs-vm-template-runtime-rs
runtime-rs: introduce VM template lifecycle and integration
2025-10-29 14:03:46 +08:00
Xuewei Niu
8aca32dfa9 Merge pull request #11862 from StevenFryto/rootless_clh
runtime-rs: supporting the CLH VMM process running in non-root mode
2025-10-29 13:31:53 +08:00
ssc
16e8cf1a09 runtime-rs: boot vm from template
Add build_vm_from_template() that flips boot_from_template flag,
wires factory.template_path/{memory,state} into the hypervisor config,
and returns ready-to-use hypervisor & agent instances.
When factory.template is enabled, VirtContainer bypasses normal creation
and directly boots the VM by restoring the template through incoming migration,
completing the "create → save → clone" loop.

Fixes: #11413

Signed-off-by: ssc <741026400@qq.com>
2025-10-29 12:38:28 +08:00
ssc
550615285c runtime-rs: add factory, template and vm modules for VM template lifecycle
Introduced factory::FactoryConfig with init/destroy/status commands to manage template pools.
Added template::Template to fetch, create and persist base VMs.
Introduced vm::{VM, VMConfig} exposing create, pause, save, resume, stop,
disconnect and migration helpers for sandbox integration.
Extended QemuInner to execute QMP incoming migration, pause/resume and status tracking.

Fixes: #11413

Signed-off-by: ssc <741026400@qq.com>
2025-10-29 12:38:28 +08:00
ssc
135c84b6cb kata-types: add VM template and factory configuration
Added new fields in Hypervisor struct to support VM template creation,
template boot, memory and device state paths, shared path, and store
paths. Introduced a Factory struct in config to manage template path,
cache endpoint, cache number, and template enable flag. Integrated
Factory into TomlConfig for runtime configuration parsing.

Fixes: #11413
Signed-off-by: ssc <741026400@qq.com>
2025-10-29 11:49:08 +08:00
stevenfryto
2ceadc5fa3 runtime-rs: supporting the CLH VMM process running in non-root mode
This change enables running the Cloud Hypervisor VMM as a non-root user
when the rootless flag is set to true in the configuration.

Fixes: #11414

Signed-off-by: stevenfryto <sunzitai_1832@bupt.edu.cn>
2025-10-29 01:55:10 +00:00
stevenfryto
2ddbae3aa6 runtime-rs: pass the tuntap fds down to Cloud Hypervisor
Pass the file descriptors of the tuntap device to the Cloud Hypervisor VMM process
so that the process can open the device without cap_net_admin.

Signed-off-by: stevenfryto <sunzitai_1832@bupt.edu.cn>
2025-10-29 01:55:10 +00:00
Fabiano Fidêncio
59883a2d99 actions: Remove unused USING_NFD
There's no reason to keep the env var / input as it's never been used
and now kata-deploy detects automatically whether NFD is deployed or
not.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-28 21:24:27 +01:00
Fabiano Fidêncio
f9825b4e6e kata-deploy: Automatically deploy NodeFeatureRules for TEEs
When the NodeFeatureRule CRD is detected kata-deploy will:
* Create the specific NodeFeatureRules for the x86_64 TEEs
* Adapt the TEEs runtime classes to take into account the number of keys
  available in the system when spawning the podsandbox.

Note, we still do not have NFD as a sub-dependency of the helm chart, and
I'm not even sure we ever will. However, it's important to integrate
better with the scenarios where the NFD is already present.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-28 21:24:27 +01:00
Manuel Huber
8dc78057d6 ci: Refactor NVIDIA NIM test
Change NIM bats file logic to allow skipping test cases which
require multiple GPUs. This can be helpful for test clusters where
there is only one node with a single GPU, or for local test
environments with a single-node cluster with a single GPU.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-10-28 19:12:16 +01:00
Manuel Huber
be32b77baf ci: Add NVIDIA CUDA vectoradd test
This change adds a CUDA vectoradd test case and makes enabling NVRC
tracing optional and idempotent.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-10-28 19:12:16 +01:00
Fabiano Fidêncio
a164693e1a release: Bump version to 3.22.0
Bump VERSION and helm-chart versions

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
3.22.0
2025-10-28 16:28:18 +01:00
Steve Horsman
1b46cf43c4 Merge pull request #11989 from Amulyam24/actionpz-ppc64le
revert: Enable new ibm runners for ppc64le
2025-10-28 12:09:03 +00:00
Amulyam24
c603094584 revert: Enable new ibm runners for ppc64le
Temporarily disables the new runners for building artifacts jobs. Will be re-enabled once they are stable.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2025-10-28 17:09:26 +05:30
Hyounggyu Choi
7d2fe5e187 revert: Enable new ibm runners for s390x
This partially reverts 8dcd91c for s390x because the
CI jobs are currently blocking the release. The new runners
will be re-introduced once they are stable and no longer
impact critical paths.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-10-28 11:11:51 +01:00
Fabiano Fidêncio
754e832cfa kata-deploy: Allow passing shims / defaultShim per arch
This allows us to do a full multi-arch deployment, as the user can
easily select which shim can be deployed per arch, as some of the VMMs
are not supported on all architectures, which would lead to a broken
installation.

Now, by passing shims per arch, we can easily have a heterogeneous
deployment where, for instance, we can set qemu-se-runtime-rs for s390x,
qemu-cca for aarch64, and qemu-snp / qemu-tdx for x86_64 and call all of
those a default kata-confidential ... and have everything working with
the same deployment.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-27 22:42:37 +01:00
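
The per-arch selection described in the commit above can be pictured with a small lookup. This is only an illustrative sketch in Python: the names `SHIMS_BY_ARCH` and `shims_for` are hypothetical and are not the helm chart's actual values or API. Only the architecture-to-shim pairings come from the commit message.

```python
# Hypothetical mapping; the pairings mirror the example in the commit message.
SHIMS_BY_ARCH = {
    "s390x": ["qemu-se-runtime-rs"],
    "aarch64": ["qemu-cca"],
    "x86_64": ["qemu-snp", "qemu-tdx"],
}


def shims_for(arch: str) -> list[str]:
    """Return the shims configured for one architecture.

    Raises ValueError for an architecture with no entry, so a shim is
    never silently installed on a node that cannot run it.
    """
    try:
        return SHIMS_BY_ARCH[arch]
    except KeyError:
        raise ValueError(f"no shims configured for architecture {arch!r}") from None


if __name__ == "__main__":
    for arch in ("s390x", "aarch64", "x86_64"):
        print(arch, "->", ", ".join(shims_for(arch)))
```

Keeping the mapping keyed by architecture is what lets one deployment cover a mixed cluster: each node only ever sees the shims listed for its own architecture.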
Greg Kurz
ffdc80733a Merge pull request #11966 from zvonkok/gpu-cc-fix
gpu: rootfs fixes
2025-10-27 10:18:13 +01:00
Alex Lyn
418d5f724e Merge pull request #11971 from lifupan/fupan_blk_ratelimit
runtime-rs: Support disk rate limiter for dragonball
2025-10-27 17:12:47 +08:00
Alex Lyn
f86ac595a8 Merge pull request #11973 from Apokleos/enhance-oci-spec
runtime-rs: Enhancements for items within OCI Spec
2025-10-27 16:15:00 +08:00
Alex Lyn
690dad5528 runtime-rs: Ensure complete cleanup of stale Device Cgroups
The previous procedure failed to reliably ensure that all unused Device
Cgroups were completely removed, a failure consistently verified by CI
tests.

This change introduces a more robust and thorough cleanup mechanism. The
goal is to prevent previous issues—likely stemming from improper use of
Rust mutable references—that caused the modifications to be ineffective
or incomplete.

This ensures a clean environment and reliable CI test execution.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-10-27 12:47:48 +08:00
Alex Lyn
25ab615da5 Merge pull request #11913 from Apokleos/dedicated-error-rs
CI: Add dedicated expected error message for runtime-rs
2025-10-27 10:47:07 +08:00
Zvonko Kaiser
39848e0983 gpu: rootfs fixes
Build only from Ubuntu repositories; do not mix with developer.nvidia.com.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Update tools/osbuilder/rootfs-builder/nvidia/nvidia_chroot.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-26 19:36:55 +01:00
stevenhorsman
aec0ceb860 gatekeeper: Update mariner tests name
In https://github.com/kata-containers/kata-containers/pull/11972
the auto-generate-policy: yes matrix parameter was removed
which updates the name of the test, so sync this change
in required-tests.yaml

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-10-25 17:51:31 +02:00
Kevin Zhao
e2dbe87a99 tests: Fix cca test failure on arm64 and other architectures
Fix the wrong test with appendProtectionDevice on arm64

Signed-off-by: Kevin Zhao <kevin.zhao@linaro.org>
2025-10-25 13:54:35 +02:00
dependabot[bot]
99ae3607dc build(deps): bump astral-tokio-tar in /src/tools/agent-ctl
Bumps [astral-tokio-tar](https://github.com/astral-sh/tokio-tar) from 0.5.5 to 0.5.6.
- [Release notes](https://github.com/astral-sh/tokio-tar/releases)
- [Changelog](https://github.com/astral-sh/tokio-tar/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/tokio-tar/compare/v0.5.5...v0.5.6)

---
updated-dependencies:
- dependency-name: astral-tokio-tar
  dependency-version: 0.5.6
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-10-25 13:53:24 +02:00
Dan Mihai
61ee4d7f8b Merge pull request #11951 from burgerdev/watchable
genpolicy: allow non-watchable ConfigMaps
2025-10-24 08:38:55 -07:00
Steve Horsman
ac601ecd45 Merge pull request #11964 from Amulyam24/k8s-ppc64le
github: migrate k8s job to a different runner on ppc64le
2025-10-24 15:55:59 +01:00
Dan Mihai
ac3ea973ee Merge pull request #11958 from microsoft/danmihai1/policy-tests-upstream5
tests: k8s: auto-generate policy for additional tests
2025-10-24 07:18:00 -07:00
Amulyam24
9876cbffd6 github: migrate k8s job to a different runner on ppc64le
Migrate the k8s job to a different runner and use a long running cluster
instead of creating the cluster on every run.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2025-10-24 18:20:11 +05:30