Commit Graph

18349 Commits

Author SHA1 Message Date
Anastassios Nanos
96ff12fdab runtime-rs: Split FC spawn sequence
To account for network hooks and the lack of device hotplugging on FC, we split the
spawn process into start and boot. This way, the VMM is already up, but the VM sandbox
is not -- we can add devices and then we can kick-off the VM with InstanceStart.

Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
2026-04-03 09:33:07 +00:00
Anastassios Nanos
91af99b7a5 runtime-rs: Fix FC spawn with network
Refactor network link detection so that FC can boot properly.

Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
2026-04-03 09:33:07 +00:00
RuoqingHe
26bd5ad754 Merge pull request #12762 from YutingNie/fix-runtime-rs-shared-fs-typo
runtime-rs: Fix typo in share_fs error message
2026-04-03 15:24:33 +08:00
Yuting Nie
517882f93d runtime-rs: Fix typo in share_fs error message
There's a typo in the error message which gets prompted when an
unsupported share_fs was configured. Fixed shred -> shared.

Signed-off-by: Yuting Nie <yuting.nie@spacemit.com>
2026-04-03 05:23:46 +00:00
Alex Lyn
4a1c2b6620 Merge pull request #12309 from kata-containers/stale-issues-by-date
workflows: Create workflow to stale issues based on date
2026-04-03 09:31:34 +08:00
Fabiano Fidêncio
09194d71bb Merge pull request #12767 from nubificus/fix/fc-rs
runtime-rs: Fix FC API fields
2026-04-02 18:24:35 +02:00
Manuel Huber
dd868dee6d tests: nvidia: onboard NIM service test
Onboard a test case for deploying a NIM service using the NIM
operator. We install the operator helm chart on the fly as this is
a fast operation, spinning up a single operand. Once a NIM service
is scheduled, the operator creates a deployment with a single pod.

For now, the TEE-based flow uses an allow-all policy. In future
work, we strive to support generating pod security policies for the
scenario where NIM services are deployed and the pod manifest is
being generated on the fly.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-02 16:58:54 +02:00
Steve Horsman
58101a2166 Merge pull request #12656 from stevenhorsman/actions/checkout-bump
workflows: Update actions/checkout version
2026-04-01 17:34:39 +01:00
Fabiano Fidêncio
75df4c0bd3 Merge pull request #12766 from fidencio/topic/kata-deploy-avoid-kata-pods-to-crash-after-containerd-restart
kata-deploy: Fix kata-deploy pods crashing if containerd restarts
2026-04-01 18:28:16 +02:00
Steve Horsman
2830c4f080 Merge pull request #12746 from ldoktor/ci-helm2
ci.ocp: Use helm deployment for peer-pods
2026-04-01 17:13:21 +01:00
Lukáš Doktor
55a3772032 ci.ocp: Add note about external tests to README.md
to run all the tests that are running in CI we need to enable external
tests. This can be a bit tricky so add it into our documentation.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-04-01 16:59:33 +01:00
Lukáš Doktor
3bc460fd82 ci.ocp: Use helm deployment for peer-pods
replace the deprecated CAA deployment with helm one. Note that this also
installs the CAA mutating webhook, which wasn't installed before.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-04-01 16:59:33 +01:00
Anastassios Nanos
02c82b174a runtime-rs: Fix FC API fields
A FC update caused bad requests for the runtime-rs runtime when
specifying the vcpu count and block rate limiter fields.

Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
2026-04-01 14:50:51 +00:00
Fabiano Fidêncio
2131147360 tests: add kata-deploy lifecycle tests for restart resilience and cleanup
Add functional tests that cover two previously untested kata-deploy
behaviors:

1. Restart resilience (regression test for #12761): deploys a
   long-running kata pod, triggers a kata-deploy DaemonSet restart via
   rollout restart, and verifies the kata pod survives with the same
   UID and zero additional container restarts.

2. Artifact cleanup: after helm uninstall, verifies that RuntimeClasses
   are removed, the kata-runtime node label is cleared, /opt/kata is
   gone from the host filesystem, and containerd remains healthy.

3. Artifact presence: after install, verifies /opt/kata and the shim
   binary exist on the host, RuntimeClasses are created, and the node
   is labeled.

Host filesystem checks use a short-lived privileged pod with a
hostPath mount to inspect the node directly.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 15:20:53 +02:00
Fabiano Fidêncio
b4b62417ed kata-deploy: skip cleanup on pod restart to avoid crashing kata pods
When a kata-deploy DaemonSet pod is restarted (e.g. due to a label
change or rolling update), the SIGTERM handler runs cleanup which
unconditionally removes kata artifacts and restarts containerd. This
causes containerd to lose the kata shim binary, crashing all running
kata pods on the node.

Fix this by implementing a three-stage cleanup decision:

1. If this pod's owning DaemonSet still exists (exact name match via
   DAEMONSET_NAME env var), this is a pod restart — skip all cleanup.
   The replacement pod will re-run install, which is idempotent.

2. If this DaemonSet is gone but other kata-deploy DaemonSets still
   exist (multi-install scenario), perform instance-specific cleanup
   only (snapshotters, CRI config, artifacts) but skip shared
   resources (node label removal, CRI restart) to avoid disrupting
   the other instances.

3. If no kata-deploy DaemonSets remain, perform full cleanup including
   node label removal and CRI restart.

The Helm chart injects a DAEMONSET_NAME environment variable with the
exact DaemonSet name (including any multi-install suffix), ensuring
instance-aware lookup rather than broadly matching any DaemonSet
containing "kata-deploy".

Fixes: #12761

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 15:20:52 +02:00
Fabiano Fidêncio
28414a614e kata-deploy: detect k3s/rke2 via systemd services instead of version string
Newer k3s releases (v1.34+) no longer include "k3s" in the containerd
version string at all (e.g. "containerd://2.2.2-bd1.34" instead of
"containerd://2.1.5-k3s1"). This caused kata-deploy to fall through to
the default "containerd" runtime, configuring and restarting the system
containerd service instead of k3s's embedded containerd — leaving the
kata runtime invisible to k3s.

Fix by detecting k3s/rke2 via their systemd service names (k3s,
k3s-agent, rke2-server, rke2-agent) rather than parsing the containerd
version string. This is more robust and works regardless of how k3s
formats its containerd version.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 14:24:55 +02:00
Fabiano Fidêncio
8b9ce3b6cb tests: remove k3s/rke2 V3 containerd template workaround
Remove the workaround that wrote a synthetic containerd V3 config
template for k3s/rke2 in CI. This was added to test kata-deploy's
drop-in support before the upstream k3s/rke2 patch shipped. Now that
k3s and rke2 include the drop-in imports in their default template,
the workaround is no longer needed and breaks newer versions.

Removed:
- tests/containerd-config-v3.tmpl (synthetic Go template)
- _setup_containerd_v3_template_if_needed() and its k3s/rke2 wrappers
- Calls from deploy_k3s() and deploy_rke2()

This reverts the test infrastructure part of a2216ec05.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 14:24:55 +02:00
Steve Horsman
9a3f6b075e Merge pull request #12753 from stevenhorsman/remove-agent-Cargo-
agent: Remove Cargo.lock
2026-04-01 13:22:57 +01:00
Steve Horsman
0d38d88b07 Merge pull request #12484 from Amulyam24/runtime-rs-ppc64le
runtime-rs: add QEMU support for ppc64le
2026-04-01 12:54:40 +01:00
Fabiano Fidêncio
6555350625 Merge pull request #12765 from fidencio/topic/kata-deploy-nydus-fix-systemd-unit
kata-deploy: Make nydus a soft dep of containerd
2026-04-01 13:40:42 +02:00
Fabiano Fidêncio
b823184cf7 Merge pull request #12580 from manuelh-dev/mahuber/gpu-ci-storage
tests: gpu: use the container image layer storage feature
2026-04-01 12:23:56 +02:00
Fabiano Fidêncio
fe1f804543 kata-deploy: Restart nydus-snapshotter in case of failure
Let's ensure that in case nydus-snapshotter crashes for one reason or
another, the service is restarted.

This follows containerd approach, and avoids manual intervention in the
node.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 11:00:21 +02:00
Fabiano Fidêncio
789abe6fdf kata-deploy: Make nydus a soft dep of containerd
Let's relax our RequiredBy and use a WantedBy in the nydus systemd unit
file as, in case of a nydus crash, containerd would also be put down,
causing the node to become NotReady.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-04-01 10:52:29 +02:00
Amulyam24
bf74f683d7 runtime-rs: align memory size with desired block size on ppc64le
couldn't initialise QMP: Connection reset by peer (os error 104)
Caused by:
    Connection reset by peer (os error 104)

qemu stderr: "qemu-system-ppc64: Maximum memory size 0x80000000 is not aligned to 256 MiB”

When the default max memory was assigned according to the
available host memory, it failed with the above error

Align the memory values with the block size of 256 MB on ppc64le.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2026-04-01 09:36:45 +01:00
Amulyam24
dcb7d025c7 runtime-rs: Use libc::TUNSETIFF instead of wrapper TUNSETIFF()
While attaching the tap device, it fails on ppc64le with EBADF

"cannot create tap device. File descriptor in bad state (os error 77)\"): unknown”

Refactor the ioctl call to use the standard libc::TUNSETIFF constant.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2026-04-01 09:36:45 +01:00
Amulyam24
8d25ff2c36 runtime-rs: implement set_capabilities for qemu
After the qemu VM is booted, while storing the guest details,
it fails to set capabilities as it is not yet implemented
for QEMU, this change adds a default implementation for it.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2026-04-01 09:36:45 +01:00
Amulyam24
778524467b runtime-rs: enable building runtime-rs on ppc64le
Adds changes in Makefile to build runtime-rs on ppc64le
with QEMU support.

Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
2026-04-01 09:36:45 +01:00
Manuel Huber
177f5c308e tests: gpu: use container image layer storage
Use the container image layer storage feature for the
k8s-nvidia-nim.bats test pod manifests. This reduces the pods'
memory requirements.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:22:26 +02:00
Manuel Huber
b6cf00a374 tests: parametrize storage parameters
- trusted-storage.yaml.in: use $PV_STORAGE_CAPACITY and
  $PVC_STORAGE_REQUEST so that PV/PVC size can vary per test.
- confidential_common.sh: add optional size (MB) argument to
  create_loop_device.
- k8s-guest-pull-image.bats: pass PV_STORAGE_CAPACITY and
  PVC_STORAGE_REQUEST when generating storage config.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-04-01 10:22:26 +02:00
Fabiano Fidêncio
f756966a8e Merge pull request #12757 from BbolroC/fix-parsing-imcompatibility-containerd-config
tests: Configure devmapper properly regardless of containerd version
2026-04-01 10:21:50 +02:00
stevenhorsman
5390e470d3 agent: Remove Cargo.lock
Following on from #12690, the agent is part of the repo workspace,
so no longer needs a lock file.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-04-01 09:11:28 +01:00
Hyounggyu Choi
11cd5f2808 tests: Configure devmapper properly regardless of containerd version
The follow differences are observed between container 1.x and 2.x:

```
[plugins.'io.containerd.snapshotter.v1.devmapper']
snapshotter = 'overlayfs'
```
and
```
[plugins."io.containerd.snapshotter.v1.devmapper"]
snapshotter = "overlayfs"
```

The current devmapper configuration only works with double quotes.
Make it work with both single and double quotes via tomlq.

In the default configuration for containerd 2.x, the following
configuration block is missing:

```
[[plugins.'io.containerd.transfer.v1.local'.unpack_config]]
  platform = "linux/s390x"     # system architecture
  snapshotter = "devmapper"
```

Ensure the configuration block is added for containerd 2.x.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-04-01 07:14:52 +02:00
Fabiano Fidêncio
7e371ac325 Merge pull request #12758 from fidencio/topic/kata-deploy-skip-k3s-tests
gatekeeper: unrequire kata-deploy k3s tests
2026-04-01 07:13:51 +02:00
Fabiano Fidêncio
3a1683ccdc gatekeeper: unrequire kata-deploy k3s tests
Those are breaking, and I need time to investigate why.
For now, unrequire those tests.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-31 18:32:17 +02:00
Steve Horsman
468abea97a Merge pull request #12719 from kata-containers/sprt/env-no-deploy
gha: Avoid noisy deployment logs in PRs
2026-03-31 17:12:07 +01:00
Aurélien Bombo
78289d19f7 gha: Pin actionlint version
Pin to the latest released version as a security measure.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-31 10:51:17 -05:00
Aurélien Bombo
3122fa651e gha: Avoid noisy deployment logs in PRs
GitHub recently announced that developers can now use environments without
auto-deployment, which allows us to avoid the noisy deployment logs in our
PRs:

https://github.blog/changelog/2026-03-19-github-actions-late-march-2026-updates/#github-actions-now-allows-developers-to-use-environments-without-auto-deployment

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-31 10:51:13 -05:00
stevenhorsman
99eaa8fcb1 workflows: Create workflow to stale issues based on date
The standard stale/action is intended to be run regularly with
a date offset, but we want to have one we can run against a specific
date in order to run the stale bot against issues created since a particular
release milestone, so calculate the offset in one step and use it in the next.

At the moment we want to run this to stale issues before 9th October 2022 when Kata 3.0 was release, so default to this.

Note the stale action only processes a few issues at a time to avoid rate limiting, so why we want a cron job to it can get through
the backlog, but also to stale/unstale issues that are commented on.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-31 15:57:37 +01:00
RuoqingHe
a0e99a86cf Merge pull request #12690 from Jiahao1226/put-agent-into-root-workspace
build: Move agent to root workspace
2026-03-30 18:07:28 +08:00
stevenhorsman
12578b41f2 govmm: Delete old files
The govmm workflow isn't run by us and it and the other CI files
are just legacy from when it was a separate repo, so let's clean up
this debt rather than having to update it frequently.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-30 10:45:28 +01:00
stevenhorsman
b3179bdd8e workflows: Update actions/checkout version
Update the action to resolve the following warning in GHA:
> Node.js 20 actions are deprecated. The following actions are running
> on Node.js 20 and may not work as expected:
> actions/checkout@11bd71901b.
> Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-30 10:45:28 +01:00
Steve Horsman
012bf4b333 Merge pull request #12635 from Apokleos/update-docs-rs
runtime-rs: Update docs for runtime-rs
2026-03-30 10:42:31 +01:00
Alex Lyn
7dce05b5fc docs: Update the pictures of kata 4.0 with mermaid codes
It becomes simple and flexible with mermaid codes to update
the pic or diagrams. And it also remove the legacy PNG pictures
to reduce the kata-statics release file size.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
3c584a474f docs: Update libs README with complete library documentation
Add all 9 library crates which are missing in workspace including:
(1) kata-types with annotations, hypervisor configs, and K8s utilities.
(2) kata-sys-util with all sub-modules: cpu, device, fs, hooks, k8s,
    mount, netns, numa, pcilibs, protection, spec, validate.
(3) protocols with ttrpc bindings: agent, health, remote, csi, oci,
    confidential_data_hub.
(4) runtime-spec with OCI container state types and namespace constants.
(5) shim-interface with RESTful API and Unix socket path.
(6) logging with slog framework features: JSON, journal, filtering.
(7) safe-path with security-focused path resolution utilities.
(8) mem-agent with memory management: memcg, compact, psi.
(9) test-utils with privilege and KVM test macros.
And one more thing, uniformly adopt TOCTOU in place of the redundant
TOCTTOU abbreviation.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
48ef2220e8 docs: Update runtime-rs README with accurate architecture documentation
Add comprehensive hypervisor support table (Dragonball, QEMU,
Cloud Hypervisor, Firecracker, Remote). Document all runtime handlers
(VirtContainer, LinuxContainer, WasmContainer) and resource types.

List all configuration files including CoCo variants (TDX, SNP, SE).
Add shim-ctl crate to crates table for development tooling reference.
Add Feature Flags section documenting dragonball and cloud-hypervisor
options.

Simplify and restructure content for clarity while preserving technical
accuracy.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
c96b2034dc docs: Update kata-types README with comprehensive module documentation
Add detailed module documentation table describing all available
modules including:
  - annotations
  - capabilities
  - config
  - container
  ...

Document configuration module features including TOML-based loading,
drop-in files, and hypervisor-specific configurations (QEMU,
Cloud Hypervisor, Firecracker, Dragonball, Remote).

Improve formatting with Markdown tables and structured sections for
better readability.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
b8576ef476 docs: Update kata-sys-util README with comprehensive feature documentation
Expand README.md to include detailed documentation for all modules:
- File system operations (fs)
- Mount operations (mount)
- CPU utilities (cpu)
- NUMA support (numa)
- Device management (device)
- Kubernetes support (k8s)
- Network namespace (netns)
- OCI specification utilities (spec)
- Validation (validate)
- Hooks (hooks)
- Guest protection (protection)
- Random generation (rand)
- PCI device management (pcilibs)

Add supported architectures list and improve overview section.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
a747b9f774 docs: Improve and refine hypervisor README documentation
Enhance documentation in the hypervisor README.md file with:

(1) Standardized terminology and formatting (VMM capitalization)
(2) Improved paragraph transitions and logical flow
(3) Fixed punctuation errors in code blocks

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
302b2c8d75 docs: Restructure and modernize virtualization design document
Comprehensive rewrite of docs/design/virtualization.md to improve
clarity, completeness, and usability.

This document now serves as the authoritative guide for
understanding and selecting hypervisors in Kata Containers deployments.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
7fa68ffd52 docs: Consolidate hypervisor documentation in virtualization.md
Add 'Choose a Hypervisor', 'Hypervisor Configuration Files', and
'Hypervisor Versions' sections to virtualization.md.

Key changes:
- Integrate hypervisor comparison table from hypervisors.md
- Add configuration file reference table for both go and rust runtimes
- Add current hypervisor versions from versions.yaml:
  - Cloud Hypervisor: v51.1
  - Firecracker: v1.12.1
  - QEMU: v10.2.1
  - StratoVirt: v2.3.0
  - Dragonball: builtin (part of rust runtime)
- Preserve original structure documenting each hypervisor's device model
  and features
- Add reference links for all hypervisors

This consolidates hypervisor selection guidance and version information
into a single comprehensive virtualization design document.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00