Compare commits

...

149 Commits

Author SHA1 Message Date
Manuel Huber
24fd3ef4d5 genpolicy: reproduce fsGroup error 2026-03-30 15:00:19 -07:00
RuoqingHe
a0e99a86cf Merge pull request #12690 from Jiahao1226/put-agent-into-root-workspace
build: Move agent to root workspace
2026-03-30 18:07:28 +08:00
Steve Horsman
012bf4b333 Merge pull request #12635 from Apokleos/update-docs-rs
runtime-rs: Update docs for runtime-rs
2026-03-30 10:42:31 +01:00
Alex Lyn
7dce05b5fc docs: Update the pictures of kata 4.0 with mermaid codes
It becomes simple and flexible with mermaid codes to update
the pic or diagrams. And it also remove the legacy PNG pictures
to reduce the kata-statics release file size.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
3c584a474f docs: Update libs README with complete library documentation
Add all 9 library crates which are missing in workspace including:
(1) kata-types with annotations, hypervisor configs, and K8s utilities.
(2) kata-sys-util with all sub-modules: cpu, device, fs, hooks, k8s,
    mount, netns, numa, pcilibs, protection, spec, validate.
(3) protocols with ttrpc bindings: agent, health, remote, csi, oci,
    confidential_data_hub.
(4) runtime-spec with OCI container state types and namespace constants.
(5) shim-interface with RESTful API and Unix socket path.
(6) logging with slog framework features: JSON, journal, filtering.
(7) safe-path with security-focused path resolution utilities.
(8) mem-agent with memory management: memcg, compact, psi.
(9) test-utils with privilege and KVM test macros.
And one more thing, uniformly adopt TOCTOU in place of the redundant
TOCTTOU abbreviation.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
48ef2220e8 docs: Update runtime-rs README with accurate architecture documentation
Add comprehensive hypervisor support table (Dragonball, QEMU,
Cloud Hypervisor, Firecracker, Remote). Document all runtime handlers
(VirtContainer, LinuxContainer, WasmContainer) and resource types.

List all configuration files including CoCo variants (TDX, SNP, SE).
Add shim-ctl crate to crates table for development tooling reference.
Add Feature Flags section documenting dragonball and cloud-hypervisor
options.

Simplify and restructure content for clarity while preserving technical
accuracy.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
c96b2034dc docs: Update kata-types README with comprehensive module documentation
Add detailed module documentation table describing all available
modules including:
  - annotations
  - capabilities
  - config
  - container
  ...

Document configuration module features including TOML-based loading,
drop-in files, and hypervisor-specific configurations (QEMU,
Cloud Hypervisor, Firecracker, Dragonball, Remote).

Improve formatting with Markdown tables and structured sections for
better readability.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
b8576ef476 docs: Update kata-sys-util README with comprehensive feature documentation
Expand README.md to include detailed documentation for all modules:
- File system operations (fs)
- Mount operations (mount)
- CPU utilities (cpu)
- NUMA support (numa)
- Device management (device)
- Kubernetes support (k8s)
- Network namespace (netns)
- OCI specification utilities (spec)
- Validation (validate)
- Hooks (hooks)
- Guest protection (protection)
- Random generation (rand)
- PCI device management (pcilibs)

Add supported architectures list and improve overview section.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
a747b9f774 docs: Improve and refine hypervisor README documentation
Enhance documentation in the hypervisor README.md file with:

(1) Standardized terminology and formatting (VMM capitalization)
(2) Improved paragraph transitions and logical flow
(3) Fixed punctuation errors in code blocks

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
302b2c8d75 docs: Restructure and modernize virtualization design document
Comprehensive rewrite of docs/design/virtualization.md to improve
clarity, completeness, and usability.

This document now serves as the authoritative guide for
understanding and selecting hypervisors in Kata Containers deployments.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
7fa68ffd52 docs: Consolidate hypervisor documentation in virtualization.md
Add 'Choose a Hypervisor', 'Hypervisor Configuration Files', and
'Hypervisor Versions' sections to virtualization.md.

Key changes:
- Integrate hypervisor comparison table from hypervisors.md
- Add configuration file reference table for both go and rust runtimes
- Add current hypervisor versions from versions.yaml:
  - Cloud Hypervisor: v51.1
  - Firecracker: v1.12.1
  - QEMU: v10.2.1
  - StratoVirt: v2.3.0
  - Dragonball: builtin (part of rust runtime)
- Preserve original structure documenting each hypervisor's device model
  and features
- Add reference links for all hypervisors

This consolidates hypervisor selection guidance and version information
into a single comprehensive virtualization design document.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
119a145923 docs: Upgrade architecture documentation from 3.0 to 4.0
Replace Kata 3.0 architecture docs with Kata 4.0 (Rust Runtime)
documentation.

Key changes:
- Remove deprecated architecture 3.0 documentation
- Add comprehensive Kata 4.0 architecture guide covering:
  - Unified single-binary architecture
  - Built-in Dragonball VMM integration
  - Async I/O model with Tokio
  - Layered architecture design
  - Modular resource manager
  - Extensible framework for multiple container types

The new documentation reflects the production-ready Rust runtime
with improved performance and reduced resource consumption.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
9f6bce9517 docs: Remove containerd settings from crio dedicated document
As the document is just for CRI-O, we need remove containerd related
settings from it and make it clear for users.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
b04260f926 docs: Rename run-kata-with-k8s with adding crio
As previous document of run-kata-with-k8s.md is not clear for new comers
to quickly find the way to run kata with k8s/crio. In this commit, it
just rename the document name and make it clear.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
26d41b8f6e docs: Remove the dedicated installation guide for runtime-rs
When runtime-rs becomes default runtime, everything just for runtime-rs
will be changed, and the dedicated installation for runtime-rs will
be deprecated.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
004333ed71 docs: Update containerd-kata.md with clear settings
In this commit:
(1) Update containerd config with kata configurations
(2) Add more comments to guide how to use containerd/kata with default
setting and customized configure setting;
(3) Update the usage of containerd cmd tool ctr with explicitly
specified runtime-config-path options to make it work.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
8dae67794a docs: switch to blockfile snapshotter for SEV-SNP in runtime-rs
Updated the configuration guide to use `shared_fs = "none"`. This
change reflects that `virtio-9p` is deprecated in `runtime-rs` and
recommends the blockfile snapshotter as a stable alternative to
the buggy `virtio-fs` in SEV-SNP QEMU versions.

But this's limited in the nerdctl or ctr tools.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
65b2a75aca runtime-rs: Fix typo USE_BUILDIN_DB with USE_BUILTIN_DB
Corrects the typo 'BUILDIN' to the standard 'BUILTIN' across the
codebase to improve code quality and documentation consistency.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
75ecfe3fe2 docs: Fix volume type and fs type
Correct the volume type with `volume-type` and fix the fs type
with `fstype`.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Alex Lyn
a923bb2917 docs: Add document for how-to-use passthroughfd-IO within runtime-rs
This document describes the Passthrough-FD (pass-fd) technology
implemented in Kata Containers to optimize IO performance. By bypassing
the intermediate proxy layers, this technology significantly reduces
latency and CPU overhead for container IO streams.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-03-29 19:17:03 +02:00
Jiahao Wang
1163b6581f agent: Change TARGET_PATH to root workspace
After agent was moved to root workspace, the products are now under the
repo root. Change the TARGET_PATH accordingly to tell Makefile where to
lookup output.

Signed-off-by: Jiahao Wang <jiahao.wang@lingcage.com>
2026-03-29 06:35:54 +00:00
Jiahao Wang
29e5d5d951 build: Move agent to root workspace
This commit adds kata agent to the root workspace, as a follow up work
of #12413.

Remove agent from exclude list, and make it as a member of root
workspace.

Signed-off-by: Jiahao Wang <jiahao.wang@lingcage.com>
2026-03-29 06:35:38 +00:00
Fabiano Fidêncio
0cf3243801 Merge pull request #12683 from PiotrProkop/logical-physical-block-size
runtime: allow specifying logical/physical sector size for block devices
2026-03-27 22:55:29 +01:00
PiotrProkop
64735222c6 runtime: allow specifying logical/physical sector size for block devices
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:

  block_device_logical_sector_size  (config file)
  block_device_physical_sector_size (config file)

  io.katacontainers.config.hypervisor.blk_logical_sector_size  (annotation)
  io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)

The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.

Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.

Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.

Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.

Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-03-27 18:56:54 +01:00
Aurélien Bombo
30e030e18e Merge pull request #12679 from microsoft/user/romoh/gpu-fix
clh: Add VFIO device cold-plug support
2026-03-27 11:12:51 -05:00
Hyounggyu Choi
8cebcf0113 Merge pull request #12742 from BbolroC/remove-skipped-emptydir-tests-for-ibm-sel
tests: Remove skip condition for emptyDir-related tests on IBM SEL
2026-03-27 14:35:48 +01:00
Fabiano Fidêncio
237729d728 Merge pull request #12739 from fidencio/topic/kata-deploy-nydus-use-a-different-namespace
kata-deploy: rename nydus-snapshotter to nydus-for-kata-tee
2026-03-27 14:32:58 +01:00
Fabiano Fidêncio
f0ad9f1709 tests: snp: policy: Adjust to containerd 2.3.0
As the AMD maintainers switched to the 2.3.0-beta.0 containerd (due to
the nydus fixes that landed there).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-27 11:14:54 +01:00
Fabiano Fidêncio
1b8189731a tests: hand nydus snapshotter setup over to kata-deploy
Now that kata-deploy deploys and manages nydus-for-kata-tee on all
platforms, the separate standalone nydus-snapshotter DaemonSet deployment
is no longer needed.

- Short-circuit deploy_nydus_snapshotter and cleanup_nydus_snapshotter
  to no-ops with an explanatory message.
- Add qemu-snp to the workaround case so AMD SEV-SNP baremetal runners
  also get USE_EXPERIMENTAL_SETUP_SNAPSHOTTER=true and kata-deploy picks
  up the snapshotter setup on every run.
- Drop the x86_64 arch guard and the hypervisor sub-case from the
  EXPERIMENTAL_SETUP_SNAPSHOTTER block, allowing any architecture and
  hypervisor to use the kata-deploy-managed path when the flag is set.

Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-27 11:14:54 +01:00
Fabiano Fidêncio
4fad88499c kata-deploy: rename nydus-snapshotter to nydus-for-kata-tee
Rename all host-visible names of the nydus-snapshotter instance managed
by kata-deploy from the generic "nydus-snapshotter" to "nydus-for-kata-tee".
This covers the systemd service name, the containerd proxy plugin key,
the runtime class snapshotter field, the data directory
(/var/lib/nydus-for-kata-tee), the socket path (/run/nydus-for-kata-tee/),
and the host install subdirectory.

The rename makes it immediately clear that this nydus-snapshotter instance
is the one deployed and managed by kata-deploy specifically for Kata TEE
use cases, rather than any general-purpose nydus-snapshotter that might
be present on the host.

Because the old code operated under a completely separate set of paths
(nydus-snapshotter.*), any previously deployed installation continues
to run without interference during the transition to this new naming.
CI pipelines and operators can upgrade kata-deploy on their own schedule
without having to coordinate an atomic cutover: the old service keeps
serving its existing workloads until it is explicitly replaced, and the
new deployment lands cleanly alongside it.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-27 11:14:54 +01:00
Fabiano Fidêncio
fb77c357f4 Merge pull request #12743 from BbolroC/enable-trusted-ephemeral-storage-ibm-sel
runtime: Set emptydir_mode to DEFEMPTYDIRMODE_COCO for IBM SEL
2026-03-27 09:48:28 +01:00
Hyounggyu Choi
de3afd3076 tests: Remove skip condition for s390x in trusted ephemeral storage test
Remove the skip condition for s390x in k8s-trusted-ephemeral-data-storage.bats.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-03-26 18:58:13 +01:00
Hyounggyu Choi
cd931d4905 runtime: Set emptydir_mode to DEFEMPTYDIRMODE_COCO for IBM SEL
The enablement of the trusted ephemeral storage for IBM SEL was
missed in #10559. Set the emptydir_mode properly for the TEE.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-03-26 15:55:30 +01:00
Hyounggyu Choi
911aee5ad7 tests: Remove skip condition for emptyDir-related tests on IBM SEL
Fixes: #10002

Since #11537 resolves the issue, remove the skip conditions for
the k8s e2e tests involving emptyDir volume mounts.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-03-26 15:39:33 +01:00
Roaa Sakr
858620d2e7 clh: Add VFIO device cold-plug support
Enable VFIO device pass-through at VM creation time on Cloud Hypervisor,
in addition to the existing hot-plug path.

Signed-off-by: Roaa Sakr <romoh@microsoft.com>
2026-03-25 16:39:25 -07:00
Steve Horsman
8c2b7ed619 Merge pull request #12729 from fidencio/topic/kata-deploy-nydus-dont-touch-data-dir-on-install
kata-deploy: nydus: never remove the data dir
2026-03-25 10:28:50 +00:00
Steve Horsman
af7fdd5cd1 Merge pull request #12725 from kata-containers/sprt/cargo-check-fix
build: Don't fail `cargo check` on a dirty tree
2026-03-25 10:21:16 +00:00
Steve Horsman
0d8186ae16 Merge pull request #12730 from fidencio/topic/bump-nydus-snapshotter
versions: Bump nydus-snapshotter to v0.15.13
2026-03-25 10:20:23 +00:00
Steve Horsman
7e0f5e533a Merge pull request #12733 from fidencio/topic/unrequire-nvidia-gpu-snp-tests-till-we-fix-auth-issues
gatekeeper: Unrequire NVIDIA GPU SNP tests till auth is fixed
2026-03-25 10:11:10 +00:00
Fabiano Fidêncio
bcfb2354e0 gatekeeper: Unrequire NVIDIA GPU SNP tests till auth is fixed
SSIA, the NIM tests are breaking due to authentication issues, and those
issues are blocking other PRs.

Let's unrequire the test for now, and mark it as required again once we
fixed the auth issues.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-25 10:23:53 +01:00
Fabiano Fidêncio
caf6b244e6 versions: Bump nydus-snapshotter to v0.15.13
As this brings in a fix for using images with too many layers.
https://github.com/containerd/nydus-snapshotter/releases/tag/v0.15.13

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-25 08:31:48 +01:00
Fabiano Fidêncio
fb5482f647 kata-deploy: nydus: never remove the data directory
Removing /var/lib/nydus-snapshotter during install or uninstall creates
a split-brain state: the nydus backend starts empty while containerd's
BoltDB (meta.db) still holds snapshot records from the previous run.
Any subsequent image pull then fails with:

  "unable to prepare extraction snapshot:
   target snapshot \"sha256:...\": already exists"

An earlier attempt cleaned up containerd's BoltDB via `ctr snapshots rm`
before wiping the directory, but that cleanup is inherently fragile:

- It requires the nydus gRPC service to be reachable at cleanup time.
  If the service is stopped, crashed, or not yet running, every `ctr`
  call silently fails and the stale records remain.
- Any workload still actively using a snapshot blocks the entire
  cleanup, making it impossible to guarantee a clean state.

The correct invariant is that meta.db and the nydus backend always
agree.  Preserving the data directory unconditionally guarantees this:

  - Fresh install: data directory does not exist, nydus starts empty.
  - Reinstall: existing snapshots and nydus.db are preserved, meta.db
    and backend remain in sync, new binary starts cleanly.
  - After uninstall: containerd is reconfigured without the nydus
    proxy_plugins entry and restarted, so the snapshot records in
    meta.db are completely dormant — nothing will use them.  If nydus
    is reinstalled later, the data directory is still present and both
    sides remain in sync, so no split-brain can occur.

Any stale snapshots from previous workloads are garbage-collected by
containerd once the images referencing them are removed.

This also removes the cleanup_containerd_nydus_snapshots,
cleanup_nydus_snapshots, and cleanup_nydus_containers helpers that
were introduced by the earlier (fragile) attempt.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-03-25 07:06:41 +01:00
Alex Lyn
46aa318b74 Merge pull request #12716 from lifupan/bump_dragonball_kernel
kernel: Bump the kernel to v6.18.15 for dragonball
2026-03-25 11:04:44 +08:00
Aurélien Bombo
ec9c57c595 Merge pull request #12467 from ldoktor/gk-output
tools.gatekeeper: Improve output
2026-03-24 17:03:55 -05:00
Fabiano Fidêncio
8950f1caeb Merge pull request #12706 from fidencio/topic/ci-tdx-nydus-snapshotter
tests: Use the helm chart to setup nydus for TDX
2026-03-24 22:37:38 +01:00
Fabiano Fidêncio
814ae53d77 tests: Use the helm chart to setup nydus for TDX
Now that containerd 2.3.0-beta.0 has been released, it brings fixes for
multi-snapshotters that allows us to test the baremetal machines in the
same way we test the non-baremetal ones.

Let's start doing the switch for TDX as timezone is friendlier with
Mikko.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-24 19:13:59 +01:00
Fabiano Fidêncio
27dfb0d06f Merge pull request #12724 from fidencio/topic/kata-deploy-properly-cleanup-nydus-snapshotter-on-uninstall
kata-deploy: nydus: clean containerd metadata before cleaning up the backend
2026-03-24 19:13:25 +01:00
Aurélien Bombo
7ae2282a99 build: Don't fail cargo check on a dirty tree
`cargo check` was introduced in 3f1533a to check that Cargo.lock is in sync
with Cargo.toml. However, if there are uncommitted changes in the working
tree, the current invocation will immediately fail because of the `git
diff` call, which is frustrating for local development.

As it turns out, `cargo clippy` is a superset of `cargo check`, so we can
simply pass `--locked` to `cargo clippy` to detect Cargo.lock issues.

This is tested with the following change:

diff --git a/src/agent/Cargo.lock b/src/agent/Cargo.lock
index 96b6c676d..e1963af00 100644
--- a/src/agent/Cargo.lock
+++ b/src/agent/Cargo.lock
@@ -4305,6 +4305,7 @@ checksum = "8f50febec83f5ee1df3015341d8bd429f2d1cc62bcba7ea2076759d315084683"
 name = "test-utils"
 version = "0.1.0"
 dependencies = [
- "libc",
  "nix 0.26.4",
 ]

which results in the following output:

$ make -C src/agent check
make: Entering directory '/kata-containers/src/agent'
standard rust check...
cargo fmt -- --check
cargo clippy --all-targets --all-features --release --locked \
        -- \
        -D warnings
error: the lock file /kata-containers/src/agent/Cargo.lock needs to be updated but --locked was passed to prevent this
If you want to try to generate the lock file without accessing the network, remove the --locked flag and use --offline instead.
make: *** [../../utils.mk:184: standard_rust_check] Error 101
make: Leaving directory '/kata-containers/src/agent'

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-24 11:22:14 -05:00
Fabiano Fidêncio
fd583d833b kata-deploy: nydus: clean containerd metadata before wiping backend
When /var/lib/nydus-snapshotter is removed, containerd's BoltDB
(meta.db at /var/lib/containerd/) still holds snapshot records for
the nydus snapshotter.  On the next install these stale records cause
image pulls to fail with:

  "unable to prepare extraction snapshot:
   target snapshot \"sha256:...\": already exists"

The failure path in core/unpack/unpacker.go:
1. sn.Prepare() → metadata layer finds the target chainID in BoltDB
   → returns AlreadyExists without touching the nydus backend.
2. sn.Stat()    → metadata layer finds the BoltDB record, then calls
   s.Snapshotter.Stat(bkey) on the nydus gRPC backend → NotFound
   (backend was wiped).
3. The unpacker treats NotFound as a transient key-collision race and
   retries 3 times; all 3 attempts hit the same dead end, and the
   pull is aborted.

The commit message of 62ad0814c ("nydus: Always start from a clean
state") assumed "containerd will re-pull/re-unpack when it finds non-
existent snapshots", but that is not what happens: the metadata layer
intercepts the Prepare call in BoltDB before the backend is ever
consulted.

Fix: call cleanup_containerd_nydus_snapshots() before stopping the
nydus service (and thus before wiping its data directory) in both
install_nydus_snapshotter and uninstall_nydus_snapshotter.

The cleanup must run while the service is still up because ctr
snapshots rm goes through the metadata layer which calls the nydus
gRPC backend to physically remove the snapshot; if the service is
already stopped the backend call fails and the BoltDB record remains.

The cleanup:
- Discovers all containerd namespaces via `ctr namespaces ls -q`
  (falls back to k8s.io if that fails).
- Removes containers whose Snapshotter field matches the nydus plugin
  name; these become dangling references once snapshots are gone and
  can confuse container reconciliation after an aborted CI run.
- Removes snapshots round by round (leaf-first) until either the list
  is empty or no progress can be made (see below).

Note: containerd's GC cannot substitute for this explicit cleanup.
The image record (a GC root) references content blobs which reference
the snapshots via gc.ref labels, keeping the entire chain alive in
the GC graph even after the nydus backend is wiped.

Snapshot removal rounds
-----------------------
Snapshot chains are linear: an image with N layers produces a chain
of N snapshots, each parented on the previous.  Only the current leaf
can be removed each round, so N layers require exactly N rounds.
There is no fixed round cap — the loop terminates when either the
list reaches zero (success) or a round removes nothing at all
(all remaining snapshots are actively in use by running workloads).

Active workload safety
----------------------
If active workloads still hold nydus snapshots (e.g. during a live
upgrade), no progress is made in a round and cleanup_nydus_snapshots
returns false.  Both install_nydus_snapshotter and
uninstall_nydus_snapshotter gate the fs::remove_dir_all on that
return value:

  - true  → proceed as before: stop service, wipe data dir.
  - false → stop service, skip data dir removal, log a warning.
            The new nydus instance starts on the existing backend
            state; running containers are left intact.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
2026-03-24 16:44:25 +01:00
Fabiano Fidêncio
eb4ce0e98b Merge pull request #12676 from manuelh-dev/mahuber/gpu-ci-data-storage
tests: gpu: use container data storage feature
2026-03-24 09:59:13 +01:00
Fupan Li
6a832dd1f3 kernel: Bump the kernel to v6.18.15 for dragonball
Bump the dragonball supported kernel to v6.18.15.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2026-03-24 06:46:43 +08:00
Manuel Huber
79efe3e041 tests: gpu: use container data storage feature
Use the container data storage feature for the k8s-nvidia-nim.bats
test pod manifests. This reduces the pods' memory requirements.
For this, enable the block-encrypted emptydir_mode for the NVIDIA
GPU TEE handlers.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-23 11:43:11 -07:00
Steve Horsman
2728b493d5 Merge pull request #12681 from manuelh-dev/mahuber/ci-pip-py-venv
tests: cc: setup function for python venv
2026-03-23 14:33:30 +00:00
Fabiano Fidêncio
1ec97d25e7 Merge pull request #12704 from stevenhorsman/security-fixes-23-mar-26
Security fixes 23 mar 26
2026-03-23 15:27:07 +01:00
Fabiano Fidêncio
aa6890eae1 Merge pull request #12675 from manuelh-dev/mahuber/cdh-storage-options
agent: add mkfs_opts parameter to cdh_secure_mount
2026-03-23 15:18:38 +01:00
Fabiano Fidêncio
fe817bb47b Merge pull request #12705 from fidencio/topic/tests-nginx-connectibity-2nd-try
tests: nginx-connectivity: Use `-O index.html` to override the downloaded file
2026-03-23 13:08:51 +01:00
Fabiano Fidêncio
514a2b1a7c Merge pull request #12264 from fidencio/topic/nvidia-gpu-cc-use-nydus-snapshotter
nvidia: cc: Use nydus-snapshotter
2026-03-23 12:50:15 +01:00
stevenhorsman
2edb588ed9 kata-ctl: Pin micro_http
the micro_http crate was just pointing the the main branch and hadn't been updated for
around 3 years, so pin to the latest for stability and update to remediate RUSTSEC-2024-0002

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:28 +00:00
stevenhorsman
9871256771 versions: Bump cloud-hypervisor to v51
In v51 the license was added, so try bumping to this version
to solve the cargo deny issue

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:28 +00:00
dependabot[bot]
8de7f29981 agent-ctl: Bump aws-lc-rs to 1.16.2
Bump aws-lc-rs, so that aws-lc-sys updates to 0.39.0 to remediate
RUSTSEC-2026-0044 and https://osv.dev/vulnerability/RUSTSEC-2026-0048

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:28 +00:00
dependabot[bot]
1c63738b80 build(deps): bump aws-lc-fips-sys in /src/tools/agent-ctl
Bumps [aws-lc-fips-sys](https://github.com/aws/aws-lc-rs) from 0.13.12 to 0.13.13.
- [Release notes](https://github.com/aws/aws-lc-rs/releases)
- [Commits](https://github.com/aws/aws-lc-rs/compare/aws-lc-fips-sys/v0.13.12...aws-lc-fips-sys/v0.13.13)

---
updated-dependencies:
- dependency-name: aws-lc-fips-sys
  dependency-version: 0.13.13
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 10:34:28 +00:00
dependabot[bot]
6e79a9d6ad build(deps): bump rustls-webpki in /src/tools/agent-ctl
Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.3 to 0.103.10.
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.3...v/0.103.10)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.10
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 10:34:27 +00:00
dependabot[bot]
8df9cf35df build(deps): bump rustls-webpki in /tools/packaging/kata-deploy/binary
Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.8 to 0.103.10.
- [Release notes](https://github.com/rustls/webpki/releases)
- [Commits](https://github.com/rustls/webpki/compare/v/0.103.8...v/0.103.10)

---
updated-dependencies:
- dependency-name: rustls-webpki
  dependency-version: 0.103.10
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 10:34:27 +00:00
dependabot[bot]
ef32923461 build(deps): bump tar from 0.4.44 to 0.4.45
Bumps [tar](https://github.com/alexcrichton/tar-rs) from 0.4.44 to 0.4.45.
- [Commits](https://github.com/alexcrichton/tar-rs/compare/0.4.44...0.4.45)

---
updated-dependencies:
- dependency-name: tar
  dependency-version: 0.4.45
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-23 10:34:27 +00:00
stevenhorsman
85e17c2e77 deps: Bump rustls-webpki
Bump rusttls-webpki to 0.103.10 to remediate RUSTSEC-2026-0049

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:27 +00:00
stevenhorsman
c3868f8e60 deps: Bump aws-lc-rs to 1.16.2
Bump aws-lc-rs, so that aws-lc-sys updates to 0.39.0 to remediate
RUSTSEC-2026-0044 and https://osv.dev/vulnerability/RUSTSEC-2026-0048

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:27 +00:00
stevenhorsman
27417d9d15 ci: Add more crates to dependabot groups
Add aws-lc, and rustls-webpki, so that in future the different
component bumps are all done together

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-23 10:34:27 +00:00
Fabiano Fidêncio
83f37f4beb tests: nginx-connectivity: Override index.html (2nd try)
We need to explicitly pass `-O index.html` as the busybox' wget has a
different behaviour than GNU's wget.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-23 11:11:44 +01:00
Fabiano Fidêncio
e44dfccf7a Revert "tests: nginx-connectivity: Allow overriding the downloded file"
This reverts commit 4403289123.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-23 11:06:23 +01:00
Hyounggyu Choi
1035504492 Merge pull request #12701 from fidencio/topic/tests-arm-nginx-connectivity
tests: nginx-connectivity: Allow overriding the downloded file
2026-03-23 10:37:25 +01:00
Steve Horsman
20cb65b1fb Merge pull request #12624 from lifupan/bump_rust_vmms
runtime-rs: Bump rust vmms for dragonball
2026-03-23 08:56:47 +00:00
Fabiano Fidêncio
864f181faf Merge pull request #12694 from manuelh-dev/mahuber/nv-test-timeout
tests: nvidia: Increase run test timeout
2026-03-23 09:13:20 +01:00
Fabiano Fidêncio
642b5661ff Merge pull request #12651 from manuelh-dev/mahuber/doc-update-nvidia-gpu-op
docs: Update NVIDIA GPU passthrough QEMU scenario
2026-03-23 09:01:02 +01:00
Fabiano Fidêncio
4403289123 tests: nginx-connectivity: Allow overriding the downloded file
In case a wget fails for one reason or another, it'll leave behind an
'index.html' file.  Let's make sure we allow overriding that file so the
retry loop doesn't fail for no reason.

Fixes: #12670

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-23 04:08:24 +01:00
Alex Lyn
d2c2ec6e23 Merge pull request #12633 from LandonTClipp/docs_materialx
docs: Move to mkdocs-material, port Helm to docs site
2026-03-23 09:29:25 +08:00
Fupan Li
608f378bff dragonball: make sure the nydus's worker thread access network
Since the dragonball's vmm thread had been joined in the pod's
netns, which wouldn't access the network, thus we should make
sure the nydus's worker thread join into the runD's main thread's
netns which would access the network.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2026-03-22 22:44:24 +08:00
Fabiano Fidêncio
f14895bdc4 Merge pull request #12673 from manuelh-dev/mahuber/release-doc-update
docs: Update release process notes
2026-03-22 13:11:03 +01:00
Fabiano Fidêncio
fd716c017d Merge pull request #12567 from agamdua/ebpf-confs
kernel: Add debug kernel with eBPF configs to static tarball builds
2026-03-22 13:06:54 +01:00
Fabiano Fidêncio
740d380b8e tests: nvidia: cc: Use nydus-snapshotter
So we can test what we just changed in the config files.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-22 10:10:34 +01:00
Fabiano Fidêncio
6194510e90 nvidia: cc: Use nydus-snapshotter
We've been using `experimental_force_guest_pull`, but now that we have a
containerd release that should work more reliably with the multi
snapshotter setup, we want to give it a try.

Note: We need containerd 2.2.2+.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-03-22 10:10:34 +01:00
Agam Dua
7e3fd74779 kernel: bump config version
With debug/ebpf updates in place, let's bump the kata config version.

Signed-off-by: Agam Dua <agam_dua@apple.com>
Co-authored-by: Eric Ernst <eric_ernst@apple.com>
2026-03-20 15:04:15 -07:00
Agam Dua
f6319da73d tests: Add eBPF and dwarves to spell check dictionary
Add missing terms to the spell check dictionary to fix CI failures
for kernel debug documentation:

- eBPF
- dwarves: Linux package with DWARF/BTF tools (pahole) required for
  CONFIG_DEBUG_INFO_BTF kernel option

Also fix the casing of "ebpf" to "eBPF" in the kernel README to match
the official naming convention.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 15:04:08 -07:00
Agam Dua
91d6c39f06 kernel: Fix debug build and add debug symbols to installation
Fixed a bug with the debug kernel build where common/ was repeated
after the common path variable, resulting in the debug
confs never being picked up.

This exposed a subsequent bug where the debug conf
was included in other builds, this is also fixed by creating a
separate directory for debug confs with one file at the moment,
debug.conf that contains debug configurations and bpf specific
configs.

To enable kernel builds (specifically for bpf) the dwarves package was added
to the kernel dockerfile for the pahole package.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 14:50:23 -07:00
Agam Dua
5ab0744c25 ci: Add pipeline for building and distributing the debug kernel
Add the debug kernel to the kata tarball alongside the other kernels.

Also update the kernel README documentation to describe the new debug
kernel build process.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 14:50:23 -07:00
Agam Dua
e905b74267 kernel: Add eBPF configs for debug builds
Adds a BPF section in the debug.conf kernel configuration options
 to enable eBPF and BTF support for debug kernel builds.

Signed-off-by: Agam Dua <agam_dua@apple.com>
2026-03-20 14:50:23 -07:00
LandonTClipp
5333e45313 docs: Fix static-checks.sh when running locally
This fixes the test_dir variable in static-checks.sh so that
when a --repo-path is provided, the test_dir variable uses that
for the location instead of the GOPATH location.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-03-20 14:51:45 -05:00
LandonTClipp
795869152d docs: Move to mkdocs-material, port Helm to docs site
This supersedes https://github.com/kata-containers/kata-containers/pull/12622.
I replaced Zensical with mkdocs-materialx. Materialx is a fork of mkdocs-material
created after mkdocs-material was put into maintenance mode. We'll use this
platform until Zensical is more feature complete.

Added a few of the existing docs into the site to make a more user-friendly flow.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-03-20 14:51:39 -05:00
Manuel Huber
8903b12d34 tests: nvidia: Increase run test timeout
Increase the timeout as a few new features and tests are going to be
onboarded for the NVIDIA GPU CI.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-20 11:12:52 -07:00
Manuel Huber
476f550977 docs: Update NVIDIA GPU passthrough QEMU scenario
With the upcoming GPU operator 26.3 relase and recent changes to
kata-containers, we adapt this documentation with notes on multi
GPU passthrough, support for TDX, changed deployment instructions,
and with various other minor improvements.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-20 10:53:14 -07:00
Manuel Huber
ae59cf26a0 kata-deploy: Check kata-tarball size limits
For kata tarballs we eventually release to GitHub, check their
size against the GitHub size limit. With this, we fail in case of
an ongoing release process in 'CI | Publish Kata Containers payload'
instead of only later on in the 'Release Kata Containers' action,
and we fail during PR builds, avoiding this situation at all.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-20 10:40:55 -07:00
RuoqingHe
cfc1836a31 Merge pull request #12672 from stevenhorsman/agent-security-fixes
agent: Bump tracing-subscriber
2026-03-20 17:37:16 +08:00
Steve Horsman
7ab6e11e10 Merge pull request #12678 from kata-containers/dependabot/go_modules/src/runtime/google.golang.org/grpc-1.79.3
build(deps): bump google.golang.org/grpc from 1.72.0 to 1.79.3 in /src/runtime
2026-03-20 08:49:35 +00:00
Steve Horsman
e475fb2116 Merge pull request #12680 from kata-containers/dependabot/go_modules/src/tools/csi-kata-directvolume/google.golang.org/grpc-1.79.3
build(deps): bump google.golang.org/grpc from 1.63.2 to 1.79.3 in /src/tools/csi-kata-directvolume
2026-03-20 08:49:27 +00:00
RuoqingHe
f62a6b6ab2 Merge pull request #12677 from stevenhorsman/cspell-action
Switch to use cspell for spell checking
2026-03-20 13:27:36 +08:00
Manuel Huber
4afb55154a docs: Update release process notes
Update the Release-Process.md file with some clarifications on the
release process.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-19 15:14:23 -07:00
stevenhorsman
38a655487f vsock-exporter: Switch bincode for serde_json
bincode is not maintained, so switch to serde_json to
resolve RUSTSEC-2025-0141

Assisted-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
e1d7d5bef8 agent: Remove async-std
It's a dev-dependency that doesn't seem to be used, so
remove it and resolve RUSTSEC-2025-0052

Assisted-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
e4eda5e1d8 agent: Bump tracing-subscriber
- Bump tracing-subscriber to 0.3.20 to resolve RUSTSEC-2025-0055
- Switch deprecated `slog_info!` for `slog::info!`

Generated-By: Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:45:17 +00:00
stevenhorsman
e62df07b6a static-checks: Delete kata-spell-check
The old hunspell based spell-check was causing contributors
challenges and proving a barrier to doc updates. We've replaced
it with a cspell based-solution, so clean up the old approach.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
stevenhorsman
44ec815f77 gatekeeper: Add check-spelling to required
Add the new job to the list of static checks that runs on doc updates.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
stevenhorsman
c2cedd7c02 workflows: Add spellcheck workflow
Add a separate spellcheck workflow, so we can replace
the complex hunspell approach embedded in static-checks

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
stevenhorsman
d06dadd8ef docs: Spelling updates
Either fixing typos, or including program/repo name in
backticks

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
stevenhorsman
829a32ee67 spellcheck: Add cspell files
Add cspell config and initial dictionary

Assisted-by: Bob (dictionary ordering and catergorisation)
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-19 10:22:54 +00:00
dependabot[bot]
2f5415d8f5 build(deps): bump google.golang.org/grpc
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.63.2 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.63.2...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-19 10:03:45 +00:00
dependabot[bot]
3876a80208 build(deps): bump google.golang.org/grpc in /src/runtime
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.72.0 to 1.79.3.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.72.0...v1.79.3)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-version: 1.79.3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-03-19 10:03:30 +00:00
Alex Lyn
de2ddf6ed9 Merge pull request #12637 from stevenhorsman/bump-rust-to-1.92
versions: Bump rust to 1.92
2026-03-19 10:02:18 +08:00
Manuel Huber
5765bc97b4 tests: cc: setup function for python venv
We recently had a failure on a new CI runner where
${HOME}/.cicd/venv/bin/activate was not present. The relevant call
originated from ensure_sev_snp_measure. Thus, add a function
ensure_cicd_python_venv before callers to pip install.
Currently, the NVIDIA NIM test and the confidential attestation
tests use pip to install dependencies.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-18 17:07:47 -07:00
Manuel Huber
62d74bb1fd agent: add mkfs_opts parameter to cdh_secure_mount
Add an mkfs_opts parameter to cdh_secure_mount so that its users
can parametrize these options depending on their needs. For now,
there is two users providing explicit values (container image
layer storage and container data storage features).

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-18 15:37:32 -07:00
Aurélien Bombo
352b4cdad2 Merge pull request #12660 from LandonTClipp/ci_docs
ci: Don't run CI builds on doc PRs
2026-03-17 12:19:11 -05:00
stevenhorsman
56b6917adf versions: Bump rust to 1.92
Now that 1.94 has release, in compliance with our toolchain guidance
we should bump to rust 1.92

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
stevenhorsman
2a4227e02e kata-ctl: Try fixing unused_assignement error
`allow(unused_assignments)` isn't working as it's
in macro generated code, so referencing the command
in the error, to use it

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
stevenhorsman
ca7cdcd732 kata-ctl: Rewrite path_join test
This test was failing clippy by calling .unwrap() after
an .is_ok(), but after I looked at it, it seemed a bit messy,
so I split it up and tried rewriting it to make it more readable
IMHO.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
stevenhorsman
501578cc5a agent: Remove non-idiomatic unwrap
Calling .unwrap() after an .is_some() check is considered non-idiomatic in
as it performs redundant work and makes the code more verbose.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-17 16:04:58 +00:00
Alex Lyn
833b72470c Merge pull request #12647 from sprt/gp-improve
genpolicy: Improve emptyDir storage options and mount point validation
2026-03-17 13:56:42 +08:00
Manuel Huber
660e3bb653 gpu: Obsolete the NVIDIA initrd build
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 21:29:58 -04:00
Aurélien Bombo
f8e234c6f9 Merge pull request #12650 from kata-containers/sprt/remove-csi
ci: Stop building/deploying CSI driver
2026-03-16 16:53:02 -05:00
Steve Horsman
294c367063 Merge pull request #12668 from manuelh-dev/release/3.28.0
release: Bump version to 3.28.0
2026-03-16 19:47:12 +00:00
Manuel Huber
5210584f95 release: Bump version to 3.28.0
Bump VERSION and helm-charts versions.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:52:35 -07:00
Manuel Huber
e13748f46d tests: Adapt trusted ephemeral storage test
With the new CDH version, the LUKS header is moved off of the disk
into guest memory. We hence adapt the test's filesystem type checks.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:43:17 -07:00
Manuel Huber
5bbc0abb81 tests: use pre-created, signed sealed secrets
With signature support for sealed secret, use pre-created signed
sealed secrets and provision the signing public key to the KBS.

Add instructions for re-creating these signed secrets.

Improve k8s-sealed-secrets.bats by reducing repeated kubectl logs
calls. A test run showed a SIGPIPE error one one of the grep-logs
while the printouts of the initial kubectl logs invocation showed
that the expected values were actually in the logs.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:43:17 -07:00
Manuel Huber
a9b222f91e gpu: Update chiseled rootfs with new CDH deps
With CDH requiring libcryptsetup, mkfs.ext4, dd, and their
dependencies, we will need to update the chiseled NVIDIA rootfs
accordingly.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:43:17 -07:00
Manuel Huber
169f92ff09 agent: cdh: Update CDH and API
With the new CDH version, the secure_mount API changes.
Further, the new CDH version no longer uses the luks-encrypt-storage
script but utilizes libcryptsetup as well as mkfs.ext4 and dd. Hence, adapt
some of the CDH and Kata components build steps

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-16 09:43:17 -07:00
Alex Lyn
ef5db0a01f Merge pull request #12607 from zvonkok/system-map
kernel: Ship System.map as part of the kernel build
2026-03-16 09:37:44 +08:00
Zvonko Kaiser
99f32de1e5 kata-deploy: Update RuntimeClass PodOverhead
Align the podOverhead with the default_memory updated
in the previous commit.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
6a853a9684 gpu: Bump NVRC
We have a new release add this one to the next
Kata release.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Potential fix for pull request finding

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
8ff5d164c6 runtime: make CDI annotation vendor-agnostic with lookup table
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.

Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.

Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.

Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d4c21f50b5 gpu: Bump default memory to 8G for GPU runtimes
We need enough inital memory to prepare more complex
platforms like HGX H100 or HGX B200 systems.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
5c9683f006 gpu: Remove devtmpfs.mount=0
With the newest NVRC release this is solved and does
not need to be overriden.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
d22c314e91 gpu: Increase dial_timeout=1200
For cold-plug when running with nerdctl the timeouts in the config
are being used, increase the dial_timeout (e.g. for CreateSandbox) to match
create_container_timeout.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Zvonko Kaiser
7fe84c8038 gpu: HGX Rootfs Fixes
Various smaller fixes to enable HGX systems.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-15 09:53:32 -07:00
Joji Mekkattuparamban
1fd66db271 nvidia-gpu: add missing libraries to rootfs
Added the missing packages to the nvidia rootfs.

Fixes #12534

Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>
2026-03-13 16:24:32 -07:00
Dan Mihai
9332b75c04 Merge pull request #12661 from stevenhorsman/runtime-go-1.25.8
runtime: bump go.mod version
2026-03-13 14:06:08 -07:00
Zvonko Kaiser
d382379571 kernel: Ship System.map as part of the kernel build
Some use-cases need the System.map of the running kernel,
ship it via kernel-artifact.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-13 19:27:18 +00:00
Manuel Huber
4a7022d2f4 tests: nvidia: call genpolicy auth for all tests
Call the setup_genpolicy_registry_auth in run_kubernetes_nv_tests.sh.
Authenticate before exercising any tests.

Recently, we have seen UnauthorizedError messages for the CUDA
vectorAdd image. While this image is not gated behind authentication,
rate limiting may be a possible issue.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-13 09:03:01 -07:00
LandonTClipp
9a8932412d docs: remove URL and markdown reference checks
This URL check performed a CURL command to see if it was real. This will
not work in the mkdocs world because the docs might reference a link that
is not yet built on the main page. This is a chicken-and-egg problem.

For reference:

```
ERROR: Invalid URL 'https://kata-containers.github.io/kata-containers/installation/#helm-chart' found in the following files:

tools/packaging/kata-deploy/helm-chart/README.md
```

The markdown reference requirement was put in place for the old docs system, but this
will not apply anymore in the new mkdocs system. I'm removing this
entirely because it will only get in the way and cause confusion.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-03-12 15:48:35 -05:00
LandonTClipp
d5d741f4e3 ci: Don't run CI builds on doc PRs
We disable the Kata artifact builds and testing if the PR is only
related to documentation. Regular static checks will remain.

Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
2026-03-12 15:48:05 -05:00
Zvonko Kaiser
4c450a5b01 Merge pull request #12648 from manuelh-dev/mahuber/trustee-upgrade
versions: bump trustee to latest version
2026-03-12 14:09:15 -04:00
Steve Horsman
7d2e18575c Merge pull request #12343 from zvonkok/release-model
doc: Release model update
2026-03-12 14:44:51 +00:00
Zvonko Kaiser
7f662662cf lint: Fix 80 char column size
Make markdownlint happy

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2026-03-12 12:03:29 +00:00
Zvonko Kaiser
6e03a95730 doc: Update Release Process
Add how Kata is doing the rolling release.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-03-12 12:03:29 +00:00
stevenhorsman
f25fa6ab25 runtime: bump go.mod version
Update the runtime's go.mod go version to 1.25.8 to
keep in sync with versions.yaml

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2026-03-12 08:53:40 +00:00
Manuel Huber
0926c92aa0 versions: bump trustee to latest version
Ingest various recent fixes and dependency updates.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-03-11 14:45:42 -07:00
Aurélien Bombo
32444737b5 gatekeeper: Remove csi-kata-directvolume build from required tests
Since we don't build that anymore.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-11 12:55:23 -05:00
Aurélien Bombo
64aed13d5f Revert "ci: Add no-op step to compile CSI driver"
This reverts commit e43c59a2c6.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-11 12:55:23 -05:00
Aurélien Bombo
dd2c4c0db3 Revert "coco: ci: Add no-op steps to deploy CSI driver"
This reverts commit 5e4990bcf5.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-11 12:55:23 -05:00
Aurélien Bombo
d598e0baf1 Revert "ci: Implement build step for CSI driver"
This partially reverts commit fb87bf221f.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-11 12:55:23 -05:00
Aurélien Bombo
2a15cfc5ec genpolicy: Improve emptyDir storage options and mount point validation
These are two changes following a Copilot review on #10559:

1. Restore the p_storage.driver != "blk" check in allow_storage_options():
   - An early version of #10599 hardcoded p_storage.driver to "blk".
   - Hence that check needed to be removed to validate "blk" storage options.
   - The final version of #10599 hardcodes p_storage.driver to "" to
     account for both "blk" and "scsi", and checks storage options in
     allow_block_storage().
   - Hence that check should be restored to preserve the original behavior.

https://github.com/kata-containers/kata-containers/pull/10559#discussion_r2907646552

2. Don't use a regex to validate emptyDir storage mount points:
   - It's risky to use a regex to validate a path that has base64-encoded
     components.
   - We can infer the exact path anyway so the regex is redundant.

https://github.com/kata-containers/kata-containers/pull/10559#discussion_r2907646582

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-03-10 11:22:10 -05:00
Lukáš Doktor
ce65d17276 tools.gatekeeper: Add support for GITHUB_STEP_SUMMARY
this should produce a table of failed/running jobs as a table along with
links to them. On pass it should only produce simple line with how many
jobs passed.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-03-06 12:19:26 -03:00
Lukáš Doktor
27bebfb438 tools.gatekeeper: Print link to the results in status output
to simplify analyzing failures let's print the link to the job result
next to the status.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2026-03-06 12:19:26 -03:00
276 changed files with 12444 additions and 7749 deletions

37
.cspell.yaml Normal file
View File

@@ -0,0 +1,37 @@
# yaml-language-server: $schema=https://raw.githubusercontent.com/streetsidesoftware/cspell/main/cspell.schema.json
version: "0.2"
language: en,en-GB
dictionaryDefinitions:
- name: kata-terms
path: ./tests/spellcheck/kata-dictionary.txt
addWords: true
dictionaries:
- en-GB
- en_US
- bash
- git
- golang
- k8s
- python
- rust
- companies
- mnemonics
- peopleNames
- softwareTerms
- networking-terms
- kata-terms
ignoreRegExpList:
- /@[a-z\d](?:[a-z\d]|-(?=[a-z\d])){0,38}/gi # Ignores github handles
# Ignore code blocks
- /^\s*`{3,}[\s\S]*?^\s*`{3,}/gm
- /`[^`\n]+`/g
ignorePaths:
- "**/vendor/**" # vendor files aren't owned by us
- "**/src/runtime/virtcontainers/pkg/cloud-hypervisor/client/**" # Generated files
- "**/requirements.txt"
useGitignore: true

View File

@@ -37,9 +37,9 @@ updates:
# create groups for common dependencies, so they can all go in a single PR
# We can extend this as we see more frequent groups
groups:
bit-vec:
aws-libcrypto:
patterns:
- bit-vec
- aws-lc-*
bumpalo:
patterns:
- bumpalo
@@ -67,6 +67,9 @@ updates:
rustix:
patterns:
- rustix
rustls-webpki:
patterns:
- rustls-webpki
slab:
patterns:
- slab

View File

@@ -47,6 +47,7 @@ jobs:
- coco-guest-components
- firecracker
- kernel
- kernel-debug
- kernel-dragonball-experimental
- kernel-nvidia-gpu
- nydus
@@ -168,8 +169,6 @@ jobs:
- rootfs-image-nvidia-gpu-confidential
- rootfs-initrd
- rootfs-initrd-confidential
- rootfs-initrd-nvidia-gpu
- rootfs-initrd-nvidia-gpu-confidential
steps:
- name: Login to Kata Containers quay.io
if: ${{ inputs.push-to-registry == 'yes' }}
@@ -349,6 +348,16 @@ jobs:
./tools/packaging/kata-deploy/local-build/kata-deploy-merge-builds.sh kata-artifacts versions.yaml
env:
RELEASE: ${{ inputs.stage == 'release' && 'yes' || 'no' }}
- name: Check kata tarball size (GitHub release asset limit)
run: |
# https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas
GITHUB_ASSET_MAX_BYTES=2147483648
tarball_size=$(stat -c "%s" kata-static.tar.zst)
if [[ "${tarball_size}" -ge "${GITHUB_ASSET_MAX_BYTES}" ]]; then
echo "::error::tarball size (${tarball_size} bytes) >= GitHub release asset limit (${GITHUB_ASSET_MAX_BYTES} bytes)"
exit 1
fi
echo "tarball size: ${tarball_size} bytes"
- name: store-artifacts
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:
@@ -367,7 +376,6 @@ jobs:
matrix:
asset:
- agent-ctl
- csi-kata-directvolume
- genpolicy
- kata-ctl
- kata-manager
@@ -450,6 +458,16 @@ jobs:
./tools/packaging/kata-deploy/local-build/kata-deploy-merge-builds.sh kata-tools-artifacts versions.yaml kata-tools-static.tar.zst
env:
RELEASE: ${{ inputs.stage == 'release' && 'yes' || 'no' }}
- name: Check kata-tools tarball size (GitHub release asset limit)
run: |
# https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas
GITHUB_ASSET_MAX_BYTES=2147483648
tarball_size=$(stat -c "%s" kata-tools-static.tar.zst)
if [[ "${tarball_size}" -ge "${GITHUB_ASSET_MAX_BYTES}" ]]; then
echo "::error::tarball size (${tarball_size} bytes) >= GitHub release asset limit (${GITHUB_ASSET_MAX_BYTES} bytes)"
exit 1
fi
echo "tarball size: ${tarball_size} bytes"
- name: store-artifacts
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:

View File

@@ -45,6 +45,7 @@ jobs:
- cloud-hypervisor
- firecracker
- kernel
- kernel-debug
- kernel-dragonball-experimental
- kernel-nvidia-gpu
- kernel-cca-confidential
@@ -152,7 +153,6 @@ jobs:
- rootfs-image
- rootfs-image-nvidia-gpu
- rootfs-initrd
- rootfs-initrd-nvidia-gpu
steps:
- name: Login to Kata Containers quay.io
if: ${{ inputs.push-to-registry == 'yes' }}
@@ -327,6 +327,16 @@ jobs:
./tools/packaging/kata-deploy/local-build/kata-deploy-merge-builds.sh kata-artifacts versions.yaml
env:
RELEASE: ${{ inputs.stage == 'release' && 'yes' || 'no' }}
- name: Check kata tarball size (GitHub release asset limit)
run: |
# https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas
GITHUB_ASSET_MAX_BYTES=2147483648
tarball_size=$(stat -c "%s" kata-static.tar.zst)
if [[ "${tarball_size}" -ge "${GITHUB_ASSET_MAX_BYTES}" ]]; then
echo "::error::tarball size (${tarball_size} bytes) >= GitHub release asset limit (${GITHUB_ASSET_MAX_BYTES} bytes)"
exit 1
fi
echo "tarball size: ${tarball_size} bytes"
- name: store-artifacts
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:

View File

@@ -262,6 +262,16 @@ jobs:
./tools/packaging/kata-deploy/local-build/kata-deploy-merge-builds.sh kata-artifacts versions.yaml
env:
RELEASE: ${{ inputs.stage == 'release' && 'yes' || 'no' }}
- name: Check kata tarball size (GitHub release asset limit)
run: |
# https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas
GITHUB_ASSET_MAX_BYTES=2147483648
tarball_size=$(stat -c "%s" kata-static.tar.zst)
if [[ "${tarball_size}" -ge "${GITHUB_ASSET_MAX_BYTES}" ]]; then
echo "::error::tarball size (${tarball_size} bytes) >= GitHub release asset limit (${GITHUB_ASSET_MAX_BYTES} bytes)"
exit 1
fi
echo "tarball size: ${tarball_size} bytes"
- name: store-artifacts
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:

View File

@@ -350,6 +350,16 @@ jobs:
./tools/packaging/kata-deploy/local-build/kata-deploy-merge-builds.sh kata-artifacts versions.yaml
env:
RELEASE: ${{ inputs.stage == 'release' && 'yes' || 'no' }}
- name: Check kata tarball size (GitHub release asset limit)
run: |
# https://docs.github.com/en/repositories/releasing-projects-on-github/about-releases#storage-and-bandwidth-quotas
GITHUB_ASSET_MAX_BYTES=2147483648
tarball_size=$(stat -c "%s" kata-static.tar.zst)
if [[ "${tarball_size}" -ge "${GITHUB_ASSET_MAX_BYTES}" ]]; then
echo "::error::tarball size (${tarball_size} bytes) >= GitHub release asset limit (${GITHUB_ASSET_MAX_BYTES} bytes)"
exit 1
fi
echo "tarball size: ${tarball_size} bytes"
- name: store-artifacts
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
with:

View File

@@ -4,17 +4,18 @@ on:
branches:
- main
permissions: {}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
deploy-docs:
name: deploy-docs
build:
runs-on: ubuntu-24.04
name: Build docs
permissions:
contents: read
pages: write
id-token: write
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
runs-on: ubuntu-latest
steps:
- uses: actions/configure-pages@983d7736d9b0ae728b81ab479565c72886d7745b # v5.0.0
- uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5.0.1
@@ -23,10 +24,30 @@ jobs:
- uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: 3.x
- run: pip install zensical
- run: zensical build --clean
- run: pip install -r docs/requirements.txt
- run: python3 -m mkdocs build --config-file ./mkdocs.yaml --site-dir site/
id: build
- uses: actions/upload-pages-artifact@7b1f4a764d45c48632c6b24a0339c27f5614fb0b # v4.0.0
with:
path: site
- uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e # v4.0.5
id: deployment
with:
path: site/
name: github-pages
deploy:
needs: build
runs-on: ubuntu-24.04
name: Deploy docs
permissions:
pages: write
id-token: write
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}
steps:
- name: Deploy to GitHub Pages
uses: actions/deploy-pages@d6db90164ac5ed86f2b6aed7e0febac5b3c0c03e # v4.0.5
id: deployment
with:
artifact_name: github-pages

View File

@@ -49,6 +49,8 @@ jobs:
KATA_HYPERVISOR: ${{ matrix.environment.vmm }}
KUBERNETES: kubeadm
KBS: ${{ matrix.environment.name == 'nvidia-gpu-snp' && 'true' || 'false' }}
SNAPSHOTTER: ${{ matrix.environment.name == 'nvidia-gpu-snp' && 'nydus' || '' }}
USE_EXPERIMENTAL_SNAPSHOTTER_SETUP: ${{ matrix.environment.name == 'nvidia-gpu-snp' && 'true' || 'false' }}
K8S_TEST_HOST_TYPE: baremetal
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -98,7 +100,7 @@ jobs:
run: bash tests/integration/kubernetes/gha-run.sh install-bats
- name: Run tests ${{ matrix.environment.vmm }}
timeout-minutes: 30
timeout-minutes: 60
run: bash tests/integration/kubernetes/gha-run.sh run-nv-tests
env:
NGC_API_KEY: ${{ secrets.NGC_API_KEY }}

View File

@@ -110,10 +110,6 @@ jobs:
timeout-minutes: 10
run: bash tests/integration/kubernetes/gha-run.sh install-kbs-client
- name: Deploy CSI driver
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh deploy-csi-driver
- name: Run tests
timeout-minutes: 100
run: bash tests/integration/kubernetes/gha-run.sh run-tests
@@ -134,10 +130,6 @@ jobs:
[[ "${KATA_HYPERVISOR}" == "qemu-tdx" ]] && echo "ITA_KEY=${GH_ITA_KEY}" >> "${GITHUB_ENV}"
bash tests/integration/kubernetes/gha-run.sh delete-coco-kbs
- name: Delete CSI driver
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh delete-csi-driver
# Generate jobs for testing CoCo on non-TEE environments
run-k8s-tests-coco-nontee:
name: run-k8s-tests-coco-nontee
@@ -235,10 +227,6 @@ jobs:
timeout-minutes: 10
run: bash tests/integration/kubernetes/gha-run.sh install-kbs-client
- name: Deploy CSI driver
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh deploy-csi-driver
- name: Run tests
timeout-minutes: 80
run: bash tests/integration/kubernetes/gha-run.sh run-tests
@@ -257,11 +245,6 @@ jobs:
timeout-minutes: 10
run: bash tests/integration/kubernetes/gha-run.sh delete-coco-kbs
- name: Delete CSI driver
if: always()
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh delete-csi-driver
# Extensive matrix: autogenerated policy tests (nydus + experimental-force-guest-pull) on k0s, k3s, rke2, microk8s with qemu-coco-dev / qemu-coco-dev-runtime-rs
run-k8s-tests-coco-nontee-extensive-matrix:
if: ${{ inputs.extensive-matrix-autogenerated-policy == 'yes' }}
@@ -365,10 +348,6 @@ jobs:
timeout-minutes: 10
run: bash tests/integration/kubernetes/gha-run.sh install-kbs-client
- name: Deploy CSI driver
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh deploy-csi-driver
- name: Run tests
timeout-minutes: 80
run: bash tests/integration/kubernetes/gha-run.sh run-tests
@@ -387,11 +366,6 @@ jobs:
timeout-minutes: 10
run: bash tests/integration/kubernetes/gha-run.sh delete-coco-kbs
- name: Delete CSI driver
if: always()
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh delete-csi-driver
# Generate jobs for testing CoCo on non-TEE environments with erofs-snapshotter
run-k8s-tests-coco-nontee-with-erofs-snapshotter:
name: run-k8s-tests-coco-nontee-with-erofs-snapshotter
@@ -478,10 +452,6 @@ jobs:
timeout-minutes: 20
run: bash tests/integration/kubernetes/gha-run.sh deploy-kata
- name: Deploy CSI driver
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh deploy-csi-driver
- name: Run tests
timeout-minutes: 80
run: bash tests/integration/kubernetes/gha-run.sh run-tests
@@ -494,8 +464,3 @@ jobs:
if: always()
timeout-minutes: 15
run: bash tests/integration/kubernetes/gha-run.sh cleanup
- name: Delete CSI driver
if: always()
timeout-minutes: 5
run: bash tests/integration/kubernetes/gha-run.sh delete-csi-driver

30
.github/workflows/spellcheck.yaml vendored Normal file
View File

@@ -0,0 +1,30 @@
name: Spelling check
on: ["pull_request"]
permissions: {}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
check-spelling:
name: check-spelling
runs-on: ubuntu-24.04
steps:
- name: Checkout code
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
persist-credentials: false
- name: Check Spelling
uses: streetsidesoftware/cspell-action@9cd41bb518a24fefdafd9880cbab8f0ceba04d28 # 8.3.0
with:
files: |
**/*.md
**/*.rst
**/*.txt
incremental_files_only: true
config: ".cspell.yaml"

View File

@@ -138,7 +138,7 @@ jobs:
go-version: ${{ env.GO_VERSION }}
- name: Install system dependencies
run: |
sudo apt-get update && sudo apt-get -y install moreutils hunspell hunspell-en-gb hunspell-en-us pandoc
sudo apt-get update && sudo apt-get -y install moreutils
- name: Install open-policy-agent
run: |
cd "${GOPATH}/src/github.com/${GITHUB_REPOSITORY}"

1467
Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -6,6 +6,12 @@ rust-version = "1.88"
[workspace]
members = [
# kata-agent
"src/agent",
"src/agent/rustjail",
"src/agent/policy",
"src/agent/vsock-exporter",
# Dragonball
"src/dragonball",
"src/dragonball/dbs_acpi",
@@ -41,7 +47,6 @@ resolver = "2"
# TODO: Add all excluded crates to root workspace
exclude = [
"src/agent",
"src/tools",
"src/libs",
@@ -104,6 +109,7 @@ wasm_container = { path = "src/runtime-rs/crates/runtimes/wasm_container" }
kata-sys-util = { path = "src/libs/kata-sys-util" }
kata-types = { path = "src/libs/kata-types", features = ["safe-path"] }
logging = { path = "src/libs/logging" }
mem-agent = { path = "src/libs/mem-agent" }
protocols = { path = "src/libs/protocols", features = ["async"] }
runtime-spec = { path = "src/libs/runtime-spec" }
safe-path = { path = "src/libs/safe-path" }
@@ -112,35 +118,65 @@ test-utils = { path = "src/libs/test-utils" }
# Local dependencies from `src/agent`
kata-agent-policy = { path = "src/agent/policy" }
rustjail = { path = "src/agent/rustjail" }
vsock-exporter = { path = "src/agent/vsock-exporter" }
# Outside dependencies
actix-rt = "2.7.0"
anyhow = "1.0"
async-recursion = "0.3.2"
async-trait = "0.1.48"
capctl = "0.2.0"
cfg-if = "1.0.0"
cgroups = { package = "cgroups-rs", git = "https://github.com/kata-containers/cgroups-rs", rev = "v0.3.5" }
clap = { version = "4.5.40", features = ["derive"] }
const_format = "0.2.30"
containerd-shim = { version = "0.10.0", features = ["async"] }
containerd-shim-protos = { version = "0.10.0", features = ["async"] }
derivative = "2.2.0"
futures = "0.3.30"
go-flag = "0.1.0"
hyper = "0.14.20"
hyperlocal = "0.8.0"
ipnetwork = "0.17.0"
lazy_static = "1.4"
libc = "0.2"
libc = "0.2.94"
log = "0.4.14"
netlink-packet-core = "0.7.0"
netlink-packet-route = "0.19.0"
netlink-sys = { version = "0.7.0", features = ["tokio_socket"] }
netns-rs = "0.1.0"
# Note: nix needs to stay sync'd with libs versions
nix = "0.26.4"
oci-spec = { version = "0.8.1", features = ["runtime"] }
opentelemetry = { version = "0.17.0", features = ["rt-tokio"] }
procfs = "0.12.0"
prometheus = { version = "0.14.0", features = ["process"] }
protobuf = "3.7.2"
rand = "0.8.4"
regex = "1.10.5"
rstest = "0.18.0"
rtnetlink = "0.14.0"
scan_fmt = "0.2.6"
scopeguard = "1.0.0"
serde = { version = "1.0.145", features = ["derive"] }
serde_json = "1.0.91"
serial_test = "0.10.0"
sha2 = "0.10.9"
slog = "2.5.2"
slog-scope = "4.4.0"
slog-stdlog = "4.0.0"
slog-term = "2.9.0"
strum = { version = "0.24.0", features = ["derive"] }
strum_macros = "0.26.2"
tempfile = "3.19.1"
thiserror = "1.0"
thiserror = "1.0.26"
tokio = "1.46.1"
tokio-vsock = "0.3.4"
toml = "0.5.8"
tracing = "0.1.41"
tracing-opentelemetry = "0.18.0"
tracing-subscriber = "0.3.20"
ttrpc = "0.8.4"
url = "2.5.4"
which = "4.3.0"

View File

@@ -49,8 +49,11 @@ docs-url-alive-check:
build-and-publish-kata-debug:
bash tools/packaging/kata-debug/kata-debug-build-and-upload-payload.sh ${KATA_DEBUG_REGISTRY} ${KATA_DEBUG_TAG}
docs-serve:
docker run --rm -p 8000:8000 -v ./docs:/docs:ro -v ${PWD}/zensical.toml:/zensical.toml:ro zensical/zensical serve --config-file /zensical.toml -a 0.0.0.0:8000
docs-build:
docker build -t kata-docs:latest -f ./docs/Dockerfile ./docs
docs-serve: docs-build
docker run --rm -p 8000:8000 -v ${PWD}:/docs:ro kata-docs:latest serve --config-file /docs/mkdocs.yaml -a 0.0.0.0:8000
.PHONY: \
all \
@@ -59,4 +62,5 @@ docs-serve:
default \
static-checks \
docs-url-alive-check \
docs-build \
docs-serve

View File

@@ -74,7 +74,7 @@ See the [official documentation](docs) including:
- [Developer guide](docs/Developer-Guide.md)
- [Design documents](docs/design)
- [Architecture overview](docs/design/architecture)
- [Architecture 3.0 overview](docs/design/architecture_3.0/)
- [Architecture 4.0 overview](docs/design/architecture_4.0/)
## Configuration

View File

@@ -1 +1 @@
3.27.0
3.28.0

View File

@@ -378,7 +378,7 @@ that is used in the test" section. From there you can see exactly what you'll
have to use when deploying kata-deploy in your local cluster.
> [!NOTE]
> TODO: WAINER TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI
> TODO: @wainersm TO FINISH THIS PART BASED ON HIS PR TO RUN A LOCAL CI
## Adding new runners

View File

@@ -98,7 +98,7 @@ Let's say the OCP pipeline passed running with
but failed running with
``quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``
and you'd like to know which PR caused the regression. You can either run with
all the 60 tags between or you can utilize the [bisecter](https://github.com/ldoktor/bisecter)
all the 60 tags between or you can utilize the [`bisecter`](https://github.com/ldoktor/bisecter)
to optimize the number of steps in between.
Before running the bisection you need a reproducer script. Sample one called

18
docs/.nav.yml Normal file
View File

@@ -0,0 +1,18 @@
# https://lukasgeiter.github.io/mkdocs-awesome-nav/
nav:
- Home: index.md
- Getting Started:
- prerequisites.md
- installation.md
- Configuration:
- helm-configuration.md
- runtime-configuration.md
- Platform Support:
- hypervisors.md
- Guides:
- Use Cases:
- NVIDIA GPU Passthrough: use-cases/NVIDIA-GPU-passthrough-and-Kata-QEMU.md
- NVIDIA vGPU: use-cases/NVIDIA-GPU-passthrough-and-Kata.md
- Intel Discrete GPU: use-cases/Intel-Discrete-GPU-passthrough-and-Kata.md
- Misc:
- Architecture: design/architecture/

View File

@@ -522,10 +522,18 @@ $ sudo kata-runtime check
If your system is *not* able to run Kata Containers, the previous command will error out and explain why.
# Run Kata Containers with Containerd
Refer to the [How to use Kata Containers and Containerd](how-to/containerd-kata.md) how-to guide.
# Run Kata Containers with Kubernetes
Refer to the [Run Kata Containers with Kubernetes](how-to/run-kata-with-k8s.md) how-to guide.
- Containerd
Refer to the [How to use Kata Containers and Containerd with Kubernetes](how-to/how-to-use-k8s-with-containerd-and-kata.md) how-to guide.
- CRI-O
Refer to the [How to use Kata Containers and CRI-O with Kubernetes](how-to/how-to-use-k8s-with-crio-and-kata.md) how-to guide.
# Troubleshoot Kata Containers
@@ -730,7 +738,7 @@ sudo sed -i -e 's/^kernel_params = "\(.*\)"/kernel_params = "\1 agent.debug_cons
##### Connecting to the debug console
Next, connect to the debug console. The VSOCKS paths vary slightly between each
Next, connect to the debug console. The VSOCK paths vary slightly between each
VMM solution.
In case of cloud-hypervisor, connect to the `vsock` as shown:

11
docs/Dockerfile Normal file
View File

@@ -0,0 +1,11 @@
# Copyright 2026 Kata Contributors
#
# SPDX-License-Identifier: Apache-2.0
#
FROM python:3.12-slim
WORKDIR /
COPY ./requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
ENTRYPOINT ["python3", "-m", "mkdocs"]

View File

@@ -188,15 +188,14 @@ and compare them with standard tools (e.g. `diff(1)`).
# Spelling
Since this project uses a number of terms not found in conventional
dictionaries, we have a
[spell checking tool](https://github.com/kata-containers/kata-containers/tree/main/tests/cmd/check-spelling)
that checks both dictionary words and the additional terms we use.
dictionaries, we have a [kata-dictionary](../tests/spellcheck/kata-dictionary.txt)
that contains some project specific terms we use.
Run the spell checking tool on your document before raising a PR to ensure it
You can run the `cspell` checking tool on your document before raising a PR to ensure it
is free of mistakes.
If your document introduces new terms, you need to update the custom
dictionary used by the spell checking tool to incorporate the new words.
dictionary to incorporate the new words.
# Names

View File

@@ -1,59 +1,69 @@
# How to do a Kata Containers Release
This document lists the tasks required to create a Kata Release.
## Requirements
- GitHub permissions to run workflows.
## Versioning
## Release Model
The Kata Containers project uses [semantic versioning](http://semver.org/) for all releases.
Semantic versions are comprised of three fields in the form:
Kata Containers follows a rolling release model with monthly snapshots.
New features, bug fixes, and improvements are continuously integrated into
`main`. Each month, a snapshot is tagged as a new `MINOR` release.
```
MAJOR.MINOR.PATCH
```
### Versioning
When `MINOR` increases, the new release adds **new features** but *without changing the existing behavior*.
Releases use the `MAJOR.MINOR.PATCH` scheme. Monthly snapshots increment
`MINOR`; `PATCH` is typically `0`. Major releases are rare (years apart) and
signal significant architectural changes that may require updates to container
managers (Containerd, CRI-O) or other infrastructure. Breaking changes in
`MINOR` releases are avoided where possible, but may occasionally occur as
features are deprecated or removed.
When `MAJOR` increases, the new release adds **new features, bug fixes, or
both** and which **changes the behavior from the previous release** (incompatible with previous releases).
### No Stable Branches
A major release will also likely require a change of the container manager version used,
-for example Containerd or CRI-O. Please refer to the release notes for further details.
**Important** : the Kata Containers project doesn't have stable branches (see
[this issue](https://github.com/kata-containers/kata-containers/issues/9064) for details).
Bug fixes are released as part of `MINOR` or `MAJOR` releases only. `PATCH` is always `0`.
The Kata Containers project does not maintain stable branches (see
[#9064](https://github.com/kata-containers/kata-containers/issues/9064)).
Bug fixes land on `main` and ship in the next monthly snapshot rather than
being backported. Downstream projects that need extended support or compliance
certifications should select a monthly snapshot as their stable base and manage
their own validation and patch backporting from there.
## Release Process
### Bump the `VERSION` and `Chart.yaml` file
### Lock the `main` branch and announce release process
When the `kata-containers/kata-containers` repository is ready for a new release,
first create a PR to set the release in the [`VERSION`](./../VERSION) file and update the
`version` and `appVersion` in the
[`Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml) file and
have it merged.
### Lock the `main` branch
In order to prevent any PRs getting merged during the release process, and slowing the release
process down, by impacting the payload caches, we have recently trailed setting the `main`
branch to read only whilst the release action runs.
In order to prevent any PRs getting merged during the release process, and
slowing the release process down, by impacting the payload caches, we have
recently trialed setting the `main` branch to read-only.
Once the `kata-containers/kata-containers` repository is ready for a new
release, lock the main branch until the release action has completed.
Notify the #kata-dev Slack channel about the ongoing release process.
Ideally, CI usage by others should be reduced to a minimum during the
ongoing release process.
> [!NOTE]
> Admin permission is needed to complete this task.
> Admin permission is needed to lock/unlock the `main` branch.
### Bump the `VERSION` and `Chart.yaml` file
Create a PR to set the release in the [`VERSION`](./../VERSION) file and to
update the `version` and `appVersion` fields in the
[`Chart.yaml`](./../tools/packaging/kata-deploy/helm-chart/kata-deploy/Chart.yaml)
file. Temporarily unlock the main branch to merge the PR.
### Wait for the `VERSION` bump PR payload publish to complete
To reduce the chance of need to re-run the release workflow, check the
[CI | Publish Kata Containers payload](https://github.com/kata-containers/kata-containers/actions/workflows/payload-after-push.yaml)
To reduce the chance of need to re-run the release workflow, check the [CI |
Publish Kata Containers
payload](https://github.com/kata-containers/kata-containers/actions/workflows/payload-after-push.yaml)
once the `VERSION` PR bump has merged to check that the assets build correctly
and are cached, so that the release process can just download these artifacts
rather than needing to build them all, which takes time and can reveal errors in infra.
rather than needing to build them all, which takes time and can reveal errors in
infra.
### Check GitHub Actions
### Trigger the `Release Kata Containers` GitHub Action
We make use of [GitHub actions](https://github.com/features/actions) in the
[release](https://github.com/kata-containers/kata-containers/actions/workflows/release.yaml)
@@ -63,11 +73,10 @@ release artifacts.
> [!NOTE]
> Write permissions to trigger the action.
The action is manually triggered and is responsible for generating a new
release (including a new tag), pushing those to the
`kata-containers/kata-containers` repository. The new release is initially
created as a draft. It is promoted to an official release when the whole
workflow has completed successfully.
The action is manually triggered and is responsible for generating a new release
(including a new tag), pushing those to the `kata-containers/kata-containers`
repository. The new release is initially created as a draft. It is promoted to
an official release when the whole workflow has completed successfully.
Check the [actions status
page](https://github.com/kata-containers/kata-containers/actions) to verify all
@@ -75,12 +84,13 @@ steps in the actions workflow have completed successfully. On success, a static
tarball containing Kata release artifacts will be uploaded to the [Release
page](https://github.com/kata-containers/kata-containers/releases).
If the workflow fails because of some external environmental causes, e.g. network
timeout, simply re-run the failed jobs until they eventually succeed.
If the workflow fails because of some external environmental causes, e.g.
network timeout, simply re-run the failed jobs until they eventually succeed.
If for some reason you need to cancel the workflow or re-run it entirely, go first
to the [Release page](https://github.com/kata-containers/kata-containers/releases) and
delete the draft release from the previous run.
If for some reason you need to cancel the workflow or re-run it entirely, go
first to the [Release
page](https://github.com/kata-containers/kata-containers/releases) and delete
the draft release from the previous run.
### Unlock the `main` branch
@@ -90,9 +100,8 @@ an admin to do it.
### Improve the release notes
Release notes are auto-generated by the GitHub CLI tool used as part of our
release workflow. However, some manual tweaking may still be necessary in
order to highlight the most important features and bug fixes in a specific
release.
release workflow. However, some manual tweaking may still be necessary in order
to highlight the most important features and bug fixes in a specific release.
With this in mind, please, poke @channel on #kata-dev and people who worked on
the release will be able to contribute to that.

View File

Before

Width:  |  Height:  |  Size: 710 B

After

Width:  |  Height:  |  Size: 710 B

View File

@@ -231,12 +231,6 @@ Run the
[markdown checker](https://github.com/kata-containers/kata-containers/tree/main/tests/cmd/check-markdown)
on your documentation changes.
### Spell check
Run the
[spell checker](https://github.com/kata-containers/kata-containers/tree/main/tests/cmd/check-spelling)
on your documentation changes.
## Finally
You may wish to read the documentation that the

View File

@@ -32,4 +32,4 @@ runtime. Refer to the following guides on how to set up Kata
Containers with Kubernetes:
- [How to use Kata Containers and containerd](../../how-to/containerd-kata.md)
- [Run Kata Containers with Kubernetes](../../how-to/run-kata-with-k8s.md)
- [Run Kata Containers with Kubernetes](../../how-to/how-to-use-k8s-with-crio-and-kata.md)

View File

@@ -1,168 +0,0 @@
# Kata 3.0 Architecture
## Overview
In cloud-native scenarios, there is an increased demand for container startup speed, resource consumption, stability, and security, areas where the present Kata Containers runtime is challenged relative to other runtimes. To achieve this, we propose a solid, field-tested and secure Rust version of the kata-runtime.
Also, we provide the following designs:
- Turn key solution with builtin `Dragonball` Sandbox
- Async I/O to reduce resource consumption
- Extensible framework for multiple services, runtimes and hypervisors
- Lifecycle management for sandbox and container associated resources
### Rationale for choosing Rust
We chose Rust because it is designed as a system language with a focus on efficiency.
In contrast to Go, Rust makes a variety of design trade-offs in order to obtain
good execution performance, with innovative techniques that, in contrast to C or
C++, provide reasonable protection against common memory errors (buffer
overflow, invalid pointers, range errors), error checking (ensuring errors are
dealt with), thread safety, ownership of resources, and more.
These benefits were verified in our project when the Kata Containers guest agent
was rewritten in Rust. We notably saw a significant reduction in memory usage
with the Rust-based implementation.
## Design
### Architecture
![architecture](./images/architecture.png)
### Built-in VMM
#### Current Kata 2.x architecture
![not_builtin_vmm](./images/not_built_in_vmm.png)
As shown in the figure, runtime and VMM are separate processes. The runtime process forks the VMM process and interacts through the inter-process RPC. Typically, process interaction consumes more resources than peers within the process, and it will result in relatively low efficiency. At the same time, the cost of resource operation and maintenance should be considered. For example, when performing resource recovery under abnormal conditions, the exception of any process must be detected by others and activate the appropriate resource recovery process. If there are additional processes, the recovery becomes even more difficult.
#### How To Support Built-in VMM
We provide `Dragonball` Sandbox to enable built-in VMM by integrating VMM's function into the Rust library. We could perform VMM-related functionalities by using the library. Because runtime and VMM are in the same process, there is a benefit in terms of message processing speed and API synchronization. It can also guarantee the consistency of the runtime and the VMM life cycle, reducing resource recovery and exception handling maintenance, as shown in the figure:
![builtin_vmm](./images/built_in_vmm.png)
### Async Support
#### Why Need Async
**Async is already in stable Rust and allows us to write async code**
- Async provides significantly reduced CPU and memory overhead, especially for workloads with a large amount of IO-bound tasks
- Async is zero-cost in Rust, which means that you only pay for what you use. Specifically, you can use async without heap allocations and dynamic dispatch, which greatly improves efficiency
- For more (see [Why Async?](https://rust-lang.github.io/async-book/01_getting_started/02_why_async.html) and [The State of Asynchronous Rust](https://rust-lang.github.io/async-book/01_getting_started/03_state_of_async_rust.html)).
**There may be several problems if implementing kata-runtime with Sync Rust**
- Too many threads with a new TTRPC connection
- TTRPC threads: reaper thread(1) + listener thread(1) + client handler(2)
- Add 3 I/O threads with a new container
- In Sync mode, implementing a timeout mechanism is challenging. For example, in TTRPC API interaction, the timeout mechanism is difficult to align with Golang
#### How To Support Async
The kata-runtime is controlled by TOKIO_RUNTIME_WORKER_THREADS to run the OS thread, which is 2 threads by default. For TTRPC and container-related threads run in the `tokio` thread in a unified manner, and related dependencies need to be switched to Async, such as Timer, File, Netlink, etc. With the help of Async, we can easily support no-block I/O and timer. Currently, we only utilize Async for kata-runtime. The built-in VMM keeps the OS thread because it can ensure that the threads are controllable.
**For N `tokio` worker threads and M containers**
- Sync runtime(both OS thread and `tokio` task are OS thread but without `tokio` worker thread) OS thread number: 4 + 12*M
- Async runtime(only OS thread is OS thread) OS thread number: 2 + N
```shell
├─ main(OS thread)
├─ async-logger(OS thread)
└─ tokio worker(N * OS thread)
├─ agent log forwarder(1 * tokio task)
├─ health check thread(1 * tokio task)
├─ TTRPC reaper thread(M * tokio task)
├─ TTRPC listener thread(M * tokio task)
├─ TTRPC client handler thread(7 * M * tokio task)
├─ container stdin io thread(M * tokio task)
├─ container stdout io thread(M * tokio task)
└─ container stderr io thread(M * tokio task)
```
### Extensible Framework
The Kata 3.x runtime is designed with the extension of service, runtime, and hypervisor, combined with configuration to meet the needs of different scenarios. At present, the service provides a register mechanism to support multiple services. Services could interact with runtime through messages. In addition, the runtime handler handles messages from services. To meet the needs of a binary that supports multiple runtimes and hypervisors, the startup must obtain the runtime handler type and hypervisor type through configuration.
![framework](./images/framework.png)
### Resource Manager
In our case, there will be a variety of resources, and every resource has several subtypes. Especially for `Virt-Container`, every subtype of resource has different operations. And there may be dependencies, such as the share-fs rootfs and the share-fs volume will use share-fs resources to share files to the VM. Currently, network and share-fs are regarded as sandbox resources, while rootfs, volume, and cgroup are regarded as container resources. Also, we abstract a common interface for each resource and use subclass operations to evaluate the differences between different subtypes.
![resource manager](./images/resourceManager.png)
## Roadmap
- Stage 1 (June): provide basic features (current delivered)
- Stage 2 (September): support common features
- Stage 3: support full features
| **Class** | **Sub-Class** | **Development Stage** | **Status** |
| -------------------------- | ------------------- | --------------------- |------------|
| Service | task service | Stage 1 | ✅ |
| | extend service | Stage 3 | 🚫 |
| | image service | Stage 3 | 🚫 |
| Runtime handler | `Virt-Container` | Stage 1 | ✅ |
| Endpoint | VETH Endpoint | Stage 1 | ✅ |
| | Physical Endpoint | Stage 2 | ✅ |
| | Tap Endpoint | Stage 2 | ✅ |
| | `Tuntap` Endpoint | Stage 2 | ✅ |
| | `IPVlan` Endpoint | Stage 2 | ✅ |
| | `MacVlan` Endpoint | Stage 2 | ✅ |
| | MACVTAP Endpoint | Stage 3 | 🚫 |
| | `VhostUserEndpoint` | Stage 3 | 🚫 |
| Network Interworking Model | Tc filter | Stage 1 | ✅ |
| | `MacVtap` | Stage 3 | 🚧 |
| Storage | Virtio-fs | Stage 1 | ✅ |
| | `nydus` | Stage 2 | 🚧 |
| | `device mapper` | Stage 2 | 🚫 |
| `Cgroup V2` | | Stage 2 | 🚧 |
| Hypervisor | `Dragonball` | Stage 1 | 🚧 |
| | QEMU | Stage 2 | 🚫 |
| | Cloud Hypervisor | Stage 3 | 🚫 |
| | Firecracker | Stage 3 | 🚫 |
## FAQ
- Are the "service", "message dispatcher" and "runtime handler" all part of the single Kata 3.x runtime binary?
Yes. They are components in Kata 3.x runtime. And they will be packed into one binary.
1. Service is an interface, which is responsible for handling multiple services like task service, image service and etc.
2. Message dispatcher, it is used to match multiple requests from the service module.
3. Runtime handler is used to deal with the operation for sandbox and container.
- What is the name of the Kata 3.x runtime binary?
Apparently we can't use `containerd-shim-v2-kata` because it's already used. We are facing the hardest issue of "naming" again. Any suggestions are welcomed.
Internally we use `containerd-shim-v2-rund`.
- Is the Kata 3.x design compatible with the containerd shimv2 architecture?
Yes. It is designed to follow the functionality of go version kata. And it implements the `containerd shim v2` interface/protocol.
- How will users migrate to the Kata 3.x architecture?
The migration plan will be provided before the Kata 3.x is merging into the main branch.
- Is `Dragonball` limited to its own built-in VMM? Can the `Dragonball` system be configured to work using an external `Dragonball` VMM/hypervisor?
The `Dragonball` could work as an external hypervisor. However, stability and performance is challenging in this case. Built in VMM could optimise the container overhead, and it's easy to maintain stability.
`runD` is the `containerd-shim-v2` counterpart of `runC` and can run a pod/containers. `Dragonball` is a `microvm`/VMM that is designed to run container workloads. Instead of `microvm`/VMM, we sometimes refer to it as secure sandbox.
- QEMU, Cloud Hypervisor and Firecracker support are planned, but how that would work. Are they working in separate process?
Yes. They are unable to work as built in VMM.
- What is `upcall`?
The `upcall` is used to hotplug CPU/memory/MMIO devices, and it solves two issues.
1. avoid dependency on PCI/ACPI
2. avoid dependency on `udevd` within guest and get deterministic results for hotplug operations. So `upcall` is an alternative to ACPI based CPU/memory/device hotplug. And we may cooperate with the community to add support for ACPI based CPU/memory/device hotplug if needed.
`Dbs-upcall` is a `vsock-based` direct communication tool between VMM and guests. The server side of the `upcall` is a driver in guest kernel (kernel patches are needed for this feature) and it'll start to serve the requests once the kernel has started. And the client side is in VMM , it'll be a thread that communicates with VSOCK through `uds`. We have accomplished device hotplug / hot-unplug directly through `upcall` in order to avoid virtualization of ACPI to minimize virtual machine's overhead. And there could be many other usage through this direct communication channel. It's already open source.
https://github.com/openanolis/dragonball-sandbox/tree/main/crates/dbs-upcall
- The URL below says the kernel patches work with 4.19, but do they also work with 5.15+ ?
Forward compatibility should be achievable, we have ported it to 5.10 based kernel.
- Are these patches platform-specific or would they work for any architecture that supports VSOCK?
It's almost platform independent, but some message related to CPU hotplug are platform dependent.
- Could the kernel driver be replaced with a userland daemon in the guest using loopback VSOCK?
We need to create device nodes for hot-added CPU/memory/devices, so it's not easy for userspace daemon to do these tasks.
- The fact that `upcall` allows communication between the VMM and the guest suggests that this architecture might be incompatible with https://github.com/confidential-containers where the VMM should have no knowledge of what happens inside the VM.
1. `TDX` doesn't support CPU/memory hotplug yet.
2. For ACPI based device hotplug, it depends on ACPI `DSDT` table, and the guest kernel will execute `ASL` code to handle during handling those hotplug event. And it should be easier to audit VSOCK based communication than ACPI `ASL` methods.
- What is the security boundary for the monolithic / "Built-in VMM" case?
It has the security boundary of virtualization. More details will be provided in next stage.

Binary file not shown.

Before

Width:  |  Height:  |  Size: 95 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 66 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 136 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 72 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 139 KiB

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,433 @@
# Kata Containers 4.0 Architecture (Rust Runtime)
## Overview
Kata Containers 4.0 represents a significant architectural evolution, moving beyond the limitations of legacy multi-process container runtimes. Driven by a modern Rust-based stack, this release transitions to an asynchronous, unified architecture that drastically reduces resource consumption and latency.
By consolidating the entire runtime into a single, high-performance binary, Kata 4.0 eliminates the overhead of cross-process communication and streamlines the container lifecycle. The result is a secure, production-tested runtime capable of handling high-density workloads with efficiency. With built-in support for diverse container abstractions and optimized hypervisor integration, Kata 4.0 delivers the agility and robustness required by modern, cloud-native infrastructure.
---
## 1. Architecture Overview
The Kata Containers Rust Runtime is designed to minimize resource overhead and startup latency. It achieves this by shifting from traditional process-based management to a more integrated, Rust-native control flow.
```mermaid
graph TD
containerd["containerd"] --> shimv2["containerd-shim-kata-v2 (shimv2)"]
subgraph BuiltIn["Built-in VMM (Integrated Mode)"]
direction TD
subgraph shimv2_bi["shimv2 process (Single Process)"]
runtime_bi["shimv2 runtime"]
subgraph dragonball["Dragonball VMM (library)"]
helpers_bi["virtiofs / nydus\n(BuiltIn)"]
end
runtime_bi -->|"direct function calls"| dragonball
end
subgraph guestvm_bi["Guest VM"]
agent_bi["kata-agent"]
end
shimv2_bi -->|"hybrid-vsock"| guestvm_bi
end
subgraph OptionalVMM["Optional VMM (External Mode)"]
direction TD
shimv2_ext["shimv2 process"]
imagesrvd_ext["virtiofsd / nydusd\n(Independent Process)"]
ext_vmm["External VMM process\n(QEMU / Cloud-Hypervisor / Firecracker)"]
subgraph guestvm_ext["Guest VM"]
agent_ext["kata-agent"]
end
shimv2_ext -->|"fork + IPC/RPC"| ext_vmm
shimv2_ext -->|"manages"| imagesrvd_ext
ext_vmm -->|"vsock / hybrid-vsock"| guestvm_ext
end
shimv2 --> BuiltIn
shimv2 --> OptionalVMM
classDef process fill:#d0e8ff,stroke:#336,stroke-width:1px
classDef vm fill:#d4edda,stroke:#155724,stroke-width:1px
classDef agent fill:#fff3cd,stroke:#856404,stroke-width:1px
class shimv2,runtime_bi,shimv2_ext,helpers_bi,imagesrvd_ext,ext_vmm process
class guestvm_bi,guestvm_ext vm
class agent_bi,agent_ext agent
```
The runtime employs a **flexible VMM strategy**, supporting both `built-in` and `optional` VMMs. This allows users to choose between a tightly integrated VMM (e.g., Dragonball) for peak performance, or external options (e.g., QEMU, Cloud-Hypervisor, Firecracker) for enhanced compatibility and modularity.
### A. Built-in VMM (Integrated Mode)
The built-in VMM mode is the default and recommended configuration for users, as it offers superior performance and resource efficiency.
In this mode, the VMM (`Dragonball`) is **deeply integrated** into the `shimv2`'s lifecycle. This eliminates the overhead of IPC, enabling lower-latency message processing and tight API synchronization. Moreover, it ensures the runtime and VMM share a unified lifecycle, simplifying exception handling and resource cleanup.
* **Integrated Management**: The `shimv2` directly controls the VMM and its critical helper services (`virtiofsd` or `nydusd`).
* **Performance**: By eliminating external process overhead and complex inter-process communication (IPC), this mode achieves faster container startup and higher resource density.
* **Core Technology**: Primarily utilizes **Dragonball**, the native Rust-based VMM optimized and dedicated for cloud-native scenarios.
> **Note**: The built-in VMM mode is the default and recommended configuration for users, as it offers superior performance and resource efficiency.
### B. Optional VMM (External Mode)
The optional VMM mode is available for users with specific requirements that necessitate external hypervisor support.
In this mode, the runtime and the VMM operate as separate, decoupled processes. The runtime forks the VMM process and interacts with it via inter-process RPC. And the `containerd-shim-kata-v2`(short of `shimv2`) manages the VMM as an **external process**.
* **Decoupled Lifecycle**: The `shimv2` communicates with the VMM (e.g., QEMU, Cloud-Hypervisor, or Firecracker) via vsock/hybrid vsock.
* **Flexibility**: Ideal for environments that require specific hypervisor hardware emulation or legacy compatibility.
> **Note**: This approach (Optional VMM) introduces overhead due to context switching and cross-process communication. Furthermore, managing resources across process boundaries—especially during abnormal conditions—introduces significant complexity in error detection and recovery.
---
## Core Architectural Principles
* **Safety via Rust**: Leveraging Rust's ownership and type systems to eliminate memory-related vulnerabilities (buffer overflows, dangling pointers) by design.
* **Performance via Async**: Utilizing Tokio to handle high-concurrency I/O, reducing the OS thread footprint by an order of magnitude.
* **Built-in VMM**: A modular, library-based approach to virtualization, enabling tighter integration with the runtime.
* **Pluggable Framework**: A clean abstraction layer allowing seamless swapping of hypervisors, network interfaces, and storage backends.
---
## Design Deep Dive
### Built-in VMM Integration (Dragonball)
The legacy Kata 2.x architecture relied on inter-process communication (IPC) between the runtime and the VMM. This introduced context-switching latency and complex error-recovery requirements across process boundaries. In contrast, the built-in VMM approach embeds the VMM directly within the runtime's process space. This eliminates IPC overhead, allowing for direct function calls and shared memory access, resulting in significantly reduced startup times and improved performance.
```mermaid
graph LR
subgraph HostProcess["Host Process:containerd-shim-kata-v2 (shimv2)"]
shimv2["shimv2 runtime"]
end
imagesrvd["virtiofsd / nydusd\n(Independent Process)"]
subgraph ExtVMMProc["External VMM Process (e.g., QEMU)"]
vmm["VMM\n(QEMU / Cloud-Hypervisor\n/ Firecracker)"]
end
subgraph GuestVM["Guest VM"]
agent["kata-agent"]
end
shimv2 -->|"fork + IPC / RPC"| vmm
shimv2 -->|"manages"| imagesrvd
vmm -->|"vsock / hybrid-vsock"| GuestVM
classDef proc fill:#d0e8ff,stroke:#336,stroke-width:1px
classDef vm fill:#d4edda,stroke:#155724,stroke-width:1px
classDef ag fill:#fff3cd,stroke:#856404,stroke-width:1px
class shimv2,imagesrvd,vmm proc
class agent ag
```
```mermaid
graph LR
subgraph SingleProcess["Single Process: containerd-shim-kata-v2 (shimv2)"]
shimv2["shimv2 runtime"]
subgraph dragonball["Dragonball VMM (library)"]
helpers["virtiofs / nydus\n(BuiltIn)"]
end
shimv2 -->|"direct function calls"| dragonball
end
subgraph GuestVM["Guest VM"]
agent["kata-agent"]
end
dragonball -->|"hybrid-vsock"| GuestVM
classDef proc fill:#d0e8ff,stroke:#336,stroke-width:1px
classDef vm fill:#d4edda,stroke:#155724,stroke-width:1px
classDef ag fill:#fff3cd,stroke:#856404,stroke-width:1px
class shimv2,helpers proc
class agent ag
```
By integrating Dragonball directly as a library, we eliminate the need for heavy IPC.
* **API Synchronization**: Direct function calls replace RPCs, reducing latency.
* **Unified Lifecycle**: The runtime and VMM share a single process lifecycle, significantly simplifying resource cleanup and fault isolation.
### Layered Architecture
The Kata 4.0 runtime utilizes a highly modular, layered architecture designed to decouple high-level service requests from low-level infrastructure execution. This design facilitates extensibility, allowing the system to support diverse container types and dragonball within a single, unified Rust binary and also support other hypervisors as optional VMMs.
```mermaid
graph TD
subgraph L1["Layer 1 — Service & Orchestration Layer"]
TaskSvc["Task Service"]
ImageSvc["Image Service"]
OtherSvc["Other Services"]
Dispatcher["Message Dispatcher"]
TaskSvc --> Dispatcher
ImageSvc --> Dispatcher
OtherSvc --> Dispatcher
end
subgraph L2["Layer 2 — Management & Handler Layer"]
subgraph RuntimeHandler["Runtime Handler"]
SandboxMgr["Sandbox Manager"]
ContainerMgr["Container Manager"]
end
subgraph ContainerAbstractions["Container Abstractions"]
LinuxContainer["LinuxContainer"]
VirtContainer["VirtContainer"]
WasmContainer["WasmContainer"]
end
end
subgraph L3["Layer 3 — Infrastructure Abstraction Layer"]
subgraph HypervisorIface["Hypervisor Interface"]
Qemu["Qemu"]
CloudHV["Cloud Hypervisor"]
Firecracker["Firecracker"]
Dragonball["Dragonball"]
end
subgraph ResourceMgr["Resource Manager"]
Sharedfs["Sharedfs"]
Network["Network"]
Rootfs["Rootfs"]
Volume["Volume"]
Cgroup["Cgroup"]
end
end
subgraph L4["Layer 4 — Built-in Dragonball VMM Layer"]
BuiltinDB["Builtin Dragonball"]
end
Dispatcher --> RuntimeHandler
RuntimeHandler --> ContainerAbstractions
ContainerAbstractions --> HypervisorIface
ContainerAbstractions --> ResourceMgr
Dragonball --> BuiltinDB
classDef svc fill:#cce5ff,stroke:#004085,stroke-width:1px
classDef handler fill:#d4edda,stroke:#155724,stroke-width:1px
classDef infra fill:#fff3cd,stroke:#856404,stroke-width:1px
classDef builtin fill:#f8d7da,stroke:#721c24,stroke-width:1px
class TaskSvc,ImageSvc,OtherSvc,Dispatcher svc
class SandboxMgr,ContainerMgr,LinuxContainer,VirtContainer,WasmContainer handler
class Qemu,CloudHV,Firecracker,Dragonball,Sharedfs,Network,Rootfs,Volume,Cgroup infra
class BuiltinDB builtin
```
#### Service & Orchestration Layer
* **Service Layer**: The entry point for the runtime, providing specialized interfaces for external callers (e.g., `containerd`). It includes:
* **Task Service**: Manages the lifecycle of containerized processes.
* **Image Service**: Handles container image operations.
* **Other Services**: An extensible framework allowing for custom modules.
* **Message Dispatcher**: Acts as a centralized traffic controller. It parses requests from the Service layer and routes them to the appropriate **Runtime Handler**, ensuring efficient message multiplexing.
#### Management & Handler Layer
* **Runtime Handler**: The core processing engine. It abstracts the underlying workload, enabling the runtime to handle various container types through:
* **Sandbox Manager**: Orchestrates the lifecycle of the entire Pod (Sandbox).
* **Container Manager**: Manages individual containers within a Sandbox.
* **Container Abstractions**: The framework is agnostic to the container implementation, with explicit support paths for:
* **LinuxContainer** (Standard/OCI)
* **VirtContainer** (Virtualization-based)
* **WasmContainer** (WebAssembly-based)
#### Infrastructure Abstraction Layer
This layer provides standardized interfaces for hardware and resource management, regardless of the underlying backend.
* **Hypervisor Interface**: A pluggable architecture supporting multiple virtualization backends, including **Qemu**, **Cloud Hypervisor**, **Firecracker**, and **Dragonball**.
* **Resource Manager**: A unified interface for managing critical infrastructure components:
* **Sharedfs, Network, Rootfs, Volume, and cgroup management**.
#### Built-in Dragonball VMM Layer
Representing the core of the high-performance runtime, the `Builtin Dragonball` block demonstrates deep integration between the runtime and the hypervisor.
#### Key Architectural Advantages
* **Uniformity**: By consolidating these layers into a single binary, the runtime ensures a consistent state across all sub-modules, preventing the "split-brain" scenarios common in multi-process runtimes.
* **Modularity**: The clear separation between the **Message Dispatcher** and the **Runtime Handler** allows developers to introduce new container types (e.g., WASM) or hypervisors without modifying existing core logic.
* **Efficiency**: The direct integration of `Dragonball` as a library allows for "Zero-Copy" resource management and direct API access, which drastically improves performance compared to traditional RPC-based hypervisor interaction.
### Extensible Framework
The Kata Rust runtime features a modular design that supports diverse services, runtimes, and hypervisors. We utilize a registration mechanism to decouple service logic from the core runtime. At startup, the runtime resolves the required runtime handler and hypervisor types based on configuration.
```mermaid
graph LR
API["API"]
subgraph Services["Configurable Services"]
TaskSvc["Task Service"]
ImageSvc["Image Service"]
OtherSvc["Other Service"]
end
Msg(["Message Dispatcher"])
subgraph Handlers["Configurable Runtime Handlers"]
WasmC["WasmContainer"]
VirtC["VirtContainer"]
LinuxC["LinuxContainer"]
end
subgraph HVs["Configurable Hypervisors"]
DB["Dragonball"]
QEMU["QEMU"]
CH["Cloud Hypervisor"]
FC["Firecracker"]
end
API --> Services
Services --> Msg
Msg --> Handlers
Handlers --> HVs
classDef api fill:#d0e8ff,stroke:#336,stroke-width:1px
classDef svc fill:#e2d9f3,stroke:#6610f2,stroke-width:1px
classDef msg fill:#fff3cd,stroke:#856404,stroke-width:1px
classDef handler fill:#d4edda,stroke:#155724,stroke-width:1px
classDef hv fill:#f8d7da,stroke:#721c24,stroke-width:1px
class API api
class TaskSvc,ImageSvc,OtherSvc svc
class Msg msg
class WasmC,VirtC,LinuxC handler
class DB,QEMU,CH,FC hv
```
### Modular Resource Manager
Managing diverse resources—from Virtio-fs volumes to Cgroup V2—is handled by an abstracted resource manager. Each resource type implements a common trait, enabling uniform lifecycle hooks and deterministic dependency resolution.
```mermaid
graph LR
RM["Resource Manager"]
subgraph SandboxRes["Sandbox Resources"]
Network["Network Entity"]
SharedFs["Shared FS"]
end
subgraph ContainerRes["Container Resources"]
Rootfs["Rootfs"]
Cgroup["Cgroup"]
Volume["Volume"]
end
RM --> Network
RM --> SharedFs
RM --> Rootfs
RM --> Cgroup
RM --> Volume
Network --> Endpoint["endpoint\n(veth / physical)"]
Network --> NetModel["model\n(tcfilter / route)"]
SharedFs --> InlineVirtioFs["inline virtiofs"]
SharedFs --> StandaloneVirtioFs["standalone virtiofs"]
Rootfs --> RootfsTypes["block / virtiofs / nydus"]
Cgroup --> CgroupVers["v1 / v2"]
Volume --> VolumeTypes["sharefs / shm / local\nephemeral / direct / block"]
classDef rm fill:#e2d9f3,stroke:#6610f2,stroke-width:2px
classDef sandbox fill:#d0e8ff,stroke:#336,stroke-width:1px
classDef container fill:#d4edda,stroke:#155724,stroke-width:1px
classDef impl fill:#fff3cd,stroke:#856404,stroke-width:1px
class RM rm
class Network,SharedFs sandbox
class Rootfs,Cgroup,Volume container
class Endpoint,NetModel,InlineVirtioFs,StandaloneVirtioFs,RootfsTypes,CgroupVers,VolumeTypes impl
```
### Asynchronous I/O Model
Synchronous runtimes are often limited by "thread bloat," where each container or connection spawns multiple OS threads.
#### Why Async Rust?
**The Rust async ecosystem is stable and highly efficient, providing several key benefits:**
- Reduced Overhead: Significantly lower CPU and memory consumption, particularly for I/O-bound workloads.
- Zero-Cost Abstractions: Rust's async model allows developers to "pay only for what they use," avoiding heap allocations and dynamic dispatch where possible.
- For further reading, see [Why Async?](https://rust-lang.github.io/async-book/01_getting_started/02_why_async.html) and [The State of Asynchronous Rust](https://rust-lang.github.io/async-book/01_getting_started/03_state_of_async_rust.html).
**Limitations of Synchronous Rust in kata-runtime:**
- Thread Proliferation: Every TTRPC connection creates multiple threads (Reaper, Listener, Handler), and each container adds 3 additional I/O threads, leading to high thread count and memory pressure.
- Timeout Complexity: Implementing reliable, cross-platform timeout mechanisms in synchronous code is difficult, especially when aligning with Golang-based components.
#### Implementation
The kata-runtime utilizes Tokio to manage asynchronous tasks. By offloading TTRPC and container-related I/O to a unified Tokio executor and switching dependencies (Timer, File, Netlink) to their asynchronous counterparts, we achieve non-blocking I/O. The built-in VMM remains on a dedicated OS thread to ensure control and real-time performance.
**Comparison of OS Thread usage (for N tokio worker threads and M containers)**
- Sync Runtime: OS thread count scales as 4 + 12*M.
- Async Runtime: OS thread count scales as 2 + N.
```shell
├─ main(OS thread)
├─ async-logger(OS thread)
└─ tokio worker(N * OS thread)
├─ agent log forwarder(1 * tokio task)
├─ health check thread(1 * tokio task)
├─ TTRPC reaper thread(M * tokio task)
├─ TTRPC listener thread(M * tokio task)
├─ TTRPC client handler thread(7 * M * tokio task)
├─ container stdin io thread(M * tokio task)
├─ container stdout io thread(M * tokio task)
└─ container stderr io thread(M * tokio task)
```
The Async Advantage:
We move away from thread-per-task to a Tokio-driven task model.
* **Scalability**: The OS thread count is reduced from 4 + 12*M (Sync) to 2 + N (Async), where N is the worker thread count.
* **Efficiency**: Non-blocking I/O allows a single thread to handle multiplexed container operations, significantly lowering memory consumption for high-density pod deployments.
---
## 2. Getting Started
To configure your preferred VMM strategy, locate the `[hypervisor]` block in your runtime configuration file:
- Install Kata Containers with the Rust Runtime and Dragonball as the built-in VMM by following the [containerd-kata](../../how-to/containerd-kata.md).
- Run a kata with builtin VMM Dragonball
```shell
$ sudo ctr run --runtime io.containerd.kata.v2 -d docker.io/library/ubuntu:latest hello
```
As the VMM and its image service have been builtin, you should only see a single containerd-shim-kata-v2 process.
---
## FAQ
* **Q1**: Is the architecture compatible with containerd?
Yes. It implements the containerd-shim-v2 interface, ensuring drop-in compatibility with standard cloud-native tooling.
* **Q2**: What is the security boundary for the "Built-in VMM" model?
The security boundary remains established by the hypervisor (hardware virtualization). The shift to a monolithic process model does not compromise isolation; rather, it improves the integrity of the control plane by reducing the attack surface typically associated with complex IPC mechanisms.
* **Q3**: What is the migration path?
Migration is managed via configuration policies. The containerd shim configuration will allow users to toggle between the legacy runtime and the runtime-rs (internally `RunD`) binary, facilitating canary deployments and gradual migration.
* **Q4**: Why upcall instead of ACPI?
Standard ACPI-based hotplugging requires heavy guest-side kernel emulation and udevd interaction. Dbs-upcall utilizes a vsock-based direct channel to trigger hotplug events, providing:
Deterministic execution: Bypassing complex guest-side ACPI state machines.
Lower overhead: Minimizing guest kernel footprint.
* **Q5**: How upcall works?
The `Dbs-upcall` architecture consists of a server-side driver in the guest kernel and a client-side thread within the VMM. Once the guest kernel initializes, it establishes a communication channel via vsock (using uds). This allows the VMM to directly request device hot-add/hot-remove operations. We have already open-sourced this implementation: [dbs-upcall](https://github.com/openanolis/dragonball-sandbox/tree/main/crates/dbs-upcall).

View File

@@ -43,7 +43,7 @@ To fulfill the [Kata design requirements](kata-design-requirements.md), and base
|`sandbox.AddInterface(inf)`| Add new NIC to the sandbox.|
|`sandbox.RemoveInterface(inf)`| Remove a NIC from the sandbox.|
|`sandbox.ListInterfaces()`| List all NICs and their configurations in the sandbox, return a `pbTypes.Interface` list.|
|`sandbox.UpdateRoutes(routes)`| Update the sandbox route table (e.g. for portmapping support), return a `pbTypes.Route` list.|
|`sandbox.UpdateRoutes(routes)`| Update the sandbox route table (e.g. for port mapping support), return a `pbTypes.Route` list.|
|`sandbox.ListRoutes()`| List the sandbox route table, return a `pbTypes.Route` list.|
### Sandbox Relay API

View File

@@ -8,7 +8,7 @@ The following benchmarking result shows the performance improvement compared wit
## Proposal - Bring `lazyload` ability to Kata Containers
`Nydusd` is a fuse/`virtiofs` daemon which is provided by `nydus` project and it supports `PassthroughFS` and [RAFS](https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md) (Registry Acceleration File System) natively, so in Kata Containers, we can use `nydusd` in place of `virtiofsd` and mount `nydus` image to guest in the meanwhile.
`Nydusd` is a fuse/`virtiofs` daemon which is provided by `nydus` project and it supports `PassthroughFS` and [`rafs`](https://github.com/dragonflyoss/image-service/blob/master/docs/nydus-design.md) (Registry Acceleration File System) natively, so in Kata Containers, we can use `nydusd` in place of `virtiofsd` and mount `nydus` image to guest in the meanwhile.
The process of creating/starting Kata Containers with `virtiofsd`,

View File

@@ -1,137 +1,324 @@
# Virtualization in Kata Containers
Kata Containers, a second layer of isolation is created on top of those provided by traditional namespace-containers. The
hardware virtualization interface is the basis of this additional layer. Kata will launch a lightweight virtual machine,
and use the guests Linux kernel to create a container workload, or workloads in the case of multi-container pods. In Kubernetes
and in the Kata implementation, the sandbox is carried out at the pod level. In Kata, this sandbox is created using a virtual machine.
## Overview
This document describes how Kata Containers maps container technologies to virtual machines technologies, and how this is realized in
the multiple hypervisors and virtual machine monitors that Kata supports.
Kata Containers creates a second layer of isolation on top of traditional namespace-based containers using hardware virtualization. Kata launches a lightweight virtual machine (VM) and uses the guest Linux kernel to create container workloads. In Kubernetes, the sandbox is implemented at the pod level using VMs.
## Mapping container concepts to virtual machine technologies
This document describes:
A typical deployment of Kata Containers will be in Kubernetes by way of a Container Runtime Interface (CRI) implementation. On every node,
Kubelet will interact with a CRI implementer (such as containerd or CRI-O), which will in turn interface with Kata Containers (an OCI based runtime).
- How Kata Containers maps container technologies to virtualization technologies
- The multiple hypervisors and Virtual Machine Monitors (VMMs) supported by Kata
- Guidance for selecting the appropriate hypervisor for your use case
The CRI API, as defined at the [Kubernetes CRI-API repo](https://github.com/kubernetes/cri-api/), implies a few constructs being supported by the
CRI implementation, and ultimately in Kata Containers. In order to support the full [API](https://github.com/kubernetes/cri-api/blob/a6f63f369f6d50e9d0886f2eda63d585fbd1ab6a/pkg/apis/runtime/v1alpha2/api.proto#L34-L110) with the CRI-implementer, Kata must provide the following constructs:
### Architecture
![API to construct](./arch-images/api-to-construct.png)
A typical Kata Containers deployment integrates with Kubernetes through a Container Runtime Interface (CRI) implementation:
These constructs can then be further mapped to what devices are necessary for interfacing with the virtual machine:
```
Kubelet → CRI (containerd/CRI-O) → Kata Containers (OCI runtime) → VM → Containers
```
![construct to VM concept](./arch-images/construct-to-vm-concept.png)
The CRI API requires Kata to support the following constructs:
Ultimately, these concepts map to specific para-virtualized devices or virtualization technologies.
| CRI Construct | VM Equivalent | Virtualization Technology |
|---------------|---------------|---------------------------|
| Pod Sandbox | VM | Hypervisor/VMM |
| Container | Process in VM | Namespace/Cgroup in guest |
| Network | Network Interface | virtio-net, vhost-net, physical, etc. |
| Storage | Block/File Device | virtio-block, virtio-scsi, virtio-fs |
| Compute | vCPU/Memory | KVM, ACPI hotplug |
![VM concept to underlying technology](./arch-images/vm-concept-to-tech.png)
### Mapping Container Concepts to Virtualization Technologies
Each hypervisor or VMM varies on how or if it handles each of these.
Kata Containers implements the Kubernetes Container Runtime Interface (CRI) to provide pod and container lifecycle management. The CRI API defines abstractions that Kata must translate into virtualization primitives.
## Kata Containers Hypervisor and VMM support
The mapping from CRI constructs to virtualization technologies follows a three-layer model:
Kata Containers [supports multiple hypervisors](../hypervisors.md).
```
CRI API Constructs → VM Abstractions → Para-virtualized Devices
```
Details of each solution and a summary are provided below.
**Layer 1: CRI API Constructs**
The CRI API ([kubernetes/cri-api](https://github.com/kubernetes/cri-api)) defines the following abstractions that Kata must implement:
| Construct | Description |
|-----------|-------------|
| Pod Sandbox | Isolated execution environment for containers |
| Container | Process workload within a sandbox |
| Network | Pod and container networking interfaces |
| Storage | Volume mounts and image storage |
| RuntimeConfig | Resource constraints (CPU, memory, cgroups) |
![CRI API to Kata Constructs](./arch-images/api-to-construct.png)
**Layer 2: VM Abstractions**
Kata translates CRI constructs into VM-level concepts:
| CRI Construct | VM Equivalent |
|---------------|---------------|
| Pod Sandbox | Virtual Machine |
| Container | Process/namespace in guest OS |
| Network | Virtual NIC (vNIC) |
| Storage | Virtual block device or filesystem |
| RuntimeConfig | VM resources (vCPU, memory) |
![Kata Constructs to VM Concepts](./arch-images/construct-to-vm-concept.png)
**Layer 3: Para-virtualized Devices**
VM abstractions are realized through para-virtualized drivers for optimal performance:
| VM Concept | Device Technology |
|------------|-------------------|
| vNIC | virtio-net, vhost-net, macvtap |
| Block Storage | virtio-block, virtio-scsi |
| Shared Filesystem | virtio-fs |
| Agent Communication | virtio-vsock |
| Device Passthrough | VFIO with IOMMU |
![VM Concepts to Underlying Technology](./arch-images/vm-concept-to-tech.png)
> **Note:** Each hypervisor implements these mappings differently based on its device model and feature set. See the [Hypervisor Details](#hypervisor-details) section for specific implementations.
### Device Mapping
Container constructs map to para-virtualized devices:
| Construct | Device Type | Technology |
|-----------|-------------|------------|
| Network | Network Interface | virtio-net, vhost-net |
| Storage (ephemeral) | Block Device | virtio-block, virtio-scsi |
| Storage (shared) | Filesystem | virtio-fs |
| Communication | Socket | virtio-vsock |
| GPU/Passthrough | PCI Device | VFIO, IOMMU |
## Supported Hypervisors and VMMs
Kata Containers supports multiple hypervisors, each with different characteristics:
| Hypervisor | Language | Architectures | Type |
|------------|----------|---------------|------|
| [QEMU] | C | x86_64, aarch64, ppc64le, s390x, risc-v | Type 2 (KVM) |
| [Cloud Hypervisor] | Rust | x86_64, aarch64 | Type 2 (KVM) |
| [Firecracker] | Rust | x86_64, aarch64 | Type 2 (KVM) |
| `Dragonball` | Rust | x86_64, aarch64 | Type 2 (KVM) Built-in |
> **Note:** All supported hypervisors use KVM (Kernel-based Virtual Machine) as the underlying hardware virtualization interface on Linux.
## Hypervisor Details
### QEMU/KVM
Kata Containers with QEMU has complete compatibility with Kubernetes.
QEMU is the most mature and feature-complete hypervisor option for Kata Containers.
Depending on the host architecture, Kata Containers supports various machine types,
for example `q35` on x86 systems, `virt` on ARM systems and `pseries` on IBM Power systems. The default Kata Containers
machine type is `q35`. The machine type and its [`Machine accelerators`](#machine-accelerators) can
be changed by editing the runtime [`configuration`](architecture/README.md#configuration) file.
**Machine Types:**
Devices and features used:
- virtio VSOCK or virtio serial
- virtio block or virtio SCSI
- [virtio net](https://www.redhat.com/en/virtio-networking-series)
- virtio fs or virtio 9p (recommend: virtio fs)
- VFIO
- hotplug
- machine accelerators
- `q35` (x86_64, default)
- `s390x` (s390x)
- `virt` (aarch64)
- `pseries` (ppc64le)
- `risc-v` (riscv64, experimental)
Machine accelerators and hotplug are used in Kata Containers to manage resource constraints, improve boot time and reduce memory footprint. These are documented below.
**Devices and Features:**
#### Machine accelerators
- virtio-vsock (agent communication)
- virtio-block or virtio-scsi (storage)
- virtio-net/vhost-net/vhost-user-net (networking)
- virtio-fs (shared filesystem, virtio-fs recommended)
- VFIO (device passthrough)
- CPU and memory hotplug
- NVDIMM (x86_64, for rootfs as persistent memory)
Machine accelerators are architecture specific and can be used to improve the performance
and enable specific features of the machine types. The following machine accelerators
are used in Kata Containers:
**Use Cases:**
- NVDIMM: This machine accelerator is x86 specific and only supported by `q35` machine types.
`nvdimm` is used to provide the root filesystem as a persistent memory device to the Virtual Machine.
- Production workloads requiring full CRI API compatibility
- Scenarios requiring device passthrough (VFIO)
- Multi-architecture deployments
#### Hotplug devices
**Configuration:** See [`configuration-qemu.toml`](../../src/runtime/config/configuration-qemu.toml.in)
The Kata Containers VM starts with a minimum amount of resources, allowing for faster boot time and a reduction in memory footprint. As the container launch progresses,
devices are hotplugged to the VM. For example, when a CPU constraint is specified which includes additional CPUs, they can be hot added. Kata Containers has support
for hot-adding the following devices:
- Virtio block
- Virtio SCSI
- VFIO
- CPU
### Dragonball (Built-in VMM)
### Firecracker/KVM
Dragonball is a Rust-based VMM integrated directly into the Kata Containers Rust runtime as a library.
Firecracker, built on many rust crates that are within [rust-VMM](https://github.com/rust-vmm), has a very limited device model, providing a lighter
footprint and attack surface, focusing on function-as-a-service like use cases. As a result, Kata Containers with Firecracker VMM supports a subset of the CRI API.
Firecracker does not support file-system sharing, and as a result only block-based storage drivers are supported. Firecracker does not support device
hotplug nor does it support VFIO. As a result, Kata Containers with Firecracker VMM does not support updating container resources after boot, nor
does it support device passthrough.
**Advantages:**
Devices used:
- virtio VSOCK
- virtio block
- virtio net
- **Zero IPC overhead**: VMM runs in the same process as the runtime
- **Unified lifecycle**: Simplified resource management and error handling
- **Optimized for containers**: Purpose-built for container workloads
- **Upcall support**: Direct VMM-to-Guest communication for efficient hotplug operations
- **Low resource overhead**: Minimal CPU and memory footprint
**Architecture:**
```
┌─────────────────────────────────────────┐
│ Kata Containers Runtime (Rust) │
│ ┌─────────────────────────────────┐ │
│ │ Dragonball VMM Library │ │
│ └─────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
**Features:**
- Built-in virtio-fs/nydus support
- Async I/O via Tokio
- Single binary deployment
- Optimized startup latency
**Use Cases:**
- Default choice for most container workloads
- High-density container deployments and low resource overhead scenarios
- Scenarios requiring optimal startup performance
**Configuration:** See [`configuration-dragonball.toml`](../../src/runtime-rs/config/configuration-dragonball.toml.in)
### Cloud Hypervisor/KVM
[Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor), based
on [rust-vmm](https://github.com/rust-vmm), is designed to have a
lighter footprint and smaller attack surface for running modern cloud
workloads. Kata Containers with Cloud
Hypervisor provides mostly complete compatibility with Kubernetes
comparable to the QEMU configuration. As of the 1.12 and 2.0.0 release
of Kata Containers, the Cloud Hypervisor configuration supports both CPU
and memory resize, device hotplug (disk and VFIO), file-system sharing through virtio-fs,
block-based volumes, booting from VM images backed by pmem device, and
fine-grained seccomp filters for each VMM threads (e.g. all virtio
device worker threads).
Cloud Hypervisor is a Rust-based VMM designed for modern cloud workloads with a focus on performance and security.
Devices and features used:
- virtio VSOCK or virtio serial
- virtio block
- virtio net
- virtio fs
- virtio pmem
- VFIO
- hotplug
- seccomp filters
- [HTTP OpenAPI](https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/vmm/src/api/openapi/cloud-hypervisor.yaml)
**Features:**
### StratoVirt/KVM
- CPU and memory resize
- Device hotplug (disk, VFIO)
- virtio-fs (shared filesystem)
- virtio-pmem (persistent memory)
- virtio-block (block storage)
- virtio-vsock (agent communication)
- Fine-grained seccomp filters per VMM thread
- HTTP OpenAPI for management
[StratoVirt](https://gitee.com/openeuler/stratovirt) is an enterprise-level open source VMM oriented to cloud data centers, implements a unified architecture to support Standard-VMs, containers and serverless (Micro-VM). StratoVirt has some competitive advantages, such as lightweight and low resource overhead, fast boot, hardware acceleration, and language-level security with Rust.
**Use Cases:**
Currently, StratoVirt in Kata supports Micro-VM machine type, mainly focus on FaaS cases, supporting device hotplug (virtio block), file-system sharing through virtio fs and so on. Kata Containers with StratoVirt now use virtio-mmio bus as driver, and doesn't support CPU/memory resize nor VFIO, thus doesn't support updating container resources after booted.
- High-performance cloud-native workloads
- Applications requiring memory/CPU resizing
- Security-sensitive deployments (seccomp isolation)
Devices and features used currently:
- Micro-VM machine type for FaaS(mmio, no ACPI)
- Virtual Socket(vhost VSOCK、virtio console)
- Virtual Storage(virtio block, mmio)
- Virtual Networking(virtio net, mmio)
- Shared Filesystem(virtio fs)
- Device Hotplugging(virtio block hotplug)
- Entropy Source(virtio RNG)
- QMP API
**Configuration:** See [`configuration-cloud-hypervisor.toml`](../../src/runtime-rs/config/configuration-cloud-hypervisor.toml.in)
### Summary
### Firecracker/KVM
| Solution | release introduced | brief summary |
|-|-|-|
| Cloud Hypervisor | 1.10 | upstream Cloud Hypervisor with rich feature support, e.g. hotplug, VFIO and FS sharing|
| Firecracker | 1.5 | upstream Firecracker, rust-VMM based, no VFIO, no FS sharing, no memory/CPU hotplug |
| QEMU | 1.0 | upstream QEMU, with support for hotplug and filesystem sharing |
| StratoVirt | 3.3 | upstream StratoVirt with FS sharing and virtio block hotplug, no VFIO, no CPU/memory resize |
Firecracker is a minimalist VMM built on rust-vmm crates, optimized for serverless and FaaS workloads.
**Devices:**
- virtio-vsock (agent communication)
- virtio-block (block storage)
- virtio-net (networking)
**Limitations:**
- No filesystem sharing (virtio-fs not supported)
- No device hotplug
- No VFIO/passthrough support
- No CPU/memory hotplug
- Limited CRI API support
**Use Cases:**
- Serverless/FaaS workloads
- Single-tenant microVMs
- Scenarios prioritizing minimal attack surface
**Configuration:** See [`configuration-fc.toml`](../../src/runtime/config/configuration-fc.toml.in)
## Hypervisor Comparison Summary
| Feature | QEMU | Cloud Hypervisor | Firecracker | Dragonball |
|---------|------|------------------|-------------|------------|
| Maturity | Excellent | Good | Good | Good |
| CRI Compatibility | Full | Full | Partial | Full |
| Filesystem Sharing | ✓ | ✓ | ✗ | ✓ |
| Device Hotplug | ✓ | ✓ | ✗ | ✓ |
| VFIO/Passthrough | ✓ | ✓ | ✗ | ✓ |
| CPU/Memory Hotplug | ✓ | ✓ | ✗ | ✓ |
| Security Isolation | Good | Excellent (seccomp) | Excellent | Excellent |
| Startup Latency | Good | Excellent | Excellent | Best |
| Resource Overhead | Medium | Low | Lowest | Lowest |
## Choosing a Hypervisor
### Decision Matrix
| Requirement | Recommended Hypervisor |
|-------------|------------------------|
| Full CRI API compatibility | QEMU, Cloud Hypervisor, Dragonball |
| Device passthrough (VFIO) | QEMU, Cloud Hypervisor, Dragonball |
| Minimal resource overhead | Dragonball, Firecracker |
| Fastest startup time | Dragonball, Firecracker |
| Serverless/FaaS | Dragonball, Firecracker |
| Production workloads | Dragonball, QEMU |
| Memory/CPU resizing | Dragonball, Cloud Hypervisor, QEMU |
| Maximum security isolation | Cloud Hypervisor (seccomp), Firecracker, Dragonball |
| Multi-architecture | QEMU |
### Recommendations
**For Most Users:** Use the default Dragonball VMM with the Kata Containers Rust runtime. It provides the best balance of performance, security, and container density.
**For Device Passthrough:** Use QEMU, Cloud Hypervisor, or Dragonball if you require VFIO device assignment.
**For Serverless:** Use Dragonball or Firecracker for ultra-lightweight, single-tenant microVMs.
**For Legacy/Ecosystem Compatibility:** Use QEMU for its extensive hardware emulation and multi-architecture support.
## Hypervisor Configuration
### Configuration Files
Each hypervisor has a dedicated configuration file:
| Hypervisor | Rust Runtime Configuration | Go Runtime Configuration |
|------------|----------------|-----------------|
| QEMU |`configuration-qemu-runtime-rs.toml` |`configuration-qemu.toml` |
| Cloud Hypervisor | `configuration-cloud-hypervisor.toml` | `configuration-clh.toml` |
| Firecracker | `configuration-rs-fc.toml` | `configuration-fc.toml` |
| Dragonball | `configuration-dragonball.toml` (default) | `No` |
> **Note:** Configuration files are typically installed in `/opt/kata/share/defaults/kata-containers/` or `/opt/kata/share/defaults/kata-containers/runtime-rs/` or `/usr/share/defaults/kata-containers/`.
### Switching Hypervisors
Use the `kata-manager` tool to switch the configured hypervisor:
```bash
# List available hypervisors
$ kata-manager -L
# Switch to a different hypervisor
$ sudo kata-manager -S <hypervisor-name>
```
For detailed instructions, see the [`kata-manager` documentation](../../utils/README.md).
## Hypervisor Versions
The following versions are used in this release (from [versions.yaml](../../versions.yaml)):
| Hypervisor | Version | Repository |
|------------|---------|------------|
| Cloud Hypervisor | v51.1 | https://github.com/cloud-hypervisor/cloud-hypervisor |
| Firecracker | v1.12.1 | https://github.com/firecracker-microvm/firecracker |
| QEMU | v10.2.1 | https://github.com/qemu/qemu |
| Dragonball | builtin | https://github.com/kata-containers/kata-containers/tree/main/src/dragonball |
> **Note:** Dragonball is integrated into the Kata Containers Rust runtime and does not have a separate version number.
> For the latest hypervisor versions, see the [versions.yaml](../../versions.yaml) file in the Kata Containers repository.
## References
- [Kata Containers Architecture](./architecture/README.md)
- [Configuration Guide](../../src/runtime/README.md#configuration)
- [QEMU Documentation](https://www.qemu.org/documentation/)
- [Cloud Hypervisor Documentation](https://github.com/cloud-hypervisor/cloud-hypervisor/blob/main/docs/api.md)
- [Firecracker Documentation](https://github.com/firecracker-microvm/firecracker/tree/main/docs)
- [Dragonball Source](https://github.com/kata-containers/kata-containers/tree/main/src/dragonball)
[KVM]: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
[QEMU]: https://www.qemu.org
[Cloud Hypervisor]: https://github.com/cloud-hypervisor/cloud-hypervisor
[Firecracker]: https://github.com/firecracker-microvm/firecracker
[`Dragonball`]: https://github.com/kata-containers/kata-containers/tree/main/src/dragonball

264
docs/helm-configuration.md Normal file
View File

@@ -0,0 +1,264 @@
# Helm Configuration
## Parameters
The helm chart provides a comprehensive set of configuration options. You may view the parameters and their descriptions by going to the [GitHub source](https://github.com/kata-containers/kata-containers/blob/main/tools/packaging/kata-deploy/helm-chart/kata-deploy/values.yaml) or by using helm:
```sh
# List available kata-deploy chart versions:
# helm search repo kata-deploy-charts/kata-deploy --versions
#
# Then replace X.Y.Z below with the desired chart version:
helm show values --version X.Y.Z oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy
```
### shims
Kata ships with a number of pre-built artifacts and runtimes. You may selectively enable or disable specific shims. For example:
```yaml title="values.yaml"
shims:
disableAll: true
qemu:
enabled: true
qemu-nvidia-gpu:
enabled: true
qemu-nvidia-gpu-snp:
enabled: false
```
Shims can also have configuration options specific to them:
```yaml
qemu-nvidia-gpu:
enabled: ~
supportedArches:
- amd64
- arm64
allowedHypervisorAnnotations: []
containerd:
snapshotter: ""
runtimeClass:
# This label is automatically added by gpu-operator. Override it
# if you want to use a different label.
# Uncomment once GPU Operator v26.3 is out
# nodeSelector:
# nvidia.com/cc.ready.state: "false"
```
It's best to reference the default `values.yaml` file above for more details.
### Custom Runtimes
Kata allows you to create custom runtime configurations. This is done by overlaying one of the pre-existing runtime configs with user-provided configs. For example, we can use the `qemu-nvidia-gpu` as a base config and overlay our own parameters to it:
```yaml
customRuntimes:
enabled: false
runtimes:
my-gpu-runtime:
baseConfig: "qemu-nvidia-gpu" # Required: existing config to use as base
dropIn: | # Optional: overrides via config.d mechanism
[hypervisor.qemu]
default_memory = 1024
default_vcpus = 4
runtimeClass: |
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: kata-my-gpu-runtime
labels:
app.kubernetes.io/managed-by: kata-deploy
handler: kata-my-gpu-runtime
overhead:
podFixed:
memory: "640Mi"
cpu: "500m"
scheduling:
nodeSelector:
katacontainers.io/kata-runtime: "true"
# Optional: CRI-specific configuration
containerd:
snapshotter: "nydus" # Configure containerd snapshotter (nydus, erofs, etc.)
crio:
pullType: "guest-pull" # Configure CRI-O runtime_pull_image = true
```
Again, view the default [`values.yaml`](#parameters) file for more details.
## Examples
We provide a few examples that you can pass to helm via the `-f`/`--values` flag.
### [`try-kata-tee.values.yaml`](https://github.com/kata-containers/kata-containers/blob/main/tools/packaging/kata-deploy/helm-chart/kata-deploy/try-kata-tee.values.yaml)
This file enables only the TEE (Trusted Execution Environment) shims for confidential computing:
```sh
helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
--version VERSION \
-f try-kata-tee.values.yaml
```
Includes:
- `qemu-snp` - AMD SEV-SNP (amd64)
- `qemu-tdx` - Intel TDX (amd64)
- `qemu-se` - IBM Secure Execution for Linux (SEL) (s390x)
- `qemu-se-runtime-rs` - IBM Secure Execution for Linux (SEL) Rust runtime (s390x)
- `qemu-cca` - Arm Confidential Compute Architecture (arm64)
- `qemu-coco-dev` - Confidential Containers development (amd64, s390x)
- `qemu-coco-dev-runtime-rs` - Confidential Containers development Rust runtime (amd64, s390x)
### [`try-kata-nvidia-gpu.values.yaml`](https://github.com/kata-containers/kata-containers/blob/main/tools/packaging/kata-deploy/helm-chart/kata-deploy/try-kata-nvidia-gpu.values.yaml)
This file enables only the NVIDIA GPU-enabled shims:
```sh
helm install kata-deploy oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy \
--version VERSION \
-f try-kata-nvidia-gpu.values.yaml
```
Includes:
- `qemu-nvidia-gpu` - Standard NVIDIA GPU support (amd64, arm64)
- `qemu-nvidia-gpu-snp` - NVIDIA GPU with AMD SEV-SNP (amd64)
- `qemu-nvidia-gpu-tdx` - NVIDIA GPU with Intel TDX (amd64)
### `nodeSelector`
We can deploy Kata only to specific nodes using `nodeSelector`
```sh
# First, label the nodes where you want kata-containers to be installed
$ kubectl label nodes worker-node-1 kata-containers=enabled
$ kubectl label nodes worker-node-2 kata-containers=enabled
# Then install the chart with `nodeSelector`
$ helm install kata-deploy \
--set nodeSelector.kata-containers="enabled" \
"${CHART}" --version "${VERSION}"
```
You can also use a values file:
```yaml title="values.yaml"
nodeSelector:
kata-containers: "enabled"
node-type: "worker"
```
```sh
$ helm install kata-deploy -f values.yaml "${CHART}" --version "${VERSION}"
```
### Multiple Kata installations on the Same Node
For debugging, testing and other use-case it is possible to deploy multiple
versions of Kata on the very same node. All the needed artifacts are getting the
`multiInstallSuffix` appended to distinguish each installation. **BEWARE** that one
needs at least **containerd-2.0** since this version has drop-in conf support
which is a prerequisite for the `multiInstallSuffix` to work properly.
```sh
$ helm install kata-deploy-cicd \
-n kata-deploy-cicd \
--set env.multiInstallSuffix=cicd \
--set env.debug=true \
"${CHART}" --version "${VERSION}"
```
Note: `runtimeClasses` are automatically created by Helm (via
`runtimeClasses.enabled=true`, which is the default).
Now verify the installation by examining the `runtimeClasses`:
```sh
$ kubectl get runtimeClasses
NAME HANDLER AGE
kata-clh-cicd kata-clh-cicd 77s
kata-cloud-hypervisor-cicd kata-cloud-hypervisor-cicd 77s
kata-dragonball-cicd kata-dragonball-cicd 77s
kata-fc-cicd kata-fc-cicd 77s
kata-qemu-cicd kata-qemu-cicd 77s
kata-qemu-coco-dev-cicd kata-qemu-coco-dev-cicd 77s
kata-qemu-nvidia-gpu-cicd kata-qemu-nvidia-gpu-cicd 77s
kata-qemu-nvidia-gpu-snp-cicd kata-qemu-nvidia-gpu-snp-cicd 77s
kata-qemu-nvidia-gpu-tdx-cicd kata-qemu-nvidia-gpu-tdx-cicd 76s
kata-qemu-runtime-rs-cicd kata-qemu-runtime-rs-cicd 77s
kata-qemu-se-runtime-rs-cicd kata-qemu-se-runtime-rs-cicd 77s
kata-qemu-snp-cicd kata-qemu-snp-cicd 77s
kata-qemu-tdx-cicd kata-qemu-tdx-cicd 77s
kata-stratovirt-cicd kata-stratovirt-cicd 77s
```
## RuntimeClass Node Selectors for TEE Shims
**Manual configuration:** Any `nodeSelector` you set under `shims.<shim>.runtimeClass.nodeSelector`
is **always applied** to that shim's RuntimeClass, whether or not NFD is present. Use this when
you want to pin TEE workloads to specific nodes (e.g. without NFD, or with custom labels).
**Auto-inject when NFD is present:** If you do *not* set a `runtimeClass.nodeSelector` for a
TEE shim, the chart can **automatically inject** NFD-based labels when NFD is detected in the
cluster (deployed by this chart with `node-feature-discovery.enabled=true` or found externally):
- AMD SEV-SNP shims: `amd.feature.node.kubernetes.io/snp: "true"`
- Intel TDX shims: `intel.feature.node.kubernetes.io/tdx: "true"`
- IBM Secure Execution for Linux (SEL) shims (s390x): `feature.node.kubernetes.io/cpu-security.se.enabled: "true"`
The chart uses Helm's `lookup` function to detect NFD (by looking for the
`node-feature-discovery-worker` DaemonSet). Auto-inject only runs when NFD is detected and
no manual `runtimeClass.nodeSelector` is set for that shim.
**Note**: NFD detection requires cluster access. During `helm template` (dry-run without a
cluster), external NFD is not seen, so auto-injected labels are not added. Manual
`runtimeClass.nodeSelector` values are still applied in all cases.
## Customizing Configuration with Drop-in Files
When kata-deploy installs Kata Containers, the base configuration files should not
be modified directly. Instead, use drop-in configuration files to customize
settings. This approach ensures your customizations survive kata-deploy upgrades.
### How Drop-in Files Work
The Kata runtime reads the base configuration file and then applies any `.toml`
files found in the `config.d/` directory alongside it. Files are processed in
alphabetical order, with later files overriding earlier settings.
### Creating Custom Drop-in Files
To add custom settings, create a `.toml` file in the appropriate `config.d/`
directory. Use a numeric prefix to control the order of application.
**Reserved prefixes** (used by kata-deploy):
- `10-*`: Core kata-deploy settings
- `20-*`: Debug settings
- `30-*`: Kernel parameters
**Recommended prefixes for custom settings**: `50-89`
### Drop-In Config Examples
#### Adding Custom Kernel Parameters
```bash
# SSH into the node or use kubectl exec
sudo mkdir -p /opt/kata/share/defaults/kata-containers/runtimes/qemu/config.d/
sudo cat > /opt/kata/share/defaults/kata-containers/runtimes/qemu/config.d/50-custom.toml << 'EOF'
[hypervisor.qemu]
kernel_params = "my_param=value"
EOF
```
#### Changing Default Memory Size
```bash
sudo cat > /opt/kata/share/defaults/kata-containers/runtimes/qemu/config.d/50-memory.toml << 'EOF'
[hypervisor.qemu]
default_memory = 4096
EOF
```

View File

@@ -3,9 +3,9 @@
## Kubernetes Integration
- [Run Kata containers with `crictl`](run-kata-with-crictl.md)
- [Run Kata Containers with Kubernetes](run-kata-with-k8s.md)
- [How to use Kata Containers and Containerd](containerd-kata.md)
- [How to use Kata Containers and containerd with Kubernetes](how-to-use-k8s-with-containerd-and-kata.md)
- [How to use Kata Containers and CRI-O with Kubernetes](how-to-use-k8s-with-crio-and-kata.md)
- [Kata Containers and service mesh for Kubernetes](service-mesh.md)
- [How to import Kata Containers logs into Fluentd](how-to-import-kata-logs-with-fluentd.md)
@@ -50,3 +50,4 @@
- [How to pull images in the guest](how-to-pull-images-in-guest-with-kata.md)
- [How to use mem-agent to decrease the memory usage of Kata container](how-to-use-memory-agent.md)
- [How to use seccomp with runtime-rs](how-to-use-seccomp-with-runtime-rs.md)
- [How to use passthroughfd-IO with runtime-rs and Dragonball](how-to-use-passthroughfd-io-within-runtime-rs.md)

View File

@@ -5,7 +5,7 @@ and [Kata Containers](https://katacontainers.io). The containerd provides not on
command line tool, but also the [CRI](https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/)
interface for [Kubernetes](https://kubernetes.io) and other CRI clients.
This document is primarily written for Kata Containers v1.5.0-rc2 or above, and containerd v1.2.0 or above.
This document is primarily written for Kata Containers v3.28 or above, and containerd v1.7.0 or above.
Previous versions are addressed here, but we suggest users upgrade to the newer versions for better support.
## Concepts
@@ -14,7 +14,7 @@ Previous versions are addressed here, but we suggest users upgrade to the newer
[`RuntimeClass`](https://kubernetes.io/docs/concepts/containers/runtime-class/) is a Kubernetes feature first
introduced in Kubernetes 1.12 as alpha. It is the feature for selecting the container runtime configuration to
use to run a pods containers. This feature is supported in `containerd` since [v1.2.0](https://github.com/containerd/containerd/releases/tag/v1.2.0).
use to run a pod's containers. This feature is supported in `containerd` since [v1.2.0](https://github.com/containerd/containerd/releases/tag/v1.2.0).
Before the `RuntimeClass` was introduced, Kubernetes was not aware of the difference of runtimes on the node. `kubelet`
creates Pod sandboxes and containers through CRI implementations, and treats all the Pods equally. However, there
@@ -23,7 +23,7 @@ workloads with isolated sandboxes (i.e. Kata Containers).
As a result, the CRI implementations extended their semantics for the requirements:
- At the beginning, [Frakti](https://github.com/kubernetes/frakti) checks the network configuration of a Pod, and
- At the beginning, [`Frakti`](https://github.com/kubernetes/frakti) checks the network configuration of a Pod, and
treat Pod with `host` network as trusted, while others are treated as untrusted.
- The containerd introduced an annotation for untrusted Pods since [v1.0](https://github.com/containerd/cri/blob/v1.0.0-rc.0/docs/config.md):
```yaml
@@ -123,18 +123,56 @@ The following sections outline how to add Kata Containers to the configurations.
#### Kata Containers as a `RuntimeClass`
For
- Kata Containers v1.5.0 or above (including `1.5.0-rc`)
- Containerd v1.2.0 or above
- Kubernetes v1.12.0 or above
For Kubernetes users, we suggest using `RuntimeClass` to select Kata Containers as the runtime for untrusted workloads. The configuration is as follows:
- Kata Containers v3.28.0 or above
- Containerd v1.7.0 or above
- Kubernetes v1.33 or above
The `RuntimeClass` is suggested.
The following example registers custom runtimes into containerd:
You can check the detailed information about the configuration of containerd in the [Containerd config documentation](https://github.com/containerd/containerd/blob/main/docs/cri/config.md).
+ In containerd 2.x
```toml
version = 3
[plugins."io.containerd.cri.v1.runtime".containerd]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml"
```
+ In containerd 1.7.x
```toml
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml"
```
The following configuration includes two runtime classes:
- `plugins.cri.containerd.runtimes.runc`: the runc, and it is the default runtime.
- `plugins.cri.containerd.runtimes.kata`: The function in containerd (reference [the document here](https://github.com/containerd/containerd/tree/main/core/runtime/v2))
- `plugins.<X>.containerd.runtimes.runc`: the runc, and it is the default runtime.
- `plugins.<X>.containerd.runtimes.kata`: The function in containerd (reference [the document here](https://github.com/containerd/containerd/tree/main/core/runtime/v2))
where the dot-connected string `io.containerd.kata.v2` is translated to `containerd-shim-kata-v2` (i.e. the
binary name of the Kata implementation of [Containerd Runtime V2 (Shim API)](https://github.com/containerd/containerd/tree/main/core/runtime/v2)).
binary name of the Kata implementation of [Containerd Runtime V2 (Shim API)](https://github.com/containerd/containerd/tree/main/core/runtime/v2)). By default, the `containerd-shim-kata-v2` (short of `shimv2`) binary will be installed under the path of `/usr/local/bin/`.
And `<X>` is `io.containerd.cri.v1.runtime` for containerd v2.x and `io.containerd.grpc.v1.cri` for containerd v1.7.x.
+ In containerd 1.7.x
```toml
[plugins.cri.containerd]
@@ -149,7 +187,7 @@ The following configuration includes two runtime classes:
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
[plugins.cri.containerd.runtimes.kata]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
privileged_without_host_devices = true
pod_annotations = ["io.katacontainers.*"]
@@ -158,13 +196,71 @@ The following configuration includes two runtime classes:
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml"
```
+ In containerd 2.x
```toml
[plugins."io.containerd.cri.v1.runtime".containerd]
no_pivot = false
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc]
privileged_without_host_devices = false
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc.options]
BinaryName = ""
CriuImagePath = ""
CriuPath = ""
CriuWorkPath = ""
IoGid = 0
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata]
runtime_type = "io.containerd.kata.v2"
privileged_without_host_devices = true
pod_annotations = ["io.katacontainers.*"]
container_annotations = ["io.katacontainers.*"]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration.toml"
```
`privileged_without_host_devices` tells containerd that a privileged Kata container should not have direct access to all host devices. If unset, containerd will pass all host devices to Kata container, which may cause security issues.
`pod_annotations` is the list of pod annotations passed to both the pod sandbox as well as container through the OCI config.
`container_annotations` is the list of container annotations passed through to the OCI config of the containers.
This `ConfigPath` option is optional. If you do not specify it, shimv2 first tries to get the configuration file from the environment variable `KATA_CONF_FILE`. If neither are set, shimv2 will use the default Kata configuration file paths (`/etc/kata-containers/configuration.toml` and `/usr/share/defaults/kata-containers/configuration.toml`).
This `ConfigPath` option is optional. If you want to use a different configuration file, you can specify the path of the configuration file with `ConfigPath` in the containerd configuration file. We use containerd 2.x configuration as an example here, and the configuration for containerd 1.7.x is similar, just replace `io.containerd.cri.v1.runtime` with `io.containerd.grpc.v1.cri`.
```toml
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
```
> **Note:** In this example, the specified `ConfigPath` is valid in Kubernetes/Containerd workflow with containerd v1.7+ but doesn't work with ctr and nerdctl.
If you do not specify it, `shimv2` first tries to get the configuration file from the environment variable `KATA_CONF_FILE`. If you want to adopt this way, you should first create a shell script as `containerd-shim-kata-v2` which is placed under the path of `/usr/local/bin/`. The following is an example of the shell script `containerd-shim-kata-qemu-v2` which specifies the configuration file with `KATA_CONF_FILE`
> **Note:** Just use containerd 2.x configuration as an example, the configuration for containerd 1.7.x is similar, just replace `io.containerd.cri.v1.runtime` with `io.containerd.grpc.v1.cri`
```shell
~$ cat /usr/local/bin/containerd-shim-kata-qemu-v2
#!/bin/bash
KATA_CONF_FILE=/opt/kata/share/defaults/kata-containers/configuration-qemu.toml /opt/kata/bin/containerd-shim-kata-v2 "$@"
```
And then just reference it in the configuration of containerd:
```toml
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata-qemu]
runtime_type = "io.containerd.kata-qemu.v2"
```
Finally you can run a Kata container with the runtime `io.containerd.kata-qemu.v2`:
```shell
$ sudo ctr run --cni --runtime io.containerd.kata-qemu.v2 -t --rm docker.io/library/busybox:latest hello sh
```
> **Note:** The `KATA_CONF_FILE` environment variable is valid in both Kubernetes/Containerd workflow with containerd and containerd tools(ctr, nerdctl, etc.) scenarios.
If neither are set, shimv2 will use the default Kata configuration file paths (`/etc/kata-containers/configuration.toml` and `/usr/share/defaults/kata-containers/configuration.toml` and `/opt/kata/share/defaults/kata-containers/configuration.toml`).
#### Kata Containers as the runtime for untrusted workload
@@ -173,18 +269,20 @@ for an untrusted workload. With the following configuration, you can run trusted
and then, run an untrusted workload with Kata Containers:
```toml
[plugins.cri.containerd]
# "plugins.cri.containerd.default_runtime" is the runtime to use in containerd.
[plugins.cri.containerd.default_runtime]
[plugins."io.containerd.grpc.v1.cri".containerd]
# "plugins."io.containerd.grpc.v1.cri".containerd.default_runtime" is the runtime to use in containerd.
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
# runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
runtime_type = "io.containerd.runtime.v1.linux"
# "plugins.cri.containerd.untrusted_workload_runtime" is a runtime to run untrusted workloads on it.
[plugins.cri.containerd.untrusted_workload_runtime]
# "plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime" is a runtime to run untrusted workloads on it.
[plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
# runtime_type is the runtime type to use in containerd e.g. io.containerd.runtime.v1.linux
runtime_type = "io.containerd.kata.v2"
```
> **Note:** The `untrusted_workload_runtime` is deprecated since containerd v1.7.0, and it is recommended to use `RuntimeClass` instead.
You can find more information on the [Containerd config documentation](https://github.com/containerd/containerd/blob/main/docs/cri/config.md)
#### Kata Containers as the default runtime
@@ -192,8 +290,8 @@ You can find more information on the [Containerd config documentation](https://g
If you want to set Kata Containers as the only runtime in the deployment, you can simply configure as follows:
```toml
[plugins.cri.containerd]
[plugins.cri.containerd.default_runtime]
[plugins."io.containerd.grpc.v1.cri".containerd]
[plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
runtime_type = "io.containerd.kata.v2"
```
@@ -246,11 +344,14 @@ debug: true
### Launch containers with `ctr` command line
> **Note:** With containerd command tool `ctr`, the `ConfigPath` is not supported, and the configuration file should be explicitly specified with the option `--runtime-config-path`, otherwise, it'll use the default configurations.
To run a container with Kata Containers through the containerd command line, you can run the following:
```bash
$ sudo ctr image pull docker.io/library/busybox:latest
$ sudo ctr run --cni --runtime io.containerd.run.kata.v2 -t --rm docker.io/library/busybox:latest hello sh
$ CONFIG_PATH="/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
$ sudo ctr run --cni --runtime io.containerd.kata.v2 --runtime-config-path $CONFIG_PATH -t --rm docker.io/library/busybox:latest hello sh
```
This launches a BusyBox container named `hello`, and it will be removed by `--rm` after it quits.
@@ -260,7 +361,9 @@ loopback interface is created.
### Launch containers using `ctr` command line with rootfs bundle
#### Get rootfs
Use the script to create rootfs
```bash
ctr i pull quay.io/prometheus/busybox:latest
ctr i export rootfs.tar quay.io/prometheus/busybox:latest
@@ -278,7 +381,9 @@ for ((i=0;i<$(cat ${layers_dir}/manifest.json | jq -r ".[].Layers | length");i++
tar -C ${rootfs_dir} -xf ${layers_dir}/$(cat ${layers_dir}/manifest.json | jq -r ".[].Layers[${i}]")
done
```
#### Get `config.json`
Use runc spec to generate `config.json`
```bash
cd ./bundle/rootfs
@@ -295,10 +400,13 @@ Change the root `path` in `config.json` to the absolute path of rootfs
```
#### Run container
```bash
sudo ctr run -d --runtime io.containerd.run.kata.v2 --config bundle/config.json hello
CONFIG_PATH="/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
sudo ctr run -d --runtime io.containerd.kata.v2 --runtime-config-path $CONFIG_PATH --config bundle/config.json hello
sudo ctr t exec --exec-id ${ID} -t hello sh
```
### Launch Pods with `crictl` command line
With the `crictl` command line of `cri-tools`, you can specify runtime class with `-r` or `--runtime` flag.

View File

@@ -18,7 +18,7 @@ The host kernel must be equal to or later than upstream version [6.11](https://c
[`sev-utils`](https://github.com/amd/sev-utils/blob/coco-202501150000/docs/snp.md) is an easy way to install the required host kernel with the `setup-host` command. However, it will also build compatible guest kernel, OVMF, and QEMU components which are not necessary as these components are packaged with kata. The `sev-utils` script utility can be used with these additional components to test the memory encrypted launch and attestation of a base QEMU SNP guest.
For a simplified way to build just the upstream compatible host kernel, use the Confidential Containers fork of [AMDESE AMDSEV](https://github.com/confidential-containers/amdese-amdsev/tree/amd-snp-202501150000). Individual components can be built by running the following command:
For a simplified way to build just the upstream compatible host kernel, use the Confidential Containers fork of [`amdese-amdsev`](https://github.com/confidential-containers/amdese-amdsev/tree/amd-snp-202501150000). Individual components can be built by running the following command:
```
./build.sh kernel host --install
@@ -65,7 +65,7 @@ $ ./configure --enable-virtfs --target-list=x86_64-softmmu --enable-debug
$ make -j "$(nproc)"
$ popd
```
- Create cert-chain for SNP attestation ( using [snphost](https://github.com/virtee/snphost/blob/main/docs/snphost.1.adoc) )
- Create cert-chain for SNP attestation ( using [`snphost`](https://github.com/virtee/snphost/blob/main/docs/snphost.1.adoc) )
```bash
$ git clone https://github.com/virtee/snphost.git && cd snphost/
$ cargo build
@@ -96,6 +96,10 @@ path = "/path/to/qemu/build/qemu-system-x86_64"
```toml
shared_fs = "virtio-9p"
```
- Use `blockfile` snapshotter: Since virtio-fs remains unsupported due to bugs in QEMU snp-v3, and virtio-9p is no longer supported in runtime-rs, it is recommended to use the blockfile snapshotter. This allows container images to be managed via block devices without relying on a shared file system. To enable this, set the `snapshotter` to `blockfile` in the containerd config file, please refer to [blockfile guide](https://github.com/containerd/containerd/blob/main/docs/snapshotters/blockfile.md) for more information. Additionally, shared_fs should be set to "none" since no shared file system is used.
```toml
shared_fs = "none"
```
- Disable `virtiofsd` since it is no longer required (comment out)
```toml
# virtio_fs_daemon = "/usr/libexec/virtiofsd"
@@ -178,4 +182,3 @@ sudo reboot
```bash
sudo rmmod kvm_amd && sudo modprobe kvm_amd sev_snp=0
```

View File

@@ -12,11 +12,11 @@ Currently, there is no widely applicable and convenient method available for use
According to the proposal, it requires to use the `kata-ctl direct-volume` command to add a direct assigned block volume device to the Kata Containers runtime.
And then with the help of method [get_volume_mount_info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L95), get information from JSON file: `(mountinfo.json)` and parse them into structure [Direct Volume Info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L70) which is used to save device-related information.
And then with the help of method [get_volume_mount_info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L95), get information from JSON file: `(mountInfo.json)` and parse them into structure [Direct Volume Info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L70) which is used to save device-related information.
We only fill the `mountinfo.json`, such as `device` ,`volume_type`, `fs_type`, `metadata` and `options`, which correspond to the fields in [Direct Volume Info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L70), to describe a device.
We only fill the `mountInfo.json`, such as `device` ,`volume-type`, `fstype`, `metadata` and `options`, which correspond to the fields in [Direct Volume Info](https://github.com/kata-containers/kata-containers/blob/099b4b0d0e3db31b9054e7240715f0d7f51f9a1c/src/libs/kata-types/src/mount.rs#L70), to describe a device.
The JSON file `mountinfo.json` placed in a sub-path `/kubelet/kata-test-vol-001/volume001` which under fixed path `/run/kata-containers/shared/direct-volumes/`.
The JSON file `mountInfo.json` placed in a sub-path `/kubelet/kata-test-vol-001/volume001` which under fixed path `/run/kata-containers/shared/direct-volumes/`.
And the full path looks like: `/run/kata-containers/shared/direct-volumes/kubelet/kata-test-vol-001/volume001`, But for some security reasons. it is
encoded as `/run/kata-containers/shared/direct-volumes/L2t1YmVsZXQva2F0YS10ZXN0LXZvbC0wMDEvdm9sdW1lMDAx`.
@@ -47,18 +47,18 @@ $ sudo mkfs.ext4 /tmp/stor/rawdisk01.20g
```json
{
"device": "/tmp/stor/rawdisk01.20g",
"volume_type": "directvol",
"fs_type": "ext4",
"volume-type": "directvol",
"fstype": "ext4",
"metadata":"{}",
"options": []
}
```
```bash
$ sudo kata-ctl direct-volume add /kubelet/kata-direct-vol-002/directvol002 "{\"device\": \"/tmp/stor/rawdisk01.20g\", \"volume_type\": \"directvol\", \"fs_type\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$ sudo kata-ctl direct-volume add /kubelet/kata-direct-vol-002/directvol002 "{\"device\": \"/tmp/stor/rawdisk01.20g\", \"volume-type\": \"directvol\", \"fstype\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$# /kubelet/kata-direct-vol-002/directvol002 <==> /run/kata-containers/shared/direct-volumes/W1lMa2F0ZXQva2F0YS10a2F0DAxvbC0wMDEvdm9sdW1lMDAx
$ cat W1lMa2F0ZXQva2F0YS10a2F0DAxvbC0wMDEvdm9sdW1lMDAx/mountInfo.json
{"volume_type":"directvol","device":"/tmp/stor/rawdisk01.20g","fs_type":"ext4","metadata":{},"options":[]}
{"volume-type":"directvol","device":"/tmp/stor/rawdisk01.20g","fstype":"ext4","metadata":{},"options":[]}
```
#### Run a Kata container with direct block device volume
@@ -76,7 +76,7 @@ $ sudo ctr run -t --rm --runtime io.containerd.kata.v2 --mount type=directvol,sr
> **Tip:** It only supports `vfio-pci` based PCI device passthrough mode.
In this scenario, the device's host kernel driver will be replaced by `vfio-pci`, and IOMMU group ID generated.
And either device's BDF or its VFIO IOMMU group ID in `/dev/vfio/` is fine for "device" in `mountinfo.json`.
And either device's BDF or its VFIO IOMMU group ID in `/dev/vfio/` is fine for "device" in `mountInfo.json`.
```bash
$ lspci -nn -k -s 45:00.1
@@ -92,15 +92,15 @@ $ ls /sys/kernel/iommu_groups/110/devices/
#### setup VFIO device for kata-containers
First, configure the `mountinfo.json`, as below:
First, configure the `mountInfo.json`, as below:
- (1) device with `BB:DD:F`
```json
{
"device": "45:00.1",
"volume_type": "vfiovol",
"fs_type": "ext4",
"volume-type": "vfiovol",
"fstype": "ext4",
"metadata":"{}",
"options": []
}
@@ -111,8 +111,8 @@ First, configure the `mountinfo.json`, as below:
```json
{
"device": "0000:45:00.1",
"volume_type": "vfiovol",
"fs_type": "ext4",
"volume-type": "vfiovol",
"fstype": "ext4",
"metadata":"{}",
"options": []
}
@@ -123,8 +123,8 @@ First, configure the `mountinfo.json`, as below:
```json
{
"device": "/dev/vfio/110",
"volume_type": "vfiovol",
"fs_type": "ext4",
"volume-type": "vfiovol",
"fstype": "ext4",
"metadata":"{}",
"options": []
}
@@ -133,10 +133,10 @@ First, configure the `mountinfo.json`, as below:
Second, run kata-containers with device(`/dev/vfio/110`) as an example:
```bash
$ sudo kata-ctl direct-volume add /kubelet/kata-vfio-vol-003/vfiovol003 "{\"device\": \"/dev/vfio/110\", \"volume_type\": \"vfiovol\", \"fs_type\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$ sudo kata-ctl direct-volume add /kubelet/kata-vfio-vol-003/vfiovol003 "{\"device\": \"/dev/vfio/110\", \"volume-type\": \"vfiovol\", \"fstype\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$ # /kubelet/kata-vfio-vol-003/directvol003 <==> /run/kata-containers/shared/direct-volumes/F0va22F0ZvaS12F0YS10a2F0DAxvbC0F0ZXvdm9sdF0Z0YSx
$ cat F0va22F0ZvaS12F0YS10a2F0DAxvbC0F0ZXvdm9sdF0Z0YSx/mountInfo.json
{"volume_type":"vfiovol","device":"/dev/vfio/110","fs_type":"ext4","metadata":{},"options":[]}
{"volume-type":"vfiovol","device":"/dev/vfio/110","fstype":"ext4","metadata":{},"options":[]}
```
#### Run a Kata container with VFIO block device based volume
@@ -190,25 +190,25 @@ be passed to Hypervisor, such as Dragonball, Cloud-Hypervisor, Firecracker or QE
First, `mkdir` a sub-path `kubelet/kata-test-vol-001/` under `/run/kata-containers/shared/direct-volumes/`.
Second, fill fields in `mountinfo.json`, it looks like as below:
Second, fill fields in `mountInfo.json`, it looks like as below:
```json
{
"device": "/tmp/vhu-targets/vhost-blk-rawdisk01.sock",
"volume_type": "spdkvol",
"fs_type": "ext4",
"volume-type": "spdkvol",
"fstype": "ext4",
"metadata":"{}",
"options": []
}
```
Third, with the help of `kata-ctl direct-volume` to add block device to generate `mountinfo.json`, and run a kata container with `--mount`.
Third, with the help of `kata-ctl direct-volume` to add block device to generate `mountInfo.json`, and run a kata container with `--mount`.
```bash
$ # kata-ctl direct-volume add
$ sudo kata-ctl direct-volume add /kubelet/kata-test-vol-001/volume001 "{\"device\": \"/tmp/vhu-targets/vhost-blk-rawdisk01.sock\", \"volume_type\":\"spdkvol\", \"fs_type\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$ sudo kata-ctl direct-volume add /kubelet/kata-test-vol-001/volume001 "{\"device\": \"/tmp/vhu-targets/vhost-blk-rawdisk01.sock\", \"volume-type\":\"spdkvol\", \"fstype\": \"ext4\", \"metadata\":"{}", \"options\": []}"
$ # /kubelet/kata-test-vol-001/volume001 <==> /run/kata-containers/shared/direct-volumes/L2t1YmVsZXQva2F0YS10ZXN0LXZvbC0wMDEvdm9sdW1lMDAx
$ cat L2t1YmVsZXQva2F0YS10ZXN0LXZvbC0wMDEvdm9sdW1lMDAx/mountInfo.json
$ {"volume_type":"spdkvol","device":"/tmp/vhu-targets/vhost-blk-rawdisk01.sock","fs_type":"ext4","metadata":{},"options":[]}
$ {"volume-type":"spdkvol","device":"/tmp/vhu-targets/vhost-blk-rawdisk01.sock","fstype":"ext4","metadata":{},"options":[]}
```
As `/run/kata-containers/shared/direct-volumes/` is a fixed path , we will be able to run a kata pod with `--mount` and set

View File

@@ -17,7 +17,7 @@ You must have a running Kubernetes cluster first. If not, [install a Kubernetes
Also you should ensure that `kubectl` working correctly.
> **Note**: More information about Kubernetes integrations:
> - [Run Kata Containers with Kubernetes](run-kata-with-k8s.md)
> - [Run Kata Containers with Kubernetes](how-to-use-k8s-with-crio-and-kata.md)
> - [How to use Kata Containers and Containerd](containerd-kata.md)
> - [How to use Kata Containers and containerd with Kubernetes](how-to-use-k8s-with-containerd-and-kata.md)

View File

@@ -46,6 +46,8 @@ There are several kinds of Kata configurations and they are listed below.
| `io.katacontainers.config.hypervisor.block_device_cache_noflush` | `boolean` | Denotes whether flush requests for the device are ignored |
| `io.katacontainers.config.hypervisor.block_device_cache_set` | `boolean` | cache-related options will be set to block devices or not |
| `io.katacontainers.config.hypervisor.block_device_driver` | string | the driver to be used for block device, valid values are `virtio-blk`, `virtio-scsi`, `nvdimm`|
| `io.katacontainers.config.hypervisor.blk_logical_sector_size` | uint32 | logical sector size in bytes reported by block devices to the guest (0 = hypervisor default, must be a power of 2 between 512 and 65536) |
| `io.katacontainers.config.hypervisor.blk_physical_sector_size` | uint32 | physical sector size in bytes reported by block devices to the guest (0 = hypervisor default, must be a power of 2 between 512 and 65536) |
| `io.katacontainers.config.hypervisor.cpu_features` | `string` | Comma-separated list of CPU features to pass to the CPU (QEMU) |
| `io.katacontainers.config.hypervisor.default_max_vcpus` | uint32| the maximum number of vCPUs allocated for the VM by the hypervisor |
| `io.katacontainers.config.hypervisor.default_memory` | uint32| the memory assigned for a VM by the hypervisor in `MiB` |

View File

@@ -1,6 +1,7 @@
# Run Kata Containers with Kubernetes
# How to use Kata Containers and CRI-O with Kubernetes
## Prerequisites
This guide requires Kata Containers available on your system, install-able by following [this guide](../install/README.md).
## Install a CRI implementation
@@ -9,22 +10,16 @@ Kubernetes CRI (Container Runtime Interface) implementations allow using any
OCI-compatible runtime with Kubernetes, such as the Kata Containers runtime.
Kata Containers support both the [CRI-O](https://github.com/kubernetes-incubator/cri-o) and
[containerd](https://github.com/containerd/containerd) CRI implementations.
After choosing one CRI implementation, you must make the appropriate configuration
to ensure it integrates with Kata Containers.
Kata Containers 1.5 introduced the `shimv2` for containerd 1.2.0, reducing the components
required to spawn pods and containers, and this is the preferred way to run Kata Containers with Kubernetes ([as documented here](../how-to/how-to-use-k8s-with-containerd-and-kata.md#configure-containerd-to-use-kata-containers)).
An equivalent shim implementation for CRI-O is planned.
[containerd](https://github.com/containerd/containerd) CRI implementations. We choose `CRI-O` for our examples in this guide.
### CRI-O
For CRI-O installation instructions, refer to the [CRI-O Tutorial](https://github.com/cri-o/cri-o/blob/main/tutorial.md) page.
The following sections show how to set up the CRI-O snippet configuration file (default path: `/etc/crio/crio.conf`) for Kata.
Unless otherwise stated, all the following settings are specific to the `crio.runtime` table:
```toml
# The "crio.runtime" table contains settings pertaining to the OCI
# runtime used and options for how to set up and manage the OCI runtime.
@@ -33,16 +28,17 @@ Unless otherwise stated, all the following settings are specific to the `crio.ru
A comprehensive documentation of the configuration file can be found [here](https://github.com/cri-o/cri-o/blob/main/docs/crio.conf.5.md).
> **Note**: After any change to this file, the CRI-O daemon have to be restarted with:
>````
>$ sudo systemctl restart crio
>````
#### Kubernetes Runtime Class (CRI-O v1.12+)
The [Kubernetes Runtime Class](https://kubernetes.io/docs/concepts/containers/runtime-class/)
is the preferred way of specifying the container runtime configuration to run a Pod's containers.
To use this feature, Kata must added as a runtime handler. This can be done by
dropping a `50-kata` snippet file into `/etc/crio/crio.conf.d`, with the
content shown below:
To use this feature, Kata must added as a runtime handler. This can be done by dropping a `50-kata`
snippet file into `/etc/crio/crio.conf.d`, with the content shown below:
```toml
[crio.runtime.runtimes.kata]
@@ -52,13 +48,6 @@ content shown below:
privileged_without_host_devices = true
```
### containerd
To customize containerd to select Kata Containers runtime, follow our
"Configure containerd to use Kata Containers" internal documentation
[here](../how-to/how-to-use-k8s-with-containerd-and-kata.md#configure-containerd-to-use-kata-containers).
## Install Kubernetes
Depending on what your needs are and what you expect to do with Kubernetes,
@@ -72,25 +61,16 @@ implementation you chose, and the Kubelet service has to be updated accordingly.
### Configure for CRI-O
`/etc/systemd/system/kubelet.service.d/0-crio.conf`
```
[Service]
Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///var/run/crio/crio.sock"
```
### Configure for containerd
`/etc/systemd/system/kubelet.service.d/0-cri-containerd.conf`
```
[Service]
Environment="KUBELET_EXTRA_ARGS=--container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=unix:///run/containerd/containerd.sock"
```
For more information about containerd see the "Configure Kubelet to use containerd"
documentation [here](../how-to/how-to-use-k8s-with-containerd-and-kata.md#configure-kubelet-to-use-containerd).
## Run a Kubernetes pod with Kata Containers
After you update your Kubelet service based on the CRI implementation you
are using, reload and restart Kubelet. Then, start your cluster:
After you update your Kubelet service based on the CRI implementation you are using, reload and restart Kubelet. Then, start your cluster:
```bash
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet
@@ -98,12 +78,6 @@ $ sudo systemctl restart kubelet
# If using CRI-O
$ sudo kubeadm init --ignore-preflight-errors=all --cri-socket /var/run/crio/crio.sock --pod-network-cidr=10.244.0.0/16
# If using containerd
$ cat <<EOF | tee kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
criSocket: "/run/containerd/containerd.sock"
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
@@ -118,6 +92,7 @@ $ export KUBECONFIG=/etc/kubernetes/admin.conf
### Allow pods to run in the control-plane node
By default, the cluster will not schedule pods in the control-plane node. To enable control-plane node scheduling:
```bash
$ sudo -E kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```
@@ -161,6 +136,7 @@ If a pod has the `runtimeClassName` set to `kata`, the CRI plugin runs the pod w
```
- Create the pod
```bash
$ sudo -E kubectl apply -f nginx-kata.yaml
```
@@ -172,6 +148,7 @@ If a pod has the `runtimeClassName` set to `kata`, the CRI plugin runs the pod w
```
- Check hypervisor is running
```bash
$ ps aux | grep qemu
```

View File

@@ -315,7 +315,7 @@ $ kata-agent-ctl connect --server-address "unix:///var/run/kata/$PODID/root/kata
### compact_threshold
Control the mem-agent compaction function compact threshold.<br>
compact_threshold is the pages number.<br>
When examining the /proc/pagetypeinfo, if there's an increase in the number of movable pages of orders smaller than the compact_order compared to the amount following the previous compaction period, and this increase surpasses a certain threshold specifically, more than compact_threshold number of pages, or the number of free pages has decreased by compact_threshold since the previous compaction. Current compact run period will not do compaction because there is no enough fragmented pages to be compaction.<br>
When examining the `/proc/pagetypeinfo`, if there's an increase in the number of movable pages of orders smaller than the compact_order compared to the amount following the previous compaction period, and this increase surpasses a certain threshold specifically, more than compact_threshold number of pages, or the number of free pages has decreased by compact_threshold since the previous compaction. Current compact run period will not do compaction because there is no enough fragmented pages to be compaction.<br>
This design aims to minimize the impact of unnecessary compaction calls on system performance.<br>
Default to 1024.

View File

@@ -0,0 +1,159 @@
# How to Use Passthrough-FD IO within Runtime-rs and Dragonball
This document describes the Passthrough-FD (pass-fd) technology implemented in Kata Containers to optimize IO performance. By bypassing the intermediate proxy layers, this technology significantly reduces latency and CPU overhead for container IO streams.
## Important Limitation
Before diving into the technical details, please note the following restriction:
- Exclusive Support for Dragonball VMM: This feature is currently implemented only for Kata Containers' built-in VMM, Dragonball.
- Unsupported VMMs: Other VMMs such as QEMU, Cloud Hypervisor, and Firecracker do not support this feature at this time.
## Overview
The original IO implementation in Kata Containers suffered from an excessively long data path, leading to poor efficiency. For instance, copying a 10GB file could take as long as 10 minutes.
To address this, Kata AC member @lifupan and @frezcirno introduced a series of optimizations using passthrough-fd technology. This approach allows the VMM to directly handle file descriptors (FDs), dramatically improving IO throughput.
## Traditional IO Path
Before the introduction of Passthrough-FD, Kata's IO streams were implemented using `ttrpc + virtio-vsock`.
The data flow was as follows:
```mermaid
graph LR
subgraph Host ["Host"]
direction LR
Containerd["Containerd"]
subgraph KS ["kata-shim"]
buffer(("buffer"))
end
Vsock["vsock"]
subgraph VM ["vm"]
Agent["kata-agent"]
Container["container"]
end
end
Containerd -->|stdin| buffer
buffer --> Vsock
Vsock --> Agent
Agent -.-> Container
%% Style Rendering
style Host fill:#f0f8ff,stroke:#333,stroke-dasharray: 5 5
style VM fill:#fff9c4,stroke:#e0e0e0
style buffer fill:#c8e6c9,stroke:#ff9800,stroke-dasharray: 5 5
style Vsock fill:#bbdefb,stroke:#2196f3
style Containerd fill:#f5f5f5,stroke:#333
style Agent fill:#fff,stroke:#333
style Container fill:#fff,stroke:#333
```
The kata-shim (containerd-shim-kata-v2) on the Host opens the FIFO pipes provided by containerd via the shimv2 interface.
This results in three FDs (stdin, stdout, and stderr).
The kata-shim manages three separate threads to handle these streams.
The Bottleneck: kata-shim acts as a "middleman," maintaining three internal buffers. It must read data from the FDs into its own buffers before forwarding them via ttrpc over vsock to the destination.
This multi-threaded proxying and buffering in the shim layer introduced significant overhead.
## What is Passthrough-FD?
Passthrough-FD technology enhances the Dragonball VMM's hybrid-vsock implementation with support for recv-fd.
```mermaid
graph LR
subgraph Host ["Host"]
direction LR
Containerd["Containerd"]
Vsock["vsock"]
subgraph VM ["vm"]
Agent["kata-agent"]
Container["container"]
end
end
Containerd -->|stdin| Vsock
Vsock --> Agent
Agent -.-> Container
%% Style Rendering
style Host fill:#f0f8ff,stroke:#333,stroke-dasharray: 5 5
style VM fill:#fff9c4,stroke:#e0e0e0
style Vsock fill:#bbdefb,stroke:#2196f3
style Containerd fill:#f5f5f5,stroke:#333
style Agent fill:#fff,stroke:#333
style Container fill:#fff,stroke:#333
```
Instead of requiring an intermediate layer to read and forward data, the hybrid-vsock module can now directly receive file descriptors from the Host. This allows the system to "pass through" the host's FDs directly to the kata-agent. By eliminating the proxying logic in kata-shim, the IO stream is effectively connected directly to the guest environment.
## Technical Details
The end-to-end process follows these steps:
```mermaid
sequenceDiagram
autonumber
box rgb(220,235,255) Guest (VM)
participant Agent as kata-agent<br/>(Server)
participant VSOCK as AF_VSOCK socket<br/>(Hybrid Vsock)
end
box rgb(255,240,220) Host
participant Shim as kata-shim<br/>(Client)
participant FIFO as File or FIFO<br/>(stdin/stdout/stderr)
end
Note over Agent: Agent Initialization:<br/>listen() on passfd_listener_port
Shim->>FIFO: open() to acquire Fd<br/>(for stdin / stdout / stderr)
Shim->>VSOCK: connect() + send("passfd\n")<br/>+ send_with_fd(Fd, PortA)
Note over VSOCK,Agent: FD Transfer via Hybrid Vsock<br/>(repeat for stdin-port, stdout-port, stderr-port)
VSOCK->>Agent: forward connection + Fd + PortA
Agent->>Agent: accept() → get conn_fd + host-port<br/>save: map[host-port] = conn_fd<br/>(3 entries: stdin-port, stdout-port, stderr-port)
Shim->>Agent: create_container RPC<br/>(includes stdin-port, stdout-port, stderr-port)
Agent->>Agent: lookup map[stdin-port] → bind to container stdin<br/>lookup map[stdout-port] → bind to container stdout<br/>lookup map[stderr-port] → bind to container stderr
Agent-->>Shim: create_container RPC response (OK)
```
1. Agent Initialization: The kata-agent starts a server listening on the port specified by passfd_listener_port.
2. FD Transfer: During the container creation phase, the kata-shim sends the FDs for stdin, stdout, and stderr to the Dragonball hybrid-vsock module using the sendfd mechanism.
3. Connection Establishment: Through hybrid-vsock, these FDs connect to the server started by the agent in Step 1.
4. Identification: The agent's server calls accept() to obtain the connection FD and a corresponding host-port. It saves the connection using the host-port as a unique identifier. At this stage, the agent has three established connections (identified by stdin-port, stdout-port, and stderr-port).
5. RPC Mapping: When kata-shim invokes the create_container RPC, it includes these three port identifiers in the request.
6. Final Binding: Upon receiving the RPC, the agent retrieves the saved connections using the provided ports and binds them directly to the container's standard IO streams.
## How to enable PassthroughFD IO within Configuration?
The Passthrough-FD feature is controlled by two main parameters in the Kata configuration file:
- use_passfd_io: A boolean flag to enable or disable the Passthrough-FD IO feature.
- passfd_listener_port: Specifies the port on which the kata-agent listens for FD connections. The default value is 1027.
To enable Passthrough-FD IO, set use_passfd_io to true in the configuration file:
```toml
...
# If enabled, the runtime will attempt to use fd passthrough feature for process io.
# Note: this feature is only supported by the Dragonball hypervisor.
use_passfd_io = true
# If fd passthrough io is enabled, the runtime will attempt to use the specified port instead of the default port.
passfd_listener_port = 1027
```

View File

@@ -73,5 +73,5 @@ See below example config:
privileged_without_host_devices = true
```
- [Kata Containers with CRI-O](../how-to/run-kata-with-k8s.md#cri-o)
- [Kata Containers with CRI-O](../how-to/how-to-use-k8s-with-crio-and-kata.md#cri-o)

View File

@@ -8,7 +8,7 @@
> **Note:** `cri-tools` is only used for debugging and validation purpose, and don't use it to run production workloads.
> **Note:** For how to install and configure `cri-tools` with CRI runtimes like `containerd` or CRI-O, please also refer to other [howtos](./README.md).
> **Note:** For how to install and configure `cri-tools` with CRI runtimes like `containerd` or CRI-O, please also refer to other [how-tos](./README.md).
## Use `crictl` run Pods in Kata containers

View File

@@ -16,83 +16,38 @@ which hypervisors you may wish to investigate further.
## Types
| Hypervisor | Written in | Architectures | Type |
|-|-|-|-|
|[Cloud Hypervisor] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) |
|[Firecracker] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) |
|[QEMU] | C | all | Type 2 ([KVM]) | `configuration-qemu.toml` |
|[`Dragonball`] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) |
|[StratoVirt] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) |
| Hypervisor | Written in | Architectures | GPU Support | Intel TDX | AMD SEV-SNP |
|-|-|-|-|-|-|
|[Cloud Hypervisor](#cloud-hypervisor) | rust | `aarch64`, `x86_64` | :x: | :x: | :x: |
|[Firecracker](#firecracker) | rust | `aarch64`, `x86_64` | :x: | :x: | :x: |
|[QEMU](#qemu) | C | all | :white_check_mark: | :white_check_mark: | :white_check_mark: |
|[Dragonball](#dragonball) | rust | `aarch64`, `x86_64` | :x: | :x: | :x: |
|StratoVirt | rust | `aarch64`, `x86_64` | :x: | :x: | :x: |
## Determine currently configured hypervisor
Each Kata runtime is configured for a specific hypervisor through the runtime's configuration file. For example:
```bash
$ kata-runtime kata-env | awk -v RS= '/\[Hypervisor\]/' | grep Path
```toml title="/opt/kata/share/defaults/kata-containers/configuration.toml"
[hypervisor.qemu]
path = "/opt/kata/bin/qemu-system-x86_64"
```
## Choose a Hypervisor
```toml title="/opt/kata/share/defaults/kata-containers/configuration-clh.toml"
[hypervisor.clh]
path = "/opt/kata/bin/cloud-hypervisor"
```
The table below provides a brief summary of some of the differences between
the hypervisors:
## Cloud Hypervisor
| Hypervisor | Summary | Features | Limitations | Container Creation speed | Memory density | Use cases | Comment |
|-|-|-|-|-|-|-|-|
|[Cloud Hypervisor] | Low latency, small memory footprint, small attack surface | Minimal | | excellent | excellent | High performance modern cloud workloads | |
|[Firecracker] | Very slimline | Extremely minimal | Doesn't support all device types | excellent | excellent | Serverless / FaaS | |
|[QEMU] | Lots of features | Lots | | good | good | Good option for most users | |
|[`Dragonball`] | Built-in VMM, low CPU and memory overhead| Minimal | | excellent | excellent | Optimized for most container workloads | `out-of-the-box` Kata Containers experience |
|[StratoVirt] | Unified architecture supporting three scenarios: VM, container, and serverless | Extremely minimal(`MicroVM`) to Lots(`StandardVM`) | | excellent | excellent | Common container workloads | `StandardVM` type of StratoVirt for Kata is under development |
[Cloud Hypervisor](https://www.cloudhypervisor.org/) is a more modern hypervisor written in Rust.
For further details, see the [Virtualization in Kata Containers](design/virtualization.md) document and the official documentation for each hypervisor.
## Firecracker
## Hypervisor configuration files
[Firecracker](https://firecracker-microvm.github.io/) is a minimal and lightweight hypervisor created for the AWS Lambda product.
Since each hypervisor offers different features and options, Kata Containers
provides a separate
[configuration file](../src/runtime/README.md#configuration)
for each. The configuration files contain comments explaining which options
are available, their default values and how each setting can be used.
## QEMU
| Hypervisor | Golang runtime config file | golang runtime short name | golang runtime default | rust runtime config file | rust runtime short name | rust runtime default |
|-|-|-|-|-|-|-|
| [Cloud Hypervisor] | [`configuration-clh.toml`](../src/runtime/config/configuration-clh.toml.in) | `clh` | | [`configuration-cloud-hypervisor.toml`](../src/runtime-rs/config/configuration-cloud-hypervisor.toml.in) | `cloud-hypervisor` | |
| [Firecracker] | [`configuration-fc.toml`](../src/runtime/config/configuration-fc.toml.in) | `fc` | | | | |
| [QEMU] | [`configuration-qemu.toml`](../src/runtime/config/configuration-qemu.toml.in) | `qemu` | yes | [`configuration-qemu.toml`](../src/runtime-rs/config/configuration-qemu-runtime-rs.toml.in) | `qemu` | |
| [`Dragonball`] | | | | [`configuration-dragonball.toml`](../src/runtime-rs/config/configuration-dragonball.toml.in) | `dragonball` | yes |
| [StratoVirt] | [`configuration-stratovirt.toml`](../src/runtime/config/configuration-stratovirt.toml.in) | `stratovirt` | | | | |
QEMU is the best supported hypervisor for NVIDIA-based GPUs and for confidential computing use-cases (such as Intel TDX and AMD SEV-SNP). Runtimes that use this are normally named `kata-qemu-nvidia-gpu-*`. The Kata project focuses primarily on QEMU runtimes for GPU support.
> **Notes:**
>
> - The short names specified are used by the [`kata-manager`](../utils/README.md) tool.
> - As shown by the default columns, each runtime type has its own default hypervisor.
> - The [golang runtime](../src/runtime) is the current default runtime.
> - The [rust runtime](../src/runtime-rs), also known as `runtime-rs`,
> is the newer runtime written in the rust language.
> - See the [Configuration](../README.md#configuration) for further details.
> - The configuration file links in the table link to the "source"
> versions: these are not usable configuration files as they contain
> variables that need to be expanded:
> - The links are provided for reference only.
> - The final (installed) versions, where all variables have been
> expanded, are built from these source configuration files.
> - The pristine configuration files are usually installed in the
> `/opt/kata/share/defaults/kata-containers/` or
> `/usr/share/defaults/kata-containers/` directories.
> - Some hypervisors may have the same name for both golang and rust
> runtimes, but the file contents may differ.
> - If there is no configuration file listed for the golang or
> rust runtimes, this either means the hypervisor cannot be run with
> a particular runtime, or that a driver has not yet been made
> available for that runtime.
## Dragonball
## Switch configured hypervisor
To switch the configured hypervisor, you only need to run a single command.
See [the `kata-manager` documentation](../utils/README.md#choose-a-hypervisor) for further details.
[Cloud Hypervisor]: https://github.com/cloud-hypervisor/cloud-hypervisor
[Firecracker]: https://github.com/firecracker-microvm/firecracker
[KVM]: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
[QEMU]: http://www.qemu.org
[`Dragonball`]: https://github.com/kata-containers/kata-containers/blob/main/src/dragonball
[StratoVirt]: https://gitee.com/openeuler/stratovirt
Dragonball is a special hypervisor created by the Ant Group that runs in the same process as the Rust-based containerd shim.

94
docs/index.md Normal file
View File

@@ -0,0 +1,94 @@
# Kata Containers
Kata Containers is an open source community working to build a secure container runtime with lightweight virtual machines (VM's) that feel and perform like standard Linux containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense.
## How it Works
Kata implements the [Open Containers Runtime Specification](https://github.com/opencontainers/runtime-spec). More specifically, it implements a containerd shim that implements the expected interface for managing container lifecycles. The default containerd runtime of `runc` spawns a container like this:
```mermaid
flowchart TD
subgraph Host
containerd
runc
process[Container Process]
containerd --> runc --> process
end
```
When containerd receives a request to spawn a container, it will pull the container image down and then call out to the runc shim (usually located at `/usr/local/bin/containerd-shim-runc-v2`). runc will then create various process isolation resources like Linux namespaces (networking, PIDs, mounts etc), seccomp filters, Linux capability reductions, and then spawn the process inside of those resources. This process runs in the host kernel.
Kata spawns containers like this:
```mermaid
flowchart TD
subgraph Host
containerdOuter[containerd]
kata
containerdOuter --> kata
kata --> kataAgent
subgraph VM
kataAgent[Kata Agent]
process[Container Process]
kataAgent --> process
end
end
```
The container process spawned inside of the VM allows us to isolate the guest kernel from the host system. This is the fundamental principle of how Kata achieves its isolation boundaries.
## Example
When Kata is installed in a system, a number of artifacts are laid down. containerd's config will be modified as such:
```toml title="/etc/containerd/config.toml"
imports = ["/opt/kata/containerd/config.d/kata-deploy.toml"]
```
This file will contain configuration for various flavors of Kata runtimes. We can see the vanilla CPU runtime config here:
```toml title="/opt/kata/containerd/config.d/kata-deploy.toml"
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata-qemu]
runtime_type = "io.containerd.kata-qemu.v2"
runtime_path = "/opt/kata/bin/containerd-shim-kata-v2"
privileged_without_host_devices = true
pod_annotations = ["io.katacontainers.*"]
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.kata-qemu.options]
ConfigPath = "/opt/kata/share/defaults/kata-containers/configuration-qemu.toml"
```
Because containerd's CRI is aware of the Kata runtimes, we can spawn Kubernetes pods:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test
spec:
runtimeClassName: kata-qemu
containers:
- name: test
image: "quay.io/libpod/ubuntu:latest"
command: ["/bin/bash", "-c"]
args: ["echo hello"]
```
We can also spawn a Kata container by submitting a request to containerd like so:
<div class="annotate" markdown>
```sh
$ ctr image pull quay.io/libpod/ubuntu:latest
$ ctr run --runtime "io.containerd.kata.v2" --runtime-config-path /opt/kata/share/defaults/kata-containers/configuration-qemu.toml --rm -t "quay.io/libpod/ubuntu:latest" foo sh
# echo hello
hello
```
</div>
!!! tip
`ctr` is not aware of the CRI config in `/etc/containerd/config.toml`. This is why you must specify the `--runtime-config-path`. Additionally, the `--runtime` value is converted into a specific binary name which containerd then searches for in its `PATH`. See the [containerd docs](https://github.com/containerd/containerd/blob/release/2.2/core/runtime/v2/README.md#usage) for more details.

View File

@@ -18,6 +18,3 @@ artifacts required to run Kata Containers on Kubernetes.
* [upgrading document](../Upgrading.md)
* [developer guide](../Developer-Guide.md)
* [runtime documentation](../../src/runtime/README.md)
## Kata Containers 3.0 rust runtime installation
* [installation guide](../install/kata-containers-3.0-rust-runtime-installation-guide.md)

View File

@@ -1,116 +0,0 @@
# Kata Containers 3.0 rust runtime installation
The following is an overview of the different installation methods available.
## Prerequisites
Kata Containers 3.0 rust runtime requires nested virtualization or bare metal. Check
[hardware requirements](/src/runtime/README.md#hardware-requirements) to see if your system is capable of running Kata
Containers.
### Platform support
Kata Containers 3.0 rust runtime currently runs on 64-bit systems supporting the following
architectures:
> **Notes:**
> For other architectures, see https://github.com/kata-containers/kata-containers/issues/4320
| Architecture | Virtualization technology |
|-|-|
| `x86_64`| [Intel](https://www.intel.com) VT-x |
| `aarch64` ("`arm64`")| [ARM](https://www.arm.com) Hyp |
## Packaged installation methods
| Installation method | Description | Automatic updates | Use case | Availability
|------------------------------------------------------|----------------------------------------------------------------------------------------------|-------------------|-----------------------------------------------------------------------------------------------|----------- |
| [Using kata-deploy](#kata-deploy-installation) | The preferred way to deploy the Kata Containers distributed binaries on a Kubernetes cluster | **No!** | Best way to give it a try on kata-containers on an already up and running Kubernetes cluster. | Yes |
| [Using official distro packages](#official-packages) | Kata packages provided by Linux distributions official repositories | yes | Recommended for most users. | No |
| [Automatic](#automatic-installation) | Run a single command to install a full system | **No!** | For those wanting the latest release quickly. | No |
| [Manual](#manual-installation) | Follow a guide step-by-step to install a working system | **No!** | For those who want the latest release with more control. | No |
| [Build from source](#build-from-source-installation) | Build the software components manually | **No!** | Power users and developers only. | Yes |
### Kata Deploy Installation
Follow the [`kata-deploy`](../../tools/packaging/kata-deploy/helm-chart/README.md).
### Official packages
`ToDo`
### Automatic Installation
`ToDo`
### Manual Installation
`ToDo`
## Build from source installation
### Rust Environment Set Up
* Download `Rustup` and install `Rust`
> **Notes:**
> For Rust version, please set `RUST_VERSION` to the value of `languages.rust.meta.newest-version key` in [`versions.yaml`](../../versions.yaml) or, if `yq` is available on your system, run `export RUST_VERSION=$(yq read versions.yaml languages.rust.meta.newest-version)`.
Example for `x86_64`
```
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
$ rustup install ${RUST_VERSION}
$ rustup default ${RUST_VERSION}-x86_64-unknown-linux-gnu
```
* Musl support for fully static binary
Example for `x86_64`
```
$ rustup target add x86_64-unknown-linux-musl
```
* [Musl `libc`](http://musl.libc.org/) install
Example for musl 1.2.3
```
$ curl -O https://git.musl-libc.org/cgit/musl/snapshot/musl-1.2.3.tar.gz
$ tar vxf musl-1.2.3.tar.gz
$ cd musl-1.2.3/
$ ./configure --prefix=/usr/local/
$ make && sudo make install
```
### Install Kata 3.0 Rust Runtime Shim
```
$ git clone https://github.com/kata-containers/kata-containers.git
$ cd kata-containers/src/runtime-rs
$ make && sudo make install
```
After running the command above, the default config file `configuration.toml` will be installed under `/usr/share/defaults/kata-containers/`, the binary file `containerd-shim-kata-v2` will be installed under `/usr/local/bin/` .
### Install Shim Without Builtin Dragonball VMM
By default, runtime-rs includes the `Dragonball` VMM. To build without the built-in `Dragonball` hypervisor, use `make USE_BUILDIN_DB=false`:
```bash
$ cd kata-containers/src/runtime-rs
$ make USE_BUILDIN_DB=false
```
After building, specify the desired hypervisor during installation using `HYPERVISOR`. For example, to use `qemu` or `cloud-hypervisor`:
```
sudo make install HYPERVISOR=qemu
```
or
```
sudo make install HYPERVISOR=cloud-hypervisor
```
### Build Kata Containers Kernel
Follow the [Kernel installation guide](/tools/packaging/kernel/README.md).
### Build Kata Rootfs
Follow the [Rootfs installation guide](../../tools/osbuilder/rootfs-builder/README.md).
### Build Kata Image
Follow the [Image installation guide](../../tools/osbuilder/image-builder/README.md).
### Install Containerd
Follow the [Containerd installation guide](container-manager/containerd/containerd-install.md).

64
docs/installation.md Normal file
View File

@@ -0,0 +1,64 @@
# Installation
## Helm Chart
[helm](https://helm.sh/docs/intro/install/) can be used to install templated kubernetes manifests.
### Prerequisites
- **Kubernetes ≥ v1.22** v1.22 is the first release where the CRI v1 API
became the default and `RuntimeClass` left alpha. The chart depends on those
stable interfaces; earlier clusters need `featuregates` or CRI shims that are
out of scope.
- **Kata Release 3.12** - v3.12.0 introduced publishing the helm-chart on the
release page for easier consumption, since v3.8.0 we shipped the helm-chart
via source code in the kata-containers `GitHub` repository.
- CRIcompatible runtime (containerd or CRIO). If one wants to use the
`multiInstallSuffix` feature one needs at least **containerd-2.0** which
supports drop-in config files
- Nodes must allow loading kernel modules and installing Kata artifacts (the
chart runs privileged containers to do so)
### `helm install`
```sh
# Install directly from the official ghcr.io OCI registry
# update the VERSION X.YY.Z to your needs or just use the latest
export VERSION=$(curl -sSL https://api.github.com/repos/kata-containers/kata-containers/releases/latest | jq .tag_name | tr -d '"')
export CHART="oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy"
$ helm install kata-deploy "${CHART}" --version "${VERSION}"
# See everything you can configure
$ helm show values "${CHART}" --version "${VERSION}"
```
This installs the `kata-deploy` DaemonSet and the default Kata `RuntimeClass`
resources on your cluster.
To see what versions of the chart are available:
```sh
$ helm show chart oci://ghcr.io/kata-containers/kata-deploy-charts/kata-deploy
```
### `helm uninstall`
```sh
$ helm uninstall kata-deploy -n kube-system
```
During uninstall, Helm will report that some resources were kept due to the
resource policy (`ServiceAccount`, `ClusterRole`, `ClusterRoleBinding`). This
is **normal**. A post-delete hook Job runs after uninstall and removes those
resources so no cluster-wide `RBAC` is left behind.
## Pre-Built Release
Kata can also be installed using the pre-built releases: https://github.com/kata-containers/kata-containers/releases
This method does not have any facilities for artifact lifecycle management.

116
docs/prerequisites.md Normal file
View File

@@ -0,0 +1,116 @@
# Prerequisites
## Kubernetes
If using Kubernetes, at least version `v1.22` is recommended. This is the first release that the CRI v1 API and the `RuntimeClass` left alpha.
## containerd
Kata requires a [CRI](https://kubernetes.io/docs/concepts/containers/cri/)-compatible container runtime. containerd is commonly used for Kata. We recommend installing containerd using your platform's package distribution mechanism. We recommend at least the latest version of containerd v2.1.x.[^1]
### Debian/Ubuntu
To install on Debian-based systems:
```sh
$ apt update
$ apt install containerd
$ systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/etc/systemd/system/containerd.service; enabled; preset: enabled)
Drop-In: /etc/systemd/system/containerd.service.d
└─http-proxy.conf
Active: active (running) since Wed 2026-02-25 22:58:13 UTC; 5 days ago
Docs: https://containerd.io
Main PID: 3767885 (containerd)
Tasks: 540
Memory: 70.7G (peak: 70.8G)
CPU: 4h 9min 26.153s
CGroup: /runtime.slice/containerd.service
├─ 12694 /usr/local/bin/container
```
### Fedora/RedHat
To install on Fedora-based systems:
```
$ yum install containerd
```
??? help
Documentation assistance is requested for more specific instructions on Fedora systems.
### Pre-Built Releases
Many Linux distributions will not package the latest versions of containerd. If you find that your distribution provides very old versions of containerd, it's recommended to upgrade with the [pre-built releases](https://github.com/containerd/containerd/releases).
#### Executable
Download the latest release of containerd:
```sh
$ wget https://github.com/containerd/containerd/releases/download/v${VERSION}/containerd-${VERSION}-linux-${PLATFORM}.tar.gz
# Extract to the current directory
$ tar -xf ./containerd*.tar.gz
# Extract to root if you want it installed to its final location.
$ tar -C / -xf ./*.tar.gz
```
### Containerd Config
Containerd requires a config file at `/etc/containerd/config.toml`. This needs to be populated with a simple default config:
```sh
$ /usr/local/bin/containerd config default > /etc/containerd/config.toml
```
### Systemd Unit File
Install the systemd unit file:
```sh
$ wget -O /etc/systemd/system/containerd.service https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
```
!!! info
- You must modify the `ExecStart` line to the location of the installed containerd executable.
- containerd's `PATH` variable must allow it to find `containerd-shim-kata-v2`. You can do this by either creating a symlink from `/usr/local/bin/containerd-shim-kata-v2` to `/opt/kata/bin/containerd-shim-kata-v2` or by modifying containerd's `PATH` variable to search in `/opt/kata/bin/`. See the Environment= command in systemd.exec(5) for further details.
Reload systemd and start containerd:
```sh
$ systemctl daemon-reload
$ systemctl enable --now containerd
$ systemctl start containerd
$ systemctl status containerd
```
More details can be found on the [containerd installation docs](https://github.com/containerd/containerd/blob/main/docs/getting-started.md).
### Enable CRI
If you're using Kubernetes, you must enable the containerd Container Runtime Interface (CRI) plugin:
```sh
$ ctr plugins ls | grep cri
io.containerd.cri.v1 images - ok
io.containerd.cri.v1 runtime linux/amd64 ok
io.containerd.grpc.v1 cri - ok
```
If these are not enabled, you'll need to remove it from the `disabled_plugins` section of the containerd config.
[^1]: Kata makes use of containerd's drop-in config merging in `/etc/containerd/config.d/` which is only available starting from containerd v2. containerd v1 may work, but some Kata features will not work as expected.
## runc
The default `runc` runtime needs to be installed for non-kata containers. More details can be found at the [containerd docs](https://github.com/containerd/containerd/blob/979c80d8a5d7fc7be34102a1ada53ae5a0ff09e8/docs/RUNC.md).

9
docs/requirements.txt Normal file
View File

@@ -0,0 +1,9 @@
mkdocs-materialx==10.0.9
mkdocs-glightbox==0.4.0
mkdocs-macros-plugin==1.5.0
mkdocs-awesome-nav==3.3.0
mkdocs-open-in-new-tab==1.0.8
mkdocs-redirects==1.2.2
CairoSVG==2.9.0
pillow==12.1.1
click==8.2.1

View File

@@ -0,0 +1,56 @@
# Runtime Configuration
The containerd shims (both the Rust and Go implementations) take configuration files to control their behavior. These files are in `/opt/kata/share/defaults/kata-containers/`. An example excerpt:
```toml title="/opt/kata/share/defaults/kata-containers/configuration.toml"
[hypervisor.qemu]
path = "/opt/kata/bin/qemu-system-x86_64"
kernel = "/opt/kata/share/kata-containers/vmlinux.container"
image = "/opt/kata/share/kata-containers/kata-containers.img"
machine_type = "q35"
# rootfs filesystem type:
# - ext4 (default)
# - xfs
# - erofs
rootfs_type = "ext4"
# Enable running QEMU VMM as a non-root user.
# By default QEMU VMM run as root. When this is set to true, QEMU VMM process runs as
# a non-root random user. See documentation for the limitations of this mode.
rootless = false
# List of valid annotation names for the hypervisor
# Each member of the list is a regular expression, which is the base name
# of the annotation, e.g. "path" for io.katacontainers.config.hypervisor.path"
enable_annotations = ["enable_iommu", "virtio_fs_extra_args", "kernel_params"]
```
These files should never be modified directly. If you wish to create a modified version of these files, you may create your own [custom runtime](helm-configuration.md#custom-runtimes). For example, to modify the image path, we provide these values to helm:
```yaml title="values.yaml"
customRuntimes:
enabled: true
runtimes:
my-gpu-runtime:
baseConfig: "qemu-nvidia-gpu"
dropIn: |
[hypervisor.qemu]
image = "/path/to/custom-image.img"
runtimeClass: |
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
name: kata-my-gpu-runtime
labels:
app.kubernetes.io/managed-by: kata-deploy
handler: kata-my-gpu-runtime
overhead:
podFixed:
memory: "640Mi"
cpu: "500m"
scheduling:
nodeSelector:
katacontainers.io/kata-runtime: "true"
```

View File

@@ -175,7 +175,7 @@ specific).
##### Dragonball networking
For Dragonball, the `virtio-net` backend default is within Dragonbasll's VMM.
For Dragonball, the `virtio-net` backend default is within Dragonball's VMM.
#### virtio-vsock

View File

@@ -1,11 +1,12 @@
# Enabling NVIDIA GPU workloads using GPU passthrough with Kata Containers
This page provides:
1. A description of the components involved when running GPU workloads with
Kata Containers using the NVIDIA TEE and non-TEE GPU runtime classes.
1. An explanation of the orchestration flow on a Kubernetes node for this
scenario.
1. A deployment guide enabling to utilize these runtime classes.
1. A deployment guide to utilize these runtime classes.
The goal is to educate readers familiar with Kubernetes and Kata Containers
on NVIDIA's reference implementation which is reflected in Kata CI's build
@@ -18,58 +19,56 @@ Confidential Containers.
> **Note:**
>
> The current supported mode for enabling GPU workloads in the TEE scenario
> is single GPU passthrough (one GPU per pod) on AMD64 platforms (AMD SEV-SNP
> being the only supported TEE scenario so far with support for Intel TDX being
> on the way).
> The currently supported modes for enabling GPU workloads in the TEE
> scenario are: (1) singleGPU passthrough (one physical GPU per pod) and
> (2) multi-GPU passthrough on NVSwitch (NVLink) based HGX systems
> (for example, HGX Hopper (SXM) and HGX Blackwell / HGX B200).
## Component Overview
Before providing deployment guidance, we describe the components involved to
support running GPU workloads. We start from a top to bottom perspective
from the NVIDIA GPU operator via the Kata runtime to the components within
from the NVIDIA GPU Operator via the Kata runtime to the components within
the NVIDIA GPU Utility Virtual Machine (UVM) root filesystem.
### NVIDIA GPU Operator
A central component is the
[NVIDIA GPU operator](https://github.com/NVIDIA/gpu-operator) which can be
deployed onto your cluster as a helm chart. Installing the GPU operator
[NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) which can be
deployed onto your cluster as a helm chart. Installing the GPU Operator
delivers various operands on your nodes in the form of Kubernetes DaemonSets.
These operands are vital to support the flow of orchestrating pod manifests
using NVIDIA GPU runtime classes with GPU passthrough on your nodes. Without
getting into the details, the most important operands and their
responsibilities are:
- **nvidia-vfio-manager:** Binding discovered NVIDIA GPUs to the `vfio-pci`
driver for VFIO passthrough.
- **nvidia-vfio-manager:** Binding discovered NVIDIA GPUs and nvswitches to
the `vfio-pci` driver for VFIO passthrough.
- **nvidia-cc-manager:** Transitioning GPUs into confidential computing (CC)
and non-CC mode (see the
[NVIDIA/k8s-cc-manager](https://github.com/NVIDIA/k8s-cc-manager)
repository).
- **nvidia-kata-manager:** Creating host-side CDI specifications for GPU
passthrough, resulting in the file `/var/run/cdi/nvidia.yaml`, containing
`kind: nvidia.com/pgpu` (see the
[NVIDIA/k8s-kata-manager](https://github.com/NVIDIA/k8s-kata-manager)
repository).
- **nvidia-sandbox-device-plugin** (see the
[NVIDIA/sandbox-device-plugin](https://github.com/NVIDIA/sandbox-device-plugin)
repository):
- Creating host-side CDI specifications for GPU passthrough,
resulting in the file `/var/run/cdi/nvidia.yaml`, containing
`kind: nvidia.com/pgpu`
- Allocating GPUs during pod deployment.
- Discovering NVIDIA GPUs, their capabilities, and advertising these to
the Kubernetes control plane (allocatable resources as type
`nvidia.com/pgpu` resources will appear for the node and GPU Device IDs
will be registered with Kubelet). These GPUs can thus be allocated as
container resources in your pod manifests. See below GPU operator
container resources in your pod manifests. See below GPU Operator
deployment instructions for the use of the key `pgpu`, controlled via a
variable.
To summarize, the GPU operator manages the GPUs on each node, allowing for
To summarize, the GPU Operator manages the GPUs on each node, allowing for
simple orchestration of pod manifests using Kata Containers. Once the cluster
with GPU operator and Kata bits is up and running, the end user can schedule
with GPU Operator and Kata bits is up and running, the end user can schedule
Kata NVIDIA GPU workloads, using resource limits and the
`kata-qemu-nvidia-gpu` or `kata-qemu-nvidia-gpu-snp` runtime classes, for
example:
`kata-qemu-nvidia-gpu`, `kata-qemu-nvidia-gpu-tdx` or
`kata-qemu-nvidia-gpu-snp` runtime classes, for example:
```yaml
apiVersion: v1
@@ -213,7 +212,7 @@ API and kernel drivers, interacting with the pass-through GPU device.
An additional step is exercised in our CI samples: when using images from an
authenticated registry, the guest-pull mechanism triggers attestation using
trustee's Key Broker Service (KBS) for secure release of the NGC API
Trustee's Key Broker Service (KBS) for secure release of the NGC API
authentication key used to access the NVCR container registry. As part of
this, the attestation agent exercises composite attestation and transitions
the GPU into `Ready` state (without this, the GPU has to explicitly be
@@ -232,24 +231,40 @@ NVIDIA GPU CI validation jobs. Note that, this setup:
- uses the genpolicy tool to attach Kata agent security policies to the pod
manifest
- has dedicated (composite) attestation tests, a CUDA vectorAdd test, and a
NIM/RA test sample with secure API key release
NIM/RA test sample with secure API key release using sealed secrets.
A similar deployment guide and scenario description can be found in NVIDIA resources
under
[Early Access: NVIDIA GPU Operator with Confidential Containers based on Kata](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/confidential-containers.html).
[NVIDIA Confidential Containers Overview (Early Access)](https://docs.nvidia.com/datacenter/cloud-native/confidential-containers/latest/overview.html).
### Feature Set
The NVIDIA stack for Kata Containers leverages features for the confidential
computing scenario from both the confidential containers open source project
and from the Kata Containers source tree, such as:
- composite attestation using Trustee and the NVIDIA Remote Attestation
Service NRAS
- generating kata agent security policies using the genpolicy tool
- use of signed sealed secrets
- access to authenticated registries for container image guest-pull
- container image signature verification and encrypted container images
- ephemeral container data and image layer storage
### Requirements
The requirements for the TEE scenario are:
- Ubuntu 25.10 as host OS
- CPU with AMD SEV-SNP support with proper BIOS/UEFI version and settings
- CPU with AMD SEV-SNP or Intel TDX support with proper BIOS/UEFI version
and settings
- CC-capable Hopper/Blackwell GPU with proper VBIOS version.
BIOS and VBIOS configuration is out of scope for this guide. Other resources,
such as the documentation found on the
[NVIDIA Trusted Computing Solutions](https://docs.nvidia.com/nvtrust/index.html)
page and the above linked NVIDIA documentation, provide guidance on
page, on the
[Secure AI Compatibility Matrix](https://www.nvidia.com/en-us/data-center/solutions/confidential-computing/secure-ai-compatibility-matrix/)
page, and on the above linked NVIDIA documentation, provide guidance on
selecting proper hardware and on properly configuring its firmware and OS.
### Installation
@@ -257,12 +272,16 @@ selecting proper hardware and on properly configuring its firmware and OS.
#### Containerd and Kubernetes
First, set up your Kubernetes cluster. For instance, in Kata CI, our NVIDIA
jobs use a single-node vanilla Kubernetes cluster with a 2.x containerd
version and Kata's current supported Kubernetes version. We set this cluster
up using the `deploy_k8s` function from `tests/integration/kubernetes/gha-run.sh`
as follows:
jobs use a single-node vanilla Kubernetes cluster with a 2.1 containerd
version and Kata's current supported Kubernetes version. This cluster is
being set up using the `deploy_k8s` function from the script file
`tests/integration/kubernetes/gha-run.sh`. If you intend to run this script,
follow these steps, and make sure you have `yq` and `helm` installed. Note
that, these scripts query the GitHub API, so creating and declaring a
personal access token prevents rate limiting issues.
You can execute the function as follows:
```bash
$ export GH_TOKEN="<your-gh-pat>"
$ export KUBERNETES="vanilla"
$ export CONTAINER_ENGINE="containerd"
$ export CONTAINER_ENGINE_VERSION="v2.1"
@@ -276,8 +295,11 @@ $ deploy_k8s
> `runtimeRequestTimeout` timeout value than the two minute default timeout.
> Using the guest-pull mechanism, pulling large images may take a significant
> amount of time and may delay container start, possibly leading your Kubelet
> to de-allocate your pod before it transitions from the *container created*
> to the *container running* state.
> to de-allocate your pod before it transitions from the *container creating*
> to the *container running* state. The NVIDIA shim configurations use a
> `create_container_timeout` of 1200s, which is the equivalent value on shim
> side, controlling the time the shim allows for a container to remain in
> *container creating* state.
> **Note:**
>
@@ -291,7 +313,7 @@ $ deploy_k8s
#### GPU Operator
Assuming you have the helm tools installed, deploy the latest version of the
GPU Operator as a helm chart (minimum version: `v25.10.0`):
GPU Operator as a helm chart (minimum version: `v26.3.0`):
```bash
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
@@ -300,33 +322,27 @@ $ helm install --wait --generate-name \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set sandboxWorkloads.defaultWorkload=vm-passthrough \
--set kataManager.enabled=true \
--set kataManager.config.runtimeClasses=null \
--set kataManager.repository=nvcr.io/nvidia/cloud-native \
--set kataManager.image=k8s-kata-manager \
--set kataManager.version=v0.2.4 \
--set ccManager.enabled=true \
--set ccManager.defaultMode=on \
--set ccManager.repository=nvcr.io/nvidia/cloud-native \
--set ccManager.image=k8s-cc-manager \
--set ccManager.version=v0.2.0 \
--set sandboxDevicePlugin.repository=nvcr.io/nvidia/cloud-native \
--set sandboxDevicePlugin.image=nvidia-sandbox-device-plugin \
--set sandboxDevicePlugin.version=v0.0.1 \
--set 'sandboxDevicePlugin.env[0].name=P_GPU_ALIAS' \
--set 'sandboxDevicePlugin.env[0].value=pgpu' \
--set sandboxWorkloads.mode=kata \
--set nfd.enabled=true \
--set nfd.nodefeaturerules=true
```
> **Note:**
>
> For heterogeneous clusters with different GPU types, you can omit
> the `P_GPU_ALIAS` environment variable lines. This will cause the sandbox
> device plugin to create GPU model-specific resource types (e.g.,
> `nvidia.com/GH100_H100L_94GB`) instead of the generic `nvidia.com/pgpu`,
> which in turn can be used by pods through respective resource limits.
> For simplicity, this guide uses the generic alias.
> For heterogeneous clusters with different GPU types, you can specify an
> empty `P_GPU_ALIAS` environment variable for the sandbox device plugin:
> `- --set 'sandboxDevicePlugin.env[0].name=P_GPU_ALIAS' \`
> `- --set 'sandboxDevicePlugin.env[0].value=""' \`
> This will cause the sandbox device plugin to create GPU model-specific
> resource types (e.g., `nvidia.com/GH100_H100L_94GB`) instead of the
> default `pgpu` type, which usually results in advertising a resource of
> type `nvidia.com/pgpu`
> The exposed device resource types can be used for pods by specifying
> respective resource limits.
> Your node's nvswitches are exposed as resources of type
> `nvidia.com/nvswitch` by default. Using the variable `NVSWITCH_ALIAS`
> allows to control the advertising behavior similar to the `P_GPU_ALIAS`
> variable.
> **Note:**
>
@@ -351,8 +367,7 @@ $ helm install kata-deploy \
--create-namespace \
-f "https://raw.githubusercontent.com/kata-containers/kata-containers/refs/tags/${VERSION}/tools/packaging/kata-deploy/helm-chart/kata-deploy/try-kata-nvidia-gpu.values.yaml" \
--set nfd.enabled=false \
--set shims.qemu-nvidia-gpu-tdx.enabled=false \
--wait --timeout 10m --atomic \
--wait --timeout 10m \
"${CHART}" --version "${VERSION}"
```
@@ -382,31 +397,22 @@ mode which requires entering a licensing agreement with NVIDIA, see the
### Cluster validation and preparation
If you did not use the `sandboxWorkloads.defaultWorkload=vm-passthrough`
parameter during GPU operator deployment, label your nodes for GPU VM
parameter during GPU Operator deployment, label your nodes for GPU VM
passthrough, for the example of using all nodes for GPU passthrough, run:
```bash
$ kubectl label nodes --all nvidia.com/gpu.workload.config=vm-passthrough --overwrite
```
Check if the `nvidia-cc-manager` pod is running if you intend to run GPU TEE
scenarios. If not, you need to manually label the node as CC capable. Current
GPU Operator node feature rules do not yet recognize all CC capable GPU PCI
IDs. Run the following command:
```bash
$ kubectl label nodes --all nvidia.com/cc.capable=true
```
After this, assure the `nvidia-cc-manager` pod is running. With the suggested
parameters for GPU Operator deployment, the `nvidia-cc-manager` will
automatically transition the GPU into CC mode.
With the suggested parameters for GPU Operator deployment, the
`nvidia-cc-manager` operand will automatically transition the GPU into CC
mode.
After deployment, you can transition your node(s) to the desired CC state,
using either the `on` or `off` value, depending on your scenario. For the
non-CC scenario, transition to the `off` state via:
using either the `on`, `ppcie`, or `off` value, depending on your scenario.
For the non-CC scenario, transition to the `off` state via:
`kubectl label nodes --all nvidia.com/cc.mode=off` and wait until all pods
are back running. When an actual change is exercised, various GPU operator
are back running. When an actual change is exercised, various GPU Operator
operands will be restarted.
Ensure all pods are running:
@@ -425,9 +431,10 @@ $ lspci -nnk -d 10de:
### Run the CUDA vectorAdd sample
Create the following file:
Create the pod manifest with:
```yaml
```bash
$ cat > cuda-vectoradd-kata.yaml.in << 'EOF'
apiVersion: v1
kind: Pod
metadata:
@@ -445,6 +452,7 @@ spec:
limits:
nvidia.com/pgpu: "1"
memory: 16Gi
EOF
```
Depending on your scenario and on the CC state, export your desired runtime
@@ -477,6 +485,17 @@ To stop the pod, run: `kubectl delete pod cuda-vectoradd-kata`.
### Next steps
#### Use multi-GPU passthrough
If you have machines supporting multi-GPU passthrough, use a pod deployment
manifest which uses 8 pgpu and 4 nvswitch resources.
On the NVIDIA Hopper architecture multi-GPU passthrough uses protected PCIe
(PPCIE) which claims exclusive use of the nvswitches for a single CVM. In
this case, transition your relevant node(s) GPU mode to `ppcie` mode.
The NVIDIA Blackwell architecture uses NVLink encryption which places the
switches outside of the Trusted Computing Base (TCB) and so does not
require a separate switch setting.
#### Transition between CC and non-CC mode
Use the previously described node labeling approach to transition between
@@ -492,7 +511,7 @@ and a basic NIM/RAG deployment. Running CI tests for the TEE GPU scenario
requires KBS to be deployed (except for the CUDA vectorAdd test). The best
place to get started running these tests locally is to look into our
[NVIDIA CI workflow manifest](https://github.com/kata-containers/kata-containers/blob/main/.github/workflows/run-k8s-tests-on-nvidia-gpu.yaml)
and into the underling
and into the underlying
[run_kubernetes_nv_tests.sh](https://github.com/kata-containers/kata-containers/blob/main/tests/integration/kubernetes/run_kubernetes_nv_tests.sh)
script. For example, to run the CUDA vectorAdd scenario against the TEE GPU
runtime class use the following commands:
@@ -547,6 +566,22 @@ With GPU passthrough being supported by the
you can use the tool to create a Kata agent security policy. Our CI deploys
all sample pod manifests with a Kata agent security policy.
Note that, using containerd 2.1 in upstream's CI, we use the following
modification to the genpolicy default settings:
```bash
[
{
"op": "replace",
"path": "/kata_config/oci_version",
"value": "1.2.1"
}
]
```
This modification is applied via the genpolicy drop-in configuration file
`src\tools\genpolicy\drop-in-examples\20-oci-1.2.1-drop-in.json`.
When using a newer containerd version, such as containerd 2.2, the OCI
version field needs to be adjusted to "1.3.0", for instance.
#### Deploy pods using your own containers and manifests
You can author pod manifests leveraging your own containers, for instance,
@@ -564,6 +599,3 @@ following annotation in the manifest:
>
> - musl-based container images (e.g., using Alpine), or distro-less
> containers are not supported.
> - for the TEE scenario, only single-GPU passthrough per pod is supported,
> so your pod resource limit must be: `nvidia.com/pgpu: "1"` (on a system
> with multiple GPUs, you can thus pass through one GPU per pod).

91
mkdocs.yaml Normal file
View File

@@ -0,0 +1,91 @@
site_name: "Kata Containers Docs"
site_description: "Developer and user documentation for the Kata Containers project."
site_author: "Kata Containers Community"
repo_url: "https://github.com/kata-containers/kata-containers"
site_url: "https://kata-containers.github.io/kata-containers"
edit_uri: "edit/main/docs/"
repo_name: kata-containers
theme:
name: materialx
favicon: "assets/images/favicon.svg"
logo: "assets/images/favicon.svg"
topbar_style: glass
palette:
- media: "(prefers-color-scheme)"
toggle:
icon: material/brightness-auto
name: Switch to light mode
- media: "(prefers-color-scheme: light)"
scheme: default
primary: blue
accent: light blue
toggle:
icon: material/weather-sunny
name: Switch to dark mode
- media: "(prefers-color-scheme: dark)"
scheme: slate
primary: cyan
accent: cyan
toggle:
icon: material/brightness-4
name: Switch to system preference
features:
- content.action.edit
- content.action.view
- content.code.annotate
- content.code.copy
- content.code.select
- content.footnote.tooltips
- content.tabs.link
- content.tooltips
- navigation.expand
- navigation.indexes
- navigation.path
- navigation.sections
- navigation.tabs
- navigation.tracking
- navigation.top
- navigation.instant
- navigation.instant.prefetch
- navigation.instant.progress
- toc.follow
markdown_extensions:
- abbr
- admonition
- attr_list
- def_list
- footnotes
- md_in_html
- pymdownx.arithmatex:
generic: true
- pymdownx.emoji:
emoji_index: !!python/name:material.extensions.emoji.twemoji
emoji_generator: !!python/name:material.extensions.emoji.to_svg
- pymdownx.details
- pymdownx.highlight:
anchor_linenums: true
line_spans: __span
pygments_lang_class: true
auto_title: true
- pymdownx.keys
- pymdownx.magiclink
- pymdownx.superfences:
custom_fences:
- name: mermaid
class: mermaid
format: !!python/name:pymdownx.superfences.fence_code_format
- pymdownx.inlinehilite
- pymdownx.tabbed:
alternate_style: true
- pymdownx.tilde
- pymdownx.caret
- pymdownx.mark
- toc:
permalink: true
plugins:
- search
- awesome-nav

View File

@@ -1,3 +1,3 @@
[toolchain]
# Keep in sync with versions.yaml
channel = "1.91"
channel = "1.92"

1830
src/agent/Cargo.lock generated

File diff suppressed because it is too large Load Diff

View File

@@ -1,103 +1,3 @@
[workspace]
members = ["rustjail", "policy", "vsock-exporter"]
[workspace.package]
authors = ["The Kata Containers community <kata-dev@lists.katacontainers.io>"]
edition = "2018"
license = "Apache-2.0"
rust-version = "1.88.0"
[workspace.dependencies]
oci-spec = { version = "0.8.1", features = ["runtime"] }
lazy_static = "1.3.0"
ttrpc = { version = "0.8.4", features = ["async"], default-features = false }
protobuf = "3.7.2"
libc = "0.2.94"
# Notes:
# - Needs to stay in sync with libs
# - Upgrading to 0.27+ will require code changes (see #11842)
nix = "0.26.4"
capctl = "0.2.0"
scan_fmt = "0.2.6"
scopeguard = "1.0.0"
thiserror = "1.0.26"
regex = "1.10.5"
serial_test = "0.10.0"
url = "2.5.0"
derivative = "2.2.0"
const_format = "0.2.30"
# Async helpers
async-trait = "0.1.50"
async-recursion = "0.3.2"
futures = "0.3.30"
# Async runtime
tokio = { version = "1.46.1", features = ["full"] }
tokio-vsock = "0.3.4"
netlink-sys = { version = "0.7.0", features = ["tokio_socket"] }
rtnetlink = "0.14.0"
netlink-packet-route = "0.19.0"
netlink-packet-core = "0.7.0"
ipnetwork = "0.17.0"
slog = "2.5.2"
slog-scope = "4.1.2"
slog-term = "2.9.0"
# Redirect ttrpc log calls
slog-stdlog = "4.0.0"
log = "0.4.11"
cfg-if = "1.0.0"
prometheus = { version = "0.14.0", features = ["process"] }
procfs = "0.12.0"
anyhow = "1"
cgroups = { package = "cgroups-rs", git = "https://github.com/kata-containers/cgroups-rs", rev = "v0.3.5" }
# Tracing
tracing = "0.1.41"
tracing-subscriber = "0.2.18"
tracing-opentelemetry = "0.13.0"
opentelemetry = { version = "0.14.0", features = ["rt-tokio-current-thread"] }
# Configuration
serde = { version = "1.0.129", features = ["derive"] }
serde_json = "1.0.39"
toml = "0.5.8"
clap = { version = "4.5.40", features = ["derive"] }
strum = "0.26.2"
strum_macros = "0.26.2"
tempfile = "3.19.1"
which = "4.3.0"
rstest = "0.18.0"
async-std = { version = "1.12.0", features = ["attributes"] }
# Local dependencies
kata-agent-policy = { path = "policy" }
rustjail = { path = "rustjail" }
vsock-exporter = { path = "vsock-exporter" }
mem-agent = { path = "../libs/mem-agent" }
kata-sys-util = { path = "../libs/kata-sys-util" }
kata-types = { path = "../libs/kata-types", features = ["safe-path"] }
# Note: this crate sets the slog 'max_*' features which allows the log level
# to be modified at runtime.
logging = { path = "../libs/logging" }
protocols = { path = "../libs/protocols" }
runtime-spec = { path = "../libs/runtime-spec" }
safe-path = { path = "../libs/safe-path" }
test-utils = { path = "../libs/test-utils" }
[package]
name = "kata-agent"
version = "0.1.0"
@@ -157,7 +57,8 @@ cgroups.workspace = true
# Tracing
tracing.workspace = true
tracing-subscriber.workspace = true
tracing-opentelemetry.workspace = true
# TODO: bump tracing-opentelemetry to sync with version in workspace
tracing-opentelemetry = "0.17.0"
opentelemetry.workspace = true
# Configuration
@@ -195,7 +96,6 @@ pv_core = { git = "https://github.com/ibm-s390-linux/s390-tools", rev = "4942504
tempfile.workspace = true
which.workspace = true
rstest.workspace = true
async-std.workspace = true
test-utils.workspace = true
@@ -207,7 +107,3 @@ seccomp = ["rustjail/seccomp"]
standard-oci-runtime = ["rustjail/standard-oci-runtime"]
agent-policy = ["kata-agent-policy"]
init-data = []
[[bin]]
name = "kata-agent"
path = "src/main.rs"

View File

@@ -63,7 +63,7 @@ ifneq ($(EXTRA_RUSTFEATURES),)
override EXTRA_RUSTFEATURES := --features "$(EXTRA_RUSTFEATURES)"
endif
TARGET_PATH = target/$(TRIPLE)/$(BUILD_TYPE)/$(TARGET)
TARGET_PATH = ../../target/$(TRIPLE)/$(BUILD_TYPE)/$(TARGET)
##VAR DESTDIR=<path> is a directory prepended to each installed target file
DESTDIR ?=
@@ -153,7 +153,7 @@ vendor:
#TARGET test: run cargo tests
test: $(GENERATED_FILES)
@RUST_LIB_BACKTRACE=0 RUST_BACKTRACE=1 cargo test --all --target $(TRIPLE) $(EXTRA_RUSTFEATURES) -- --nocapture
@RUST_LIB_BACKTRACE=0 RUST_BACKTRACE=1 cargo test -p kata-agent --target $(TRIPLE) $(EXTRA_RUSTFEATURES) -- --nocapture
##TARGET check: run test
check: $(GENERATED_FILES) standard_rust_check

View File

@@ -89,7 +89,7 @@ pub fn baremount(
let destination_str = destination.to_string_lossy();
if let Ok(m) = get_linux_mount_info(destination_str.deref()) {
if m.fs_type == fs_type && !flags.contains(MsFlags::MS_REMOUNT) {
slog_info!(logger, "{source:?} is already mounted at {destination:?}");
slog::info!(logger, "{source:?} is already mounted at {destination:?}");
return Ok(());
}
}

View File

@@ -110,8 +110,10 @@ impl Namespace {
unshare(cf)?;
if ns_type == NamespaceType::Uts && hostname.is_some() {
nix::unistd::sethostname(hostname.unwrap())?;
if ns_type == NamespaceType::Uts {
if let Some(host) = hostname {
nix::unistd::sethostname(host)?;
}
}
// Bind mount the new namespace from the current thread onto the mount point to persist it.

View File

@@ -2317,8 +2317,14 @@ async fn cdh_handler_trusted_storage(oci: &mut Spec) -> Result<()> {
for specdev in devices.iter() {
if specdev.path().as_path().to_str() == Some(TRUSTED_IMAGE_STORAGE_DEVICE) {
let dev_major_minor = format!("{}:{}", specdev.major(), specdev.minor());
cdh_secure_mount("BlockDevice", &dev_major_minor, "LUKS", KATA_IMAGE_WORK_DIR)
.await?;
cdh_secure_mount(
"block-device",
&dev_major_minor,
"luks2",
KATA_IMAGE_WORK_DIR,
"-E lazy_journal_init",
)
.await?;
break;
}
}
@@ -2331,6 +2337,7 @@ pub(crate) async fn cdh_secure_mount(
device_id: &str,
encrypt_type: &str,
mount_point: &str,
mkfs_opts: &str,
) -> Result<()> {
if !confidential_data_hub::is_cdh_client_initialized() {
return Ok(());
@@ -2340,19 +2347,31 @@ pub(crate) async fn cdh_secure_mount(
info!(
sl(),
"cdh_secure_mount: device_type {}, device_id {}, encrypt_type {}, integrity {}",
"cdh_secure_mount: device_type {}, device_id {}, encrypt_type {}, integrity {}, mkfs_opts {}",
device_type,
device_id,
encrypt_type,
integrity
integrity,
mkfs_opts
);
let options = std::collections::HashMap::from([
("deviceId".to_string(), device_id.to_string()),
("encryptType".to_string(), encrypt_type.to_string()),
("sourceType".to_string(), "empty".to_string()),
("targetType".to_string(), "fileSystem".to_string()),
("filesystemType".to_string(), "ext4".to_string()),
("mkfsOpts".to_string(), mkfs_opts.to_string()),
("encryptionType".to_string(), encrypt_type.to_string()),
("dataIntegrity".to_string(), integrity),
]);
std::fs::create_dir_all(mount_point).inspect_err(|e| {
error!(
sl(),
"Failed to create mount point directory {}: {:?}", mount_point, e
);
})?;
confidential_data_hub::secure_mount(device_type, &options, vec![], mount_point).await?;
Ok(())

View File

@@ -59,7 +59,14 @@ async fn handle_block_storage(
.contains(&"encryption_key=ephemeral".to_string());
if has_ephemeral_encryption {
crate::rpc::cdh_secure_mount("BlockDevice", dev_num, "LUKS", &storage.mount_point).await?;
crate::rpc::cdh_secure_mount(
"block-device",
dev_num,
"luks2",
&storage.mount_point,
"-O ^has_journal -m 0 -i 163840 -I 128",
)
.await?;
set_ownership(logger, storage)?;
new_device(storage.mount_point.clone())
} else {

View File

@@ -5,7 +5,8 @@
use anyhow::Result;
use opentelemetry::sdk::propagation::TraceContextPropagator;
use opentelemetry::{global, sdk::trace::Config, trace::TracerProvider};
use opentelemetry::trace::TracerProvider;
use opentelemetry::{global, sdk::trace::Config};
use slog::{info, o, Logger};
use std::collections::HashMap;
use tracing_opentelemetry::OpenTelemetryLayer;
@@ -23,15 +24,12 @@ pub fn setup_tracing(name: &'static str, logger: &Logger) -> Result<()> {
let config = Config::default();
let builder = opentelemetry::sdk::trace::TracerProvider::builder()
.with_batch_exporter(exporter, opentelemetry::runtime::TokioCurrentThread)
.with_batch_exporter(exporter, opentelemetry::runtime::Tokio)
.with_config(config);
let provider = builder.build();
// We don't need a versioned tracer.
let version = None;
let tracer = provider.get_tracer(name, version);
let tracer = provider.tracer(name);
let _global_provider = global::set_tracer_provider(provider);

View File

@@ -10,7 +10,7 @@ libc.workspace = true
thiserror.workspace = true
opentelemetry = { workspace = true, features = ["serialize"] }
tokio-vsock.workspace = true
bincode = "1.3.3"
serde_json = "1.0"
byteorder = "1.4.3"
slog = { workspace = true, features = [
"dynamic-keys",

View File

@@ -58,7 +58,7 @@ pub enum Error {
#[error("connection error: {0}")]
ConnectionError(String),
#[error("serialisation error: {0}")]
SerialisationError(#[from] bincode::Error),
SerialisationError(#[from] serde_json::Error),
#[error("I/O error: {0}")]
IOError(#[from] std::io::Error),
}
@@ -81,8 +81,7 @@ async fn write_span(
let mut writer = writer.lock().await;
let encoded_payload: Vec<u8> =
bincode::serialize(&span).map_err(|e| make_io_error(e.to_string()))?;
serde_json::to_vec(span).map_err(|e| make_io_error(e.to_string()))?;
let payload_len: u64 = encoded_payload.len() as u64;
let mut payload_len_as_bytes: [u8; HEADER_SIZE_BYTES as usize] =

View File

@@ -50,6 +50,7 @@ vm-memory = { workspace = true, features = ["backend-mmap"] }
crossbeam-channel = "0.5.6"
vfio-bindings = { workspace = true, optional = true }
vfio-ioctls = { workspace = true, optional = true }
kata-sys-util = { path = "../libs/kata-sys-util" }
[dev-dependencies]
slog-async = "2.7.0"

View File

@@ -10,10 +10,10 @@ This repository contains the following submodules:
| Name | Arch| Description |
| --- | --- | --- |
| [`bootparam`](src/x86_64/bootparam.rs) | x86_64 | Magic addresses externally used to lay out x86_64 VMs |
| [fdt](src/aarch64/fdt.rs) | aarch64| Create FDT for Aarch64 systems |
| [layout](src/x86_64/layout.rs) | x86_64 | x86_64 layout constants |
| [layout](src/aarch64/layout.rs/) | aarch64 | aarch64 layout constants |
| [mptable](src/x86_64/mptable.rs) | x86_64 | MP Table configurations used for defining VM boot status |
| [`fdt`](src/aarch64/fdt.rs) | aarch64| Create FDT for Aarch64 systems |
| [`layout`](src/x86_64/layout.rs) | x86_64 | x86_64 layout constants |
| [`layout`](src/aarch64/layout.rs/) | aarch64 | aarch64 layout constants |
| [`mptable`](src/x86_64/mptable.rs) | x86_64 | MP Table configurations used for defining VM boot status |
## Acknowledgement

View File

@@ -3,7 +3,7 @@
This crate is a collection of modules that provides helpers and utilities to create a TDX Dragonball VM.
Currently this crate involves:
- tdx-ioctls
- `tdx-ioctls`
## Acknowledgement

View File

@@ -44,6 +44,7 @@ vm-memory = { workspace = true, features = ["backend-mmap"] }
sendfd = "0.4.3"
vhost-rs = { version = "0.15.0", package = "vhost", optional = true }
timerfd = "1.0"
kata-sys-util = { workspace = true}
[dev-dependencies]
vm-memory = { workspace = true, features = ["backend-mmap", "backend-atomic"] }

View File

@@ -2,6 +2,7 @@
//
// SPDX-License-Identifier: Apache-2.0 AND BSD-3-Clause
use kata_sys_util::netns::NetnsGuard;
use std::any::Any;
use std::collections::HashMap;
use std::ffi::CString;
@@ -454,6 +455,11 @@ impl<AS: GuestAddressSpace> VirtioFs<AS> {
prefetch_list_path: Option<String>,
) -> FsResult<()> {
debug!("http_server rafs");
// We need to make sure the nydus worker thread in the runD main process's network namespace
// instead of the vmm thread's netns, which wouldn't access the host network.
let _netns_guard =
NetnsGuard::new("/proc/self/ns/net").map_err(|e| FsError::BackendFs(e.to_string()))?;
let file = Path::new(&source);
let (mut rafs, rafs_cfg) = match config.as_ref() {
Some(cfg) => {

View File

@@ -1,13 +1,114 @@
The `src/libs` directory hosts library crates which may be shared by multiple Kata Containers components
or published to [`crates.io`](https://crates.io/index.html).
# Kata Containers Library Crates
### Library Crates
Currently it provides following library crates:
The `src/libs` directory hosts library crates shared by multiple Kata Containers components. These libraries provide common utilities, data types, and protocol definitions to facilitate development and maintain consistency across the project.
## Library Crates
| Library | Description |
|-|-|
| [logging](logging/) | Facilities to setup logging subsystem based on slog. |
| [system utilities](kata-sys-util/) | Collection of facilities and helpers to access system services. |
| [types](kata-types/) | Collection of constants and data types shared by multiple Kata Containers components. |
| [safe-path](safe-path/) | Utilities to safely resolve filesystem paths. |
| [test utilities](test-utils/) | Utilities to share test code. |
|---------|-------------|
| [kata-types](kata-types/) | Constants, data types, and configuration structures shared by Kata Containers components |
| [kata-sys-util](kata-sys-util/) | System utilities: CPU, device, filesystem, hooks, K8s, mount, netns, NUMA, PCI, protection, spec validation |
| [protocols](protocols/) | ttrpc protocol definitions for agent, health, remote, CSI, OCI, confidential data hub |
| [runtime-spec](runtime-spec/) | OCI runtime spec data structures and constants |
| [shim-interface](shim-interface/) | Shim management interface with RESTful API over Unix domain socket |
| [logging](logging/) | Slog-based logging with JSON output and systemd journal support |
| [safe-path](safe-path/) | Safe path resolution to prevent symlink and TOCTOU attacks |
| [mem-agent](mem-agent/) | Memory management agent: memcg, compact, PSI monitoring |
| [test-utils](test-utils/) | Test macros for root/non-root privileges and KVM accessibility |
## Details
### kata-types
Core types and configurations including:
- Annotations for CRI-containerd, CRI-O, dockershim
- Hypervisor configurations (QEMU, Cloud Hypervisor, Firecracker, Dragonball)
- Agent and runtime configurations
- Kubernetes-specific utilities
### kata-sys-util
System-level utilities:
- `cpu`: CPU information and affinity
- `device`: Device management
- `fs`: Filesystem operations
- `hooks`: Hook execution
- `k8s`: Kubernetes utilities
- `mount`: Mount operations
- `netns`: Network namespace handling
- `numa`: NUMA topology
- `pcilibs`: PCI device access
- `protection`: Hardware protection features
- `spec`: OCI spec loading
- `validate`: Input validation
### protocols
Generated ttrpc protocol bindings:
- `agent`: Kata agent API
- `health`: Health check service
- `remote`: Remote hypervisor API
- `csi`: Container storage interface
- `oci`: OCI specifications
- `confidential_data_hub`: Confidential computing support
Features: `async` for async ttrpc, `with-serde` for serde support.
### runtime-spec
OCI runtime specification types:
- `ContainerState`: Creating, Created, Running, Stopped, Paused
- `State`: Container state with version, id, status, pid, bundle, annotations
- Namespace constants: pid, network, mount, ipc, user, uts, cgroup
### shim-interface
Shim management service interface:
- RESTful API over Unix domain socket (`/run/kata/<sid>/shim-monitor.sock`)
- `MgmtClient` for HTTP requests to shim management server
- Sandbox ID resolution with prefix matching
### logging
Slog-based logging framework:
- JSON output to file or stdout
- systemd journal support
- Runtime log level filtering per component/subsystem
- Async drain for thread safety
### safe-path
Secure filesystem path handling:
- `scoped_join()`: Safely join paths under a root directory
- `scoped_resolve()`: Resolve paths constrained by root
- `PinnedPathBuf`: TOCTOU-safe path reference
- `ScopedDirBuilder`: Safe directory creation
### mem-agent
Memory management for containers:
- `memcg`: Memory cgroup configuration and monitoring
- `compact`: Memory compaction control
- `psi`: Pressure stall information monitoring
- Async runtime with configurable policies
### test-utils
Testing utilities:
- `skip_if_root!`: Skip test if running as root
- `skip_if_not_root!`: Skip test if not running as root
- `skip_if_kvm_unaccessable!`: Skip test if KVM is unavailable
- `assert_result!`: Assert expected vs actual results
## License
All crates are licensed under Apache-2.0.

View File

@@ -1,16 +1,100 @@
# `kata-sys-util`
This crate is a collection of utilities and helpers for
[Kata Containers](https://github.com/kata-containers/kata-containers/) components to access system services.
System utilities and helpers for [Kata Containers](https://github.com/kata-containers/kata-containers/) components to access Linux system services.
It provides safe wrappers over system services, such as:
- file systems
- mount
- NUMA
## Overview
## Support
This crate provides safe wrappers and utility functions for interacting with various Linux system services and kernel interfaces. It is designed specifically for the Kata Containers ecosystem.
## Features
### File System Operations (`fs`)
- Path canonicalization and basename extraction
- Filesystem type detection (FUSE, OverlayFS)
- Symlink detection
- Reflink copy with fallback to regular copy
### Mount Operations (`mount`)
- Bind mount and remount operations
- Mount propagation type management (SHARED, PRIVATE, SLAVE, UNBINDABLE)
- Overlay filesystem mount option compression
- Safe mount destination creation
- Umount with timeout support
- `/proc/mounts` parsing utilities
### CPU Utilities (`cpu`)
- CPU information parsing from `/proc/cpuinfo`
- CPU flags detection and validation
- Architecture-specific support (x86_64, s390x)
### NUMA Support (`numa`)
- CPU to NUMA node mapping
- NUMA node information retrieval from sysfs
- NUMA CPU validation
### Device Management (`device`)
- Block device major/minor number detection
- Device ID resolution for cgroup operations
### Kubernetes Support (`k8s`)
- Ephemeral volume detection
- EmptyDir volume handling
- Kubernetes-specific mount type identification
### Network Namespace (`netns`)
- Network namespace switching with RAII guard pattern
- Network namespace name generation
### OCI Specification Utilities (`spec`)
- Container type detection (PodSandbox, PodContainer)
- Sandbox ID extraction from OCI annotations
- OCI spec loading utilities
### Validation (`validate`)
- Container/exec ID validation
- Environment variable validation
### Hooks (`hooks`)
- OCI hook execution and management
- Hook state tracking
- Timeout handling for hook execution
### Guest Protection (`protection`)
- Confidential computing detection (TDX, SEV, SNP, PEF, SE, ARM CCA , etc.)
- Architecture-specific protection checking (x86_64, s390x, aarch64, powerpc64)
### Random Generation (`rand`)
- Secure random byte generation
- UUID generation
### PCI Device Management (`pcilibs`)
- PCI device enumeration and management
- PCI configuration space access
- Memory resource allocation for PCI devices
## Supported Architectures
- x86_64
- aarch64
- s390x
- powerpc64 (little-endian)
- riscv64
## Supported Operating Systems
**Operating Systems**:
- Linux
## License

View File

@@ -177,7 +177,7 @@ pub fn get_linux_mount_info(mount_point: &str) -> Result<LinuxMountInfo> {
///
/// To ensure security, the `create_mount_destination()` function takes an extra parameter `root`,
/// which is used to ensure that `dst` is within the specified directory. And a safe version of
/// `PathBuf` is returned to avoid TOCTTOU type of flaws.
/// `PathBuf` is returned to avoid TOCTOU type of flaws.
pub fn create_mount_destination<S: AsRef<Path>, D: AsRef<Path>, R: AsRef<Path>>(
src: S,
dst: D,

View File

@@ -1,18 +1,53 @@
# kata-types
This crate is a collection of constants and data types shared by multiple
[Kata Containers](https://github.com/kata-containers/kata-containers/) components.
Constants and data types shared by Kata Containers components.
It defines constants and data types used by multiple Kata Containers components. Those constants
and data types may be defined by Kata Containers or by other projects/specifications, such as:
## Overview
This crate provides common constants, data types, and configuration structures used across multiple [Kata Containers](https://github.com/kata-containers/kata-containers/) components. It includes definitions from:
- Kata Containers project
- [Containerd](https://github.com/containerd/containerd)
- [Kubelet](https://github.com/kubernetes/kubelet)
- [Kubelet](https://github.com/kubernetes/kubernetes)
## Support
## Modules
**Operating Systems**:
- Linux
| Module | Description |
|--------|-------------|
| `annotations` | Annotation keys for CRI-containerd, CRI-O, dockershim, and third-party integrations |
| `capabilities` | Hypervisor capability flags (block device, multi-queue, filesystem sharing, etc.) |
| `config` | Configuration structures for agent, hypervisor (QEMU, Cloud Hypervisor, Firecracker, Dragonball), and runtime |
| `container` | Container-related constants and types |
| `cpu` | CPU resource management types |
| `device` | Device-related definitions |
| `fs` | Filesystem constants |
| `handler` | Handler-related types |
| `initdata` | Initdata specification for TEE data injection |
| `k8s` | Kubernetes-specific paths and utilities (empty-dir, configmap, secret, projected volumes) |
| `machine_type` | Machine type definitions |
| `mount` | Mount point structures and validation |
| `rootless` | Rootless VMM support utilities |
## Configuration
The `config` module supports:
- TOML-based configuration loading
- Drop-in configuration files
- Hypervisor-specific configurations (QEMU, Cloud Hypervisor, Firecracker, Dragonball, Remote)
- Agent configuration
- Runtime configuration
- Shared mount definitions
## Features
- `enable-vendor`: Enable vendor-specific extensions
- `safe-path`: Enable safe path resolution (platform-specific)
## Platform Support
- **Linux**: Fully supported
## License
This code is licensed under [Apache-2.0](../../../LICENSE).
Apache-2.0 - See [LICENSE](../../../LICENSE)

View File

@@ -24,9 +24,7 @@ message SecureMountRequest {
string mount_point = 4;
}
message SecureMountResponse {
string mount_path = 1;
}
message SecureMountResponse {}
message ImagePullRequest {
// - `image_url`: The reference of the image to pull

View File

@@ -4,7 +4,7 @@ Safe Path
A library to safely handle filesystem paths, typically for container runtimes.
There are often path related attacks, such as symlink based attacks, TOCTTOU attacks. The `safe-path` crate
There are often path related attacks, such as symlink based attacks, time-of-check to time-of-use (TOCTOU) attacks. The `safe-path` crate
provides several functions and utility structures to protect against path resolution related attacks.
## Support

View File

@@ -26,7 +26,7 @@
//! information and ensure data out of the container rootfs directory won't be affected
//! by the container. There are several types of attacks related to container mount namespace:
//! - symlink based attack
//! - Time of check to time of use (TOCTTOU)
//! - Time of check to time of use (TOCTOU)
//!
//! This crate provides several mechanisms for container runtimes to safely handle filesystem paths
//! when preparing mount namespace for containers.
@@ -35,13 +35,13 @@
//! - [scoped_resolve()](crate::scoped_resolve()): resolve `unsafe_path` to a relative path,
//! rooted at and constrained by `root`.
//! - [struct PinnedPathBuf](crate::PinnedPathBuf): safe version of `PathBuf` to protect from
//! TOCTTOU style of attacks, which ensures:
//! TOCTOU style of attacks, which ensures:
//! - the value of [`PinnedPathBuf::as_path()`] never changes.
//! - the path returned by [`PinnedPathBuf::as_path()`] is always a symlink.
//! - the filesystem object referenced by the symlink [`PinnedPathBuf::as_path()`] never changes.
//! - the value of [`PinnedPathBuf::target()`] never changes.
//! - [struct ScopedDirBuilder](crate::ScopedDirBuilder): safe version of `DirBuilder` to protect
//! from symlink race and TOCTTOU style of attacks, which enhances security by:
//! from symlink race and TOCTOU style of attacks, which enhances security by:
//! - ensuring the new directories are created under a specified `root` directory.
//! - avoiding symlink race attacks during making directories.
//! - returning a [PinnedPathBuf] for the last level of directory, so it could be used for other

View File

@@ -15,7 +15,7 @@ use std::path::{Component, Path, PathBuf};
use crate::scoped_join;
/// A safe version of [`PathBuf`] pinned to an underlying filesystem object to protect from
/// `TOCTTOU` style of attacks.
/// `TOCTOU` style of attacks.
///
/// A [`PinnedPathBuf`] is a resolved path buffer pinned to an underlying filesystem object, which
/// guarantees:

View File

@@ -117,7 +117,7 @@ pub fn scoped_resolve<R: AsRef<Path>, U: AsRef<Path>>(root: R, unsafe_path: U) -
/// Note that the guarantees provided by this function only apply if the path components in the
/// returned string are not modified (in other words are not replaced with symlinks on the
/// filesystem) after this function has returned. You may use [crate::PinnedPathBuf] to protect
/// from such TOCTTOU attacks.
/// from such TOCTOU attacks.
pub fn scoped_join<R: AsRef<Path>, U: AsRef<Path>>(root: R, unsafe_path: U) -> Result<PathBuf> {
do_scoped_resolve(root, unsafe_path).map(|(root, path)| root.join(path))
}

View File

@@ -99,11 +99,11 @@ HYPERVISOR_REMOTE = remote
ARCH_SUPPORT_DB := x86_64 aarch64
ifneq ($(filter $(ARCH),$(ARCH_SUPPORT_DB)),)
# When set to true, builds the built-in Dragonball hypervisor
USE_BUILDIN_DB := true
USE_BUILTIN_DB := true
else
USE_BUILDIN_DB := false
USE_BUILTIN_DB := false
$(info Dragonball does not support ARCH $(ARCH), disabled. \
Specify "USE_BUILDIN_DB=true" to force enable.)
Specify "USE_BUILTIN_DB=true" to force enable.)
endif
HYPERVISOR ?= $(HYPERVISOR_DB)
@@ -483,7 +483,7 @@ USER_VARS += CONFIG_REMOTE_IN
USER_VARS += CONFIG_QEMU_COCO_DEV_IN
USER_VARS += DESTDIR
USER_VARS += HYPERVISOR
USER_VARS += USE_BUILDIN_DB
USER_VARS += USE_BUILTIN_DB
USER_VARS += DBCMD
USER_VARS += DBCTLCMD
USER_VARS += FCCTLCMD
@@ -651,7 +651,7 @@ COMMIT_MSG = $(if $(COMMIT),$(COMMIT),unknown)
EXTRA_RUSTFEATURES :=
# if use dragonball hypervisor, add the feature to build dragonball in runtime
ifeq ($(USE_BUILDIN_DB),true)
ifeq ($(USE_BUILTIN_DB),true)
EXTRA_RUSTFEATURES += dragonball
endif

View File

@@ -1,131 +1,179 @@
# runtime-rs
## Wath's runtime-rs
## What is runtime-rs
`runtime-rs` is a new component introduced in Kata Containers 3.0, it is a Rust version of runtime(shim). It like [runtime](../runtime), but they have many difference:
`runtime-rs` is a core component of Kata Containers 4.0. It is a high-performance, Rust-based implementation of the containerd shim v2 runtime.
- `runtime-rs` is written in Rust, and `runtime` is written in Go.
- `runtime` is the default shim in Kata Containers 3.0, `runtime-rs` is still under heavy development.
- `runtime-rs` has a completed different architecture than `runtime`, you can check at the [architecture overview](../../docs/design/architecture_3.0).
Key characteristics:
**Note**:
- **Implementation Language**: Rust, leveraging memory safety and zero-cost abstractions
- **Project Maturity**: Production-ready component of Kata Containers 4.0
- **Architectural Design**: Modular framework optimized for Kata Containers 4.0
`runtime-rs` is still under heavy development, you should avoid using it in critical system.
For architecture details, see [Architecture Overview](../../docs/design/architecture_4.0).
## Architecture overview
## Architecture Overview
Also, `runtime-rs` provides the following features:
Key features:
- Turn key solution with builtin `Dragonball` Sandbox, all components in one process
- Async I/O to reduce resource consumption
- Extensible framework for multiple services, runtimes and hypervisors
- Lifecycle management for sandbox and container associated resources
See the [architecture overview](../../docs/design/architecture_3.0)
for details on the `runtime-rs` design.
`runtime-rs` is a runtime written in Rust, it is composed of several crates.
This picture shows the overview about the crates under this directory and the relation between crates.
- **Built-in VMM (Dragonball)**: Deeply integrated into shim lifecycle, eliminating IPC overhead for peak performance
- **Asynchronous I/O**: Tokio-based async runtime for high-concurrency with reduced thread footprint
- **Extensible Framework**: Pluggable hypervisors, network interfaces, and storage backends
- **Resource Lifecycle Management**: Comprehensive sandbox and container resource management
![crates overview](docs/images/crate-overview.svg)
Not all the features have been implemented yet, for details please check the [roadmap](../../docs/design/architecture_3.0/README.md#roadmap).
## Crates
The `runtime-rs` directory contains some crates in the crates directory that compose the `containerd-shim-kata-v2`.
| Crate | Description |
|-|-|
| [`shim`](crates/shim)| containerd shimv2 implementation |
| [`service`](crates/service)| services for containers, includes task service |
| [`runtimes`](crates/runtimes)| container runtimes |
| [`resource`](crates/resource)| sandbox and container resources |
| [`hypervisor`](crates/hypervisor)| hypervisor that act as a sandbox |
| [`agent`](crates/agent)| library used to communicate with agent in the guest OS |
| [`persist`](crates/persist)| persist container state to disk |
|-------|-------------|
| [`shim`](crates/shim) | Containerd shim v2 entry point (start, delete, run commands) |
| [`service`](crates/service) | Services including TaskService for containerd shim protocol |
| [`runtimes`](crates/runtimes) | Runtime handlers: VirtContainer (default), LinuxContainer(experimental), WasmContainer(experimental) |
| [`resource`](crates/resource) | Resource management: network, share_fs, rootfs, volume, cgroups, cpu_mem |
| [`hypervisor`](crates/hypervisor) | Hypervisor implementations |
| [`agent`](crates/agent) | Guest agent communication (KataAgent) |
| [`persist`](crates/persist) | State persistence to disk (JSON format) |
| [`shim-ctl`](crates/shim-ctl) | Development tool for testing shim without containerd |
### shim
`shim` is the entry point of the containerd shim process, it implements containerd shim's [binary protocol](https://github.com/containerd/containerd/tree/v1.6.8/runtime/v2#commands):
Entry point implementing [containerd shim v2 binary protocol](https://github.com/containerd/containerd/tree/main/runtime/v2#commands):
- start: start a new shim process
- delete: delete exist a shim process
- run: run ttRPC service in shim
containerd will launch a shim process and the shim process will serve as a ttRPC server to provide shim service through `TaskService` from `service` crate.
- `start`: Start new shim process
- `delete`: Delete existing shim process
- `run`: Run ttRPC service
### service
The `runtime-rs` has an extensible framework, includes extension of services, runtimes, and hypervisors.
Currently, only containerd compatible `TaskService` is implemented.
`TaskService` has implemented the [containerd shim protocol](https://docs.rs/containerd-shim-protos/0.2.0/containerd_shim_protos/),
and interacts with runtimes through messages.
Extensible service framework. Currently implements `TaskService` conforming to [containerd shim protocol](https://docs.rs/containerd-shim-protos/).
### runtimes
Runtime is a container runtime, the runtime handler handles messages from task services to manage containers.
Runtime handler and Runtime instance is used to deal with the operation for sandbox and container.
Runtime handlers manage sandbox and container operations:
Currently, only `VirtContainer` has been implemented.
| Handler | Feature Flag | Description |
|---------|--------------|-------------|
| `VirtContainer` | `virt` (default) | Virtual machine-based containers |
| `LinuxContainer` | `linux` | Linux container runtime (experimental) |
| `WasmContainer` | `wasm` | WebAssembly runtime (experimental) |
### resource
In `runtime-rs`, all networks/volumes/rootfs are abstracted as resources.
All resources abstracted uniformly:
Resources are classified into two types:
- **Sandbox resources**: network, share-fs
- **Container resources**: rootfs, volume, cgroup
- sandbox resources: network, share-fs
- container resources: rootfs, volume, cgroup
[Here](../../docs/design/architecture_3.0/README.md#resource-manager) is a detailed description of the resources.
Sub-modules: `cpu_mem`, `cdi_devices`, `coco_data`, `network`, `share_fs`, `rootfs`, `volume`
### hypervisor
For `VirtContainer`, there will be more hypervisors to choose.
Supported hypervisors:
Currently, built-in `Dragonball` has been implemented. We have also added initial support for `cloud-hypervisor` with CI being added next.
| Hypervisor | Mode | Description |
|------------|------|-------------|
| Dragonball | Built-in | Integrated VMM for peak performance (default) |
| QEMU | External | Full-featured emulator |
| Cloud Hypervisor | External | Modern VMM (x86_64, aarch64) |
| Firecracker | External | Lightweight microVM |
| Remote | External | Remote hypervisor |
The built-in VMM mode (Dragonball) is recommended for production, offering superior performance by eliminating IPC overhead.
### agent
`agent` is used to communicate with agent in the guest OS from the shim side. The only supported agent is `KataAgent`.
Communication with guest OS agent via ttRPC. Supports `KataAgent` for full container lifecycle management.
### persist
Persist defines traits and functions to help different components save state to disk and load state from disk.
State serialization to disk for sandbox recovery after restart. Stores `state.json` under `/run/kata/<sandbox-id>/`.
### helper libraries
## Build from Source and Install
Some helper libraries are maintained in [the library directory](../libs) so that they can be shared with other rust components.
### Prerequisites
## Build and install
Download `Rustup` and install Rust. For Rust version, see `languages.rust.meta.newest-version` in [`versions.yaml`](../../versions.yaml).
See the
[build from the source section of the rust runtime installation guide](../../docs/install/kata-containers-3.0-rust-runtime-installation-guide.md#build-from-source-installation).
Example for `x86_64`:
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
rustup install ${RUST_VERSION}
rustup default ${RUST_VERSION}-x86_64-unknown-linux-gnu
```
### Musl Support (Optional)
For fully static binary:
```bash
# Add musl target
rustup target add x86_64-unknown-linux-musl
# Install musl libc (example: musl 1.2.3)
curl -O https://git.musl-libc.org/cgit/musl/snapshot/musl-1.2.3.tar.gz
tar vxf musl-1.2.3.tar.gz
cd musl-1.2.3/
./configure --prefix=/usr/local/
make && sudo make install
```
### Install Kata 4.0 Rust Runtime Shim
```bash
git clone https://github.com/kata-containers/kata-containers.git
cd kata-containers/src/runtime-rs
make && sudo make install
```
After installation:
- Config file: `/usr/share/defaults/kata-containers/configuration.toml`
- Binary: `/usr/local/bin/containerd-shim-kata-v2`
### Install Without Built-in Dragonball VMM
To build without the built-in Dragonball hypervisor:
```bash
make USE_BUILTIN_DB=false
```
Specify hypervisor during installation:
```bash
sudo make install HYPERVISOR=qemu
# or
sudo make install HYPERVISOR=cloud-hypervisor
```
## Configuration
`runtime-rs` has the same [configuration as `runtime`](../runtime/README.md#configuration) with some [limitations](#limitations).
Configuration files in `config/`:
| Config File | Hypervisor | Notes |
|-------------|------------|-------|
| `configuration-dragonball.toml.in` | Dragonball | Built-in VMM |
| `configuration-qemu-runtime-rs.toml.in` | QEMU | Default external |
| `configuration-cloud-hypervisor.toml.in` | Cloud Hypervisor | Modern VMM |
| `configuration-rs-fc.toml.in` | Firecracker | Lightweight microVM |
| `configuration-remote.toml.in` | Remote | Remote hypervisor |
| `configuration-qemu-tdx-runtime-rs.toml.in` | QEMU + TDX | Intel TDX confidential computing |
| `configuration-qemu-snp-runtime-rs.toml.in` | QEMU + SEV-SNP | AMD SEV-SNP confidential computing |
| `configuration-qemu-se-runtime-rs.toml.in` | QEMU + SEV | AMD SEV confidential computing |
| `configuration-qemu-coco-dev-runtime-rs.toml.in` | QEMU + CoCo | CoCo development |
See [runtime configuration](../runtime/README.md#configuration) for configuration options.
## Logging
See the
[debugging section of the developer guide](../../docs/Developer-Guide.md#troubleshoot-kata-containers).
See [Developer Guide - Troubleshooting](../../docs/Developer-Guide.md#troubleshoot-kata-containers).
## Debugging
See the
[debugging section of the developer guide](../../docs/Developer-Guide.md#troubleshoot-kata-containers).
An [experimental alternative binary](crates/shim-ctl/README.md) is available that removes containerd dependencies and makes it easier to run the shim proper outside of the runtime's usual deployment environment (i.e. on a developer machine).
For development, use [`shim-ctl`](crates/shim-ctl/README.md) to test shim without containerd dependencies.
## Limitations
For Kata Containers limitations, see the
[limitations file](../../docs/Limitations.md)
for further details.
`runtime-rs` is under heavy developments, and doesn't support all features as the Golang version [`runtime`](../runtime), check the [roadmap](../../docs/design/architecture_3.0/README.md#roadmap) for details.
See [Limitations](../../docs/Limitations.md) for details.

View File

@@ -1,23 +1,21 @@
# Multi-vmm support for runtime-rs
# Multi-VMM Support for runtime-rs
## 0. Status
External hypervisor support is currently being developed.
Multiple external hypervisors are supported in the Rust runtime, including QEMU, Firecracker, and Cloud Hypervisor. This document outlines the key implementation details for multi-VMM support in the Rust runtime.
See [the main tracking issue](https://github.com/kata-containers/kata-containers/issues/4634)
for further details.
## 1. Hypervisor Configuration
Some key points for supporting multi-vmm in rust runtime.
## 1. Hypervisor Config
The diagram below gives an overview for the hypervisor config
The diagram below provides an overview of the hypervisor configuration:
![hypervisor config](../../docs/images/hypervisor-config.svg)
VMM's config info will be loaded when initialize the runtime instance, there are some important functions need to be focused on.
VMM configuration information is loaded during runtime instance initialization. The following key functions are critical to this process:
### `VirtContainer::init()`
This function initialize the runtime handler. It will register the plugins into the HYPERVISOR_PLUGINS. Different plugins are needed for different hypervisors.
This function initializes the runtime handler and registers plugins into the `HYPERVISOR_PLUGINS` registry. Different hypervisors require different plugins:
```rust
#[async_trait]
impl RuntimeHandler for VirtContainer {
@@ -30,21 +28,24 @@ impl RuntimeHandler for VirtContainer {
}
```
[This is the plugin method for QEMU. Other VMM plugin methods haven't support currently.](../../../libs/kata-types/src/config/hypervisor/qemu.rs)
QEMU plugin defines the methods to adjust and validate the hypervisor config file, those methods could be modified if it is needed.
Currently, the QEMU plugin is fully implemented, we can take it as an example. The QEMU plugin defines methods to adjust and validate the hypervisor configuration file. These methods can be customized as needed.
Details of the QEMU plugin implementation can be found in [QEMU Plugin Implementation](../../../libs/kata-types/src/config/hypervisor/qemu.rs)
When loading the TOML configuration, the registered plugins are invoked to adjust and validate the configuration file:
After that, when loading the TOML config, the plugins will be called to adjust and validate the config file.
```rust
async fn try_init(&mut self, spec: &oci::Spec) -> Result<()> {
async fn try_init(&mut self, spec: &oci::Spec) -> Result<()> {
...
let config = load_config(spec).context("load config")?;
...
}
```
### new_instance
### `new_instance`
This function creates a runtime instance that manages container and sandbox operations. During this process, a hypervisor instance is created. For QEMU, the hypervisor instance is instantiated and configured with the appropriate configuration file:
This function will create a runtime_instance which include the operations for container and sandbox. At the same time, a hypervisor instance will be created. QEMU instance will be created here as well, and set the hypervisor config file
```rust
async fn new_hypervisor(toml_config: &TomlConfig) -> Result<Arc<dyn Hypervisor>> {
let hypervisor_name = &toml_config.runtime.hypervisor_name;
@@ -70,7 +71,8 @@ async fn new_hypervisor(toml_config: &TomlConfig) -> Result<Arc<dyn Hypervisor>>
## 2. Hypervisor Trait
[To support multi-vmm, the hypervisor trait need to be implemented.](./src/lib.rs)
[The hypervisor trait must be implemented to support multi-VMM architectures.](./src/lib.rs)
```rust
pub trait Hypervisor: Send + Sync {
// vm manager
@@ -97,6 +99,7 @@ pub trait Hypervisor: Send + Sync {
async fn save_state(&self) -> Result<HypervisorState>;
}
```
In current design, VM will be started in the following steps.
In the current design, the VM startup process follows these steps:
![vmm start](../../docs/images/vm-start.svg)

View File

@@ -15,13 +15,13 @@ serde = { workspace = true, features = ["rc", "derive"] }
serde_json = { workspace = true }
tokio = { workspace = true, features = ["sync", "rt"] }
nix = "0.26.2"
thiserror = { workspace = true }
thiserror = "2.0.18"
# Cloud Hypervisor public HTTP API functions
# Note that the version specified is not necessarily the version of CH
# being used. This version is used to pin the CH config structure
# which is relatively static.
api_client = { git = "https://github.com/cloud-hypervisor/cloud-hypervisor", crate = "api_client", tag = "v27.0" }
api_client = { git = "https://github.com/cloud-hypervisor/cloud-hypervisor", tag = "v51.0" }
# Local dependencies
kata-types = { workspace = true }

View File

@@ -135,7 +135,7 @@ pub async fn cloud_hypervisor_vm_netdev_add_with_fds(
"PUT",
"vm.add-net",
Some(&serialised),
request_fds,
&request_fds,
)
.map_err(|e| anyhow!(e))?;

View File

@@ -65,8 +65,6 @@ INITRDCONFIDENTIALNAME = $(PROJECT_TAG)-initrd-confidential.img
IMAGENAME_NV = $(PROJECT_TAG)-nvidia-gpu.img
IMAGENAME_CONFIDENTIAL_NV = $(PROJECT_TAG)-nvidia-gpu-confidential.img
INITRDNAME_NV = $(PROJECT_TAG)-initrd-nvidia-gpu.img
INITRDNAME_CONFIDENTIAL_NV = $(PROJECT_TAG)-initrd-nvidia-gpu-confidential.img
TARGET = $(BIN_PREFIX)-runtime
RUNTIME_OUTPUT = $(CURDIR)/$(TARGET)
@@ -136,8 +134,6 @@ INITRDCONFIDENTIALPATH := $(PKGDATADIR)/$(INITRDCONFIDENTIALNAME)
IMAGEPATH_NV := $(PKGDATADIR)/$(IMAGENAME_NV)
IMAGEPATH_CONFIDENTIAL_NV := $(PKGDATADIR)/$(IMAGENAME_CONFIDENTIAL_NV)
INITRDPATH_NV := $(PKGDATADIR)/$(INITRDNAME_NV)
INITRDPATH_CONFIDENTIAL_NV := $(PKGDATADIR)/$(INITRDNAME_CONFIDENTIAL_NV)
ROOTFSTYPE_EXT4 := \"ext4\"
ROOTFSTYPE_XFS := \"xfs\"
@@ -483,16 +479,12 @@ ifneq (,$(QEMUCMD))
KERNELPATH_CONFIDENTIAL_NV = $(KERNELDIR)/$(KERNELNAME_CONFIDENTIAL_NV)
DEFAULTVCPUS_NV = 1
DEFAULTMEMORY_NV = 2048
DEFAULTMEMORY_NV = 8192
DEFAULTTIMEOUT_NV = 1200
DEFAULTVFIOPORT_NV = root-port
DEFAULTPCIEROOTPORT_NV = 8
# Disable the devtmpfs mount in guest. NVRC does this, and later kata-agent
# attempts this as well in a non-failing manner. Otherwise, NVRC fails when
# using an image and /dev is already mounted.
KERNELPARAMS_NV = "cgroup_no_v1=all"
KERNELPARAMS_NV += "devtmpfs.mount=0"
KERNELPARAMS_NV += "pci=realloc"
KERNELPARAMS_NV += "pci=nocrs"
KERNELPARAMS_NV += "pci=assign-busses"
@@ -660,10 +652,6 @@ USER_VARS += IMAGENAME_NV
USER_VARS += IMAGENAME_CONFIDENTIAL_NV
USER_VARS += IMAGEPATH_NV
USER_VARS += IMAGEPATH_CONFIDENTIAL_NV
USER_VARS += INITRDNAME_NV
USER_VARS += INITRDNAME_CONFIDENTIAL_NV
USER_VARS += INITRDPATH_NV
USER_VARS += INITRDPATH_CONFIDENTIAL_NV
USER_VARS += KERNELNAME_NV
USER_VARS += KERNELPATH_NV
USER_VARS += KERNELNAME_CONFIDENTIAL_NV

View File

@@ -229,6 +229,12 @@ disable_image_nvdimm = @DEFDISABLEIMAGENVDIMM_CLH@
# The default setting is "no-port"
hot_plug_vfio = "no-port"
# In a confidential compute environment hot-plugging can compromise
# security.
# Enable cold-plugging of VFIO devices to a root-port.
# The default setting is "no-port", which means disabled.
cold_plug_vfio = "no-port"
# Path to OCI hook binaries in the *guest rootfs*.
# This does not affect host-side hooks which must instead be added to
# the OCI spec passed to the runtime.

View File

@@ -235,6 +235,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently only implemented
# for SCSI.

View File

@@ -247,6 +247,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.

View File

@@ -287,6 +287,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.
@@ -599,7 +609,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the
@@ -727,7 +737,7 @@ disable_guest_empty_dir = @DEFDISABLEGUESTEMPTYDIR@
# - block-encrypted
# Plugs a block device to be encrypted in the guest.
#
emptydir_mode = "@DEFEMPTYDIRMODE@"
emptydir_mode = "@DEFEMPTYDIRMODE_COCO@"
# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,

View File

@@ -264,6 +264,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.
@@ -576,7 +586,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the
@@ -704,7 +714,7 @@ disable_guest_empty_dir = @DEFDISABLEGUESTEMPTYDIR@
# - block-encrypted
# Plugs a block device to be encrypted in the guest.
#
emptydir_mode = "@DEFEMPTYDIRMODE@"
emptydir_mode = "@DEFEMPTYDIRMODE_COCO@"
# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,

View File

@@ -246,6 +246,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.
@@ -578,7 +588,7 @@ debug_console_enabled = false
# Agent connection dialing timeout value in seconds
# (default: 90)
dial_timeout = 90
dial_timeout = @DEFAULTTIMEOUT_NV@
[runtime]
# If enabled, the runtime will log additional debug messages to the

View File

@@ -249,6 +249,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.
@@ -689,7 +699,7 @@ disable_guest_empty_dir = @DEFDISABLEGUESTEMPTYDIR@
# - block-encrypted
# Plugs a block device to be encrypted in the guest.
#
emptydir_mode = "@DEFEMPTYDIRMODE@"
emptydir_mode = "@DEFEMPTYDIRMODE_COCO@"
# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,

View File

@@ -281,6 +281,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.

View File

@@ -263,6 +263,16 @@ block_device_cache_direct = false
# Default false
block_device_cache_noflush = false
# Specifies the logical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_logical_sector_size = 0
# Specifies the physical sector size, in bytes, reported by block devices to the guest.
# Common values are 512 and 4096. Set to 0 to use the QEMU/hypervisor default.
# Default 0
block_device_physical_sector_size = 0
# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently implemented
# for virtio-scsi and virtio-blk.

Some files were not shown because too many files have changed in this diff Show More