Add a how-to describing how runtime-rs sizes static sandboxes from
overhead plus requested CPU/memory, including that fractional vCPU
results are rounded up for VMM-visible vCPU counts, and link it from the
how-to README.
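The sizing rule can be illustrated with a minimal sketch. The function and
parameter names below are hypothetical and are not taken from runtime-rs;
only the "overhead plus request, round fractional vCPUs up" rule comes from
this change.

```python
import math


def size_static_sandbox(overhead_vcpus, requested_vcpus,
                        overhead_mem_mib, requested_mem_mib):
    """Return (vcpus, memory_mib) for a statically sized sandbox.

    Hypothetical helper: sums overhead and requested resources, then
    rounds a fractional vCPU total up to the next whole vCPU, because
    the VMM can only expose an integer number of vCPUs.
    """
    total_vcpus = overhead_vcpus + requested_vcpus
    vmm_vcpus = math.ceil(total_vcpus)
    memory_mib = overhead_mem_mib + requested_mem_mib
    return vmm_vcpus, memory_mib


# 0.5 vCPU overhead + 1.2 vCPU requested = 1.7, exposed as 2 vCPUs.
print(size_static_sandbox(0.5, 1.2, 256, 1024))  # (2, 1280)
```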
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Update the composable-vm-images proposal with the design decisions we only
arrived at after experimenting with the implementation:
* Replace the hardcoded agent path-resolution table with the data-driven
components.toml manifest (process levels, args/optional_args, env,
wait_socket, ${...} substitution, and select/variants), keeping the agent
generic.
* Document the attester-variant contract: NVRC exports KATA_ATTESTER_VARIANT
and the manifest selects the stock vs NVIDIA attestation-agent.
* Document the runtime dependency requirements found during bring-up: the
nvidia attester's LD_LIBRARY_PATH (libnvat closure in the coco addon +
NVML in the gpu addon) and the NVML-init failure mode, plus CDH
secure_mount tooling placement -- plain storage (mke2fs/mkfs.ext4/dd) in
the base vs encrypted storage (cryptsetup) in the coco addon, the CDH
PATH, and the base/addon ABI lockstep.
* Reflect the storage tooling and bundled libraries in the base/coco-addon
build sections, and mark the GPU addon as implemented.
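To illustrate the select/variants and ${...} substitution ideas, here is a
rough Python sketch. It is not the agent's implementation: the dictionary
layout, the variant names ("stock", "nvidia"), the variable names and the
paths are all assumptions made for the example, and the real
components.toml schema may differ.

```python
from string import Template

# Hypothetical manifest entry, written as a dict instead of TOML.
component = {
    "select": "KATA_ATTESTER_VARIANT",   # env var exported by NVRC
    "default": "stock",                  # assumed fallback variant
    "variants": {
        "stock": {"path": "${base_dir}/attestation-agent"},
        "nvidia": {"path": "${coco_dir}/attestation-agent-nvidia"},
    },
}


def resolve_component(component, env, variables):
    """Pick a variant from the environment, then expand ${...} in its path."""
    variant_name = env.get(component["select"], component["default"])
    variant = component["variants"][variant_name]
    return Template(variant["path"]).substitute(variables)


# Illustrative variables only; the actual mount points are not stated here.
variables = {"base_dir": "/usr/bin", "coco_dir": "/opt/coco"}
print(resolve_component(component, {"KATA_ATTESTER_VARIANT": "nvidia"}, variables))
print(resolve_component(component, {}, variables))
```

Keeping the selection in data rather than in code is what lets the agent
stay generic: a new variant only needs a new manifest entry.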
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
I created this over the course of testing my VISIBLE_CDI_DEVICES
changes. I think this will be useful to folks who don't understand the
right way to deploy custom artifacts.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Switch the NVIDIA GPU example values file to install Kata via the
Job-based deployment mode (deploymentMode: job) instead of the
always-on, privileged DaemonSet, so that nothing keeps running on the
node once the install completes.
To exercise this in our CI, make the helm_helper aware of the deployment
mode coming from the (base) values file:
- In "job" mode, clear job.nodeSelectorExpressions so the dispatcher
targets every discovered node. Our CI clusters are typically
single-node, where the only node carries the control-plane label,
and the default selector excludes control-plane/master nodes.
- There is no always-on DaemonSet to wait on in "job" mode. The
dispatcher runs as a blocking post-install hook and the final
per-node stage labels the node, so wait until at least one node
carries the katacontainers.io/kata-runtime label as the
"install complete" signal (dumping Job/pod logs on timeout).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Clean up the runtime configuration section by focusing first on the Helm
configuration. Then pivot to a further explanation of how the runtime
can be configured directly. Link to where these config parameters are
explained in more depth.
Add open-in-new-tab (already downloaded in requirements.txt) in the
mkdocs plugin config so that links don't open in the same tab.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Document the new opt-in deploymentMode: job alongside the default
DaemonSet model in the maintained docs (not just the chart README):
- helm-configuration.md: add a "Deployment Modes (DaemonSet vs Job)"
section covering the dispatcher-driven staged install/cleanup
pipelines, why a dispatcher is used instead of Helm-rendered per-node
Jobs (O(1) release, guaranteed coverage, paced rollout, explicit
privilege split), the "re-run helm upgrade to cover newly added
nodes" model (no always-on reconcile component), and the
node-selection precedence (job.nodes > job.nodeSelector +
job.nodeSelectorExpressions) that defaults to worker nodes.
- installation.md: note that the DaemonSet is the default but no longer
the only model, linking to the section above.
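The node-selection precedence described above can be sketched as follows.
This is an illustration, not the dispatcher's code: the node dictionaries
and the select_nodes helper are hypothetical, and the expression operators
are assumed to follow the usual Kubernetes matchExpressions semantics.

```python
def matches_expression(labels, expr):
    """Evaluate one Kubernetes-style match expression against node labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def select_nodes(nodes, job):
    """Return node names; an explicit job.nodes list wins over selectors."""
    explicit = job.get("nodes") or []
    if explicit:
        return [n["name"] for n in nodes if n["name"] in explicit]
    selector = job.get("nodeSelector") or {}
    expressions = job.get("nodeSelectorExpressions") or []
    selected = []
    for node in nodes:
        labels = node.get("labels", {})
        if (all(labels.get(k) == v for k, v in selector.items())
                and all(matches_expression(labels, e) for e in expressions)):
            selected.append(node["name"])
    return selected


nodes = [
    {"name": "cp", "labels": {"node-role.kubernetes.io/control-plane": ""}},
    {"name": "w1", "labels": {}},
]
# Assumed shape of the default "worker nodes only" expression.
workers_only = {"nodeSelectorExpressions": [
    {"key": "node-role.kubernetes.io/control-plane", "operator": "DoesNotExist"},
]}
print(select_nodes(nodes, workers_only))                     # ['w1']
print(select_nodes(nodes, {**workers_only, "nodes": ["cp"]}))  # ['cp']
```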
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Pin idna to 3.15 and pymdown-extensions to 10.21.3 to address
security vulnerabilities:
- GHSA-65pc-fj4g-8rjx (idna, severity 6.9)
- GHSA-62q4-447f-wv8h (pymdown-extensions, severity 4.3)
- GHSA-r6h4-mm7h-8pmq (pymdown-extensions, severity 2.7)
These dependencies were previously transitive and vulnerable.
They are now explicitly pinned to secure versions.
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Although the config knob is parsed, it is unused in the rust shim,
which renders it useless. Remove the file_mem_backend config option,
as there are currently no users of it. As this option is usable in
the go shim, we leave it intact there.
For the rust shim, /dev/shm is still used in a similar way to
the go shim when filesystem sharing is enabled (virtio-fs). Future
use cases that need other file memory backends are currently planned
to define them in a similar manner: based on the
configuration/platform, determine the proper file memory backend,
but do not let end users choose it.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add comprehensive documentation for using the virtio-fs-nydus shared
filesystem with Kata Containers. This guide:
(1) Clarifies the configuration options for virtio-fs-nydus and nydus
image preparation and usage.
(2) Updates daemon configuration and lifecycle management and introduces
the standalone, inline nydus architecture.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
kata-monitor is published as a standalone container image starting
with 3.32.0; point users at it from the metrics design doc and the
Prometheus-on-Kubernetes how-to, and switch the DaemonSet manifest to
the dedicated image (keeping the runtime endpoint/listen settings and
hostPath cleanups).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Allow operators to provide per-shim drop-in TOML for built-in runtimes
and reconcile stale override files so upgrades and migrations remain
safe when drop-ins are added or removed.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex
Add an RFC document describing the composable image architecture that
replaces monolithic guest rootfs images with a lean base image plus
purpose-specific addon images cold-plugged as virtio-blk devices.
The proposal covers the runtime configuration (extra_images), host-side
cold-plugging, guest-side mounting via systemd and dm-verity, agent-side
dynamic path resolution, the image build pipeline, and the security
model.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Migrate trace-forwarder from the deprecated opentelemetry-jaeger
exporter to the modern opentelemetry-otlp exporter.
This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a
medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger
crate is no longer maintained and depends on vulnerable thrift versions
(0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift
and is actively maintained.
Changes:
- Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml
- Update tracer.rs to use OTLP exporter instead of Jaeger exporter
- Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag
- Update server.rs to use TracerProvider instead of SpanExporter
- Update documentation to reflect OTLP migration
- Add examples for common OTLP-compatible collectors
Breaking change: Users must update their trace-forwarder invocations
to use --otlp-endpoint instead of --jaeger-host and --jaeger-port.
Default endpoint: http://localhost:4317 (OTLP gRPC)
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Update the debugging guidance to explain the shared enable_debug
baseline for Go and runtime-rs. Document runtime-rs component log_level
controls and clarify that containerd debug is not required for
runtime-rs Kata logs in journald.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
max_unmerged_layers = 1 only applies to fsmerge mode. Since containerd
temporarily does not support fsmerge, reset it to the default of 0.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add k8s-nvidia-numa.bats with five tests that validate NUMA behaviour
on hosts where NUMA is configured by default (qemu-nvidia-gpu,
qemu-nvidia-gpu-snp, qemu-nvidia-gpu-tdx):
1. Multi-node sandbox (large workload spanning all host NUMA nodes):
- Guest NUMA node count matches host
- Guest vCPU distribution is balanced across nodes (max-min <= 1)
- Guest memory is distributed across NUMA nodes
- Host-side vCPU pinning is balanced across NUMA nodes
2. Right-sized single-node sandbox (small workload fitting one node):
- Guest collapses to a single NUMA node
- All host vCPU threads pinned to that one NUMA node
3. GPU passthrough with VFIO, multi-node:
- Guest NUMA topology is balanced (same as test 1)
- Guest GPU's NUMA node matches the host GPU's NUMA node
(resolved via the vfio-pci,host=<BDF> from the QEMU command
line and /sys/bus/pci/devices/<BDF>/numa_node)
- QEMU command line contains pxb-pcie and policy=bind
- Host vCPU pinning is balanced
4. GPU passthrough with VFIO, right-sized single-node: small workload
plus GPU that fits in a single host NUMA node:
- Guest collapses to a single NUMA node
- The chosen node is the GPU's host NUMA node, not just any node
that fits — verified by matching host-nodes= in the memory
backend and pxb-pcie numa_node= against the GPU's host node
- Guest GPU reports the same NUMA node as the host GPU
5. Explicit numa_mapping in the runtime TOML (QEMU-only):
- Drops a config.d/ fragment that sets numa_mapping = ["1"], so the
auto-derive + right-sizing path is bypassed entirely
- Guest sees exactly 1 NUMA node
- QEMU memory backend is bound to host node 1 (host-nodes=1,
policy=bind), not host node 0
- Host-side vCPU threads land on host node 1
- Drop-in is removed on teardown so subsequent tests are unaffected
Guest-side checks use a dedicated container image
(quay.io/kata-containers/numa) that reads sysfs and prints results to
stdout — no kubectl exec or CoCo policy overrides needed.
Host-side checks (crictl, pgrep, taskset) run directly on the host
via sudo; a standalone numa-pinning-check.sh script handles the vCPU
thread affinity inspection. The config.d/ helpers used by test 5 are
runtime-agnostic (probe Go vs runtime-rs layout on disk) but the test
is gated to qemu-* shims since runtime-rs does not yet implement
NUMA.
Skips cleanly on single-NUMA hosts, unsupported hypervisors, or when
no nvidia.com/pgpu resources are available (GPU tests only).
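The balance criterion used for the vCPU distribution checks (max-min <= 1)
can be sketched in Python as below. This is only an illustration of the
criterion; the actual tests are written in bats/shell, and the function
name and input format here are hypothetical.

```python
def vcpus_balanced(vcpu_nodes, num_nodes):
    """Check that vCPUs are spread evenly across NUMA nodes.

    vcpu_nodes: one NUMA node index per vCPU (hypothetical input format).
    num_nodes:  number of NUMA nodes expected, so that nodes holding no
                vCPUs at all still count as zero.
    """
    counts = [0] * num_nodes
    for node in vcpu_nodes:
        counts[node] += 1
    return max(counts) - min(counts) <= 1


print(vcpus_balanced([0, 0, 1, 1, 0], 2))  # True: 3 vs 2 vCPUs
print(vcpus_balanced([0, 0, 0, 1], 2))     # False: 3 vs 1 vCPUs
```

Passing the expected node count explicitly matters: counting only the nodes
that appear in the input would report a sandbox with every vCPU on one node
as balanced.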
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a step-by-step how-to guide covering host inspection, Kata NUMA
drop-in setup (via kata-deploy Helm and manual config.d/), pod
deployment examples, and guest/host verification procedures.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
- Added click==8.3.3 to docs/requirements.txt
- Click 8.3.3 is the latest version for Python >=3.10
- Required for mkdocs toolchain compatibility and resolves vulnerability in indirect dependencies
- Ref: CVE-2026-7246
Signed-off-by: pavithiran34 <pavithiran.p@ibm.com>
The cdh_api_timeout_ms configuration parameter wasn't being used
anywhere, so add the logic to process it as an annotation into the runtime-rs
agent config and then use that as a kernel_param.
Assisted-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Keep virtio_fs_extra_args support in code, but remove it from default
enable_annotations and add explicit security warnings in Makefiles and
docs.
Release-note note: mirror this hardening in release notes so operators
know this remains opt-in and carries host-side risk when enabled.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Revise the GPU Operator config for the NVIDIA Kata Containers device
plugin. The older config targets the vGPU/KubeVirt use case.
Signed-off-by: Rajat Chopra <rajatc@nvidia.com>
Now that we're adding support for the rust runtime, let's also update
the docs.
We may also need to update the docs again once we start testing with
different VMMs, but that's not in the scope for this PR.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
We should not ship configurations that we do not actively test.
This commit drops the following from the kata-deploy helm chart:
values.yaml:
- arm64 from supportedArches for the clh shim
- arm64 from supportedArches for the cloud-hypervisor shim
- arm64 from supportedArches for the dragonball shim
- arm64 from supportedArches for the fc shim
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- the entire qemu-cca shim definition
try-kata-tee.values.yaml:
- CCA from the file description comment
- qemu-cca from the TEE shims list comment
- the entire qemu-cca shim definition
- arm64: qemu-cca from the defaultShim mapping, replaced with
arm64: qemu-coco-dev-runtime-rs (which is tested)
try-kata-nvidia-gpu.values.yaml:
- arm64 from supportedArches for the qemu-nvidia-gpu shim
- arm64: qemu-nvidia-gpu from the defaultShim mapping
Once arm64 and qemu-cca support are properly tested, they can be
re-added.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
This commit prunes the documentation tree by removing files
that are either no longer relevant to the current architecture
or have been superseded by newer guides.
Specifically, remove the doc Intel-Discrete-GPU-passthrough-and-Kata.md
and update the using-Intel-QAT-and-kata.md index in nav.yaml.
Refining the documentation helps ensure that new contributors
find accurate and up-to-date information.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
- Restructure document with clearer sections and better readability
- Add configuration format examples for both runtimes
- Add technical details including data flow and implementation references
- Add debugging section for troubleshooting
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Document the end-to-end workflow for using the containerd EROFS
snapshotter with Kata Containers runtime-rs, covering containerd
configuration, Kata QEMU settings, and pod deployment examples
via crictl/ctr/Kubernetes.
Include prerequisites (containerd >= 2.2, runtime-rs main branch),
QEMU VMDK format verification command, architecture diagram,
VMDK descriptor format reference, and troubleshooting guide.
Note that Cloud Hypervisor, Firecracker, and Dragonball do not
support VMDK block devices and are currently unsupported for
fsmerged EROFS rootfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add aarch64/arm64 to the list of supported architectures for
qemu-coco-dev and qemu-coco-dev-runtime-rs shims across kata-deploy
configuration, Helm chart values, and test helper scripts.
Note that guest-components and the related build dependencies are not
yet wired for arm64 in these configurations; those will be addressed
separately.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add the pod annotation config to the doc site. A symlink is created
at docs/pod-annotations.md that points to
how-to/how-to-set-sandbox-config-kata.md so that the URL for this file is
`/pod-annotations`. Also add brief contributing guidelines and
how-tos for running the documentation site locally for previews.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Now, we include the nvrc.smi.srs=1 flag in the default kernel cmdline.
Thus, we can remove the guidance for people to add it themselves when
not using attestation. In fact, users don't really need to know about
this flag at all.
Signed-off-by: Tobin Feldman-Fitzthum <tfeldmanfitz@nvidia.com>
Mermaid code makes it simple and flexible to update pictures and
diagrams. Also remove the legacy PNG pictures to reduce the
kata-statics release file size.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>