Pull the kata-deploy container's environment block and host
volume/volumeMount definitions out of the DaemonSet template into
reusable named templates in _helpers.tpl:
- kata-deploy.commonEnv
- kata-deploy.commonVolumeMounts
- kata-deploy.commonVolumes
These are derived purely from chart values and are independent of the
deployment model, so they can be shared verbatim by upcoming per-node
install/cleanup Jobs without duplicating the (large) env wiring.
Pure refactor: the rendered DaemonSet is byte-for-byte identical to
before (verified via normalized `helm template` diff across default and
multiInstallSuffix/userDropIn/customRuntimes permutations).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Phase 1 of migrating kata-deploy from a DaemonSet to a staged JobSet
workflow: refactor the binary's install/cleanup flows into discrete,
independently invocable stages while keeping the existing DaemonSet
path fully working.
Add new staged subcommands that each run one step and exit, so a JobSet
can drive them as ordered initContainers/Jobs per node:
install: host-check -> artifacts -> cri -> label
cleanup (reverse): unlabel -> revert-cri -> remove-artifacts
`install` becomes a compatibility wrapper composing the install stages
in the canonical order, so the DaemonSet deployment model is unchanged.
The DaemonSet `cleanup` (with its DaemonSet-presence gating) is left
intact; the staged cleanup actions are added alongside it and skip that
gating since the JobSet workflow only schedules them on a real uninstall.
Each stage has an idempotent skip check so reruns are safe:
- install label / cleanup unlabel: short-circuit via the node label
- cleanup remove-artifacts: skip when the install dir is already gone
- cleanup revert-cri: skip the disruptive runtime restart when the CRI
drop-ins are already absent (new cri_drop_in_present helper)
Introduce a shared KATA_RUNTIME_LABEL constant and add rstest-based
tests covering the subcommand-name -> Action mapping, rejection of
unknown actions, and the visible/hidden help semantics.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Package and ship the dispatcher built in the previous commit so the
job-mode Helm chart has an image to run.
- Dockerfile.components: build kata-deploy and kata-deploy-job-dispatcher
from the same rust-builder stage (one compile), and run fmt/clippy/
test for both crates.
- job-dispatcher/Dockerfile: a minimal distroless/static image containing
only the dispatcher binary and CA certs - it is an API client, so it
needs nothing from the host.
- local-build: kata-deploy-job-dispatcher becomes its own build component
with its own static tarball
(kata-deploy-static-kata-deploy-job-dispatcher.tar.zst); the shared
rust-builder output is reused so the two components do not recompile
the workspace locally. The payload script builds and pushes a separate
"<kata-deploy registry>-job-dispatcher" image with the same tag scheme,
and release.sh publishes its multi-arch manifest symmetrically.
- CI: add kata-deploy-job-dispatcher to the build-kata-deploy-components
matrices (its tarball is picked up by the existing kata-artifacts-*
glob), and gate it in the kata-deploy rust static checks.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Add a small, deployment-agnostic dispatcher binary that runs exactly one
Kubernetes Job per selected node and paces the rollout, so callers get
guaranteed per-node coverage without encoding the fan-out in Helm.
Motivation: templating one Job per node into a Helm release does not
scale (the release Secret hits etcd's 1 MiB limit and hooks run
sequentially), and a single Indexed Job cannot guarantee per-node
coverage when paced - the scheduler ignores completed pods when
evaluating topology spread, so nodes get uneven numbers of pods. A tiny
dispatcher that enumerates nodes live and creates node-pinned Jobs itself
sidesteps both problems and keeps the Helm release O(1) in fleet size.
The dispatcher:
- enumerates target nodes live (explicit --nodes list or
--node-selector label selector), paginating the API;
- stamps out one Job per node from a YAML template, pinning it with
nodeName and an owner label for server-side filtering;
- keeps at most --parallelism Jobs in flight, refilling as they finish,
and sets an OwnerReference to the owner Job so the per-node Jobs are
garbage-collected with it;
- is a plain API client (kube): it never touches the host, so it can
run fully unprivileged.
Node membership is resolved live on each run, not frozen at Helm
template-render time: re-running the dispatcher (e.g. via `helm upgrade`)
picks up nodes added since the last run and skips ones already done, as
the per-node stages are idempotent. The dispatcher is one-shot, however
- it does not watch the API, so nodes added while it is not running are
only covered by the next run.
job.rs holds the pure helpers (node-name sanitization, deterministic Job
naming, template instantiation, status interpretation) with rstest unit
tests; main.rs wires up the CLI and the fan-out loop.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Pin idna to 3.15 and pymdown-extensions to 10.21.3 to address
security vulnerabilities:
- GHSA-65pc-fj4g-8rjx (idna, severity 6.9)
- GHSA-62q4-447f-wv8h (pymdown-extensions, severity 4.3)
- GHSA-r6h4-mm7h-8pmq (pymdown-extensions, severity 2.7)
These dependencies were previously transitive and vulnerable.
They are now explicitly pinned to secure versions.
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
While checking the content of the vendor tarball artifact in the 3.31.0
release page, I realized that it is lacking most of the rust code and
all the go code. It turns out that the script is badly broken in many
ways :
1. Cargo workspace conflicts: Vendored dependencies were treated as
workspace members, causing "current package believes it's in a
workspace when it's not" errors. Fixed by adding vendor directory
exclusions to root Cargo.toml.
2. Missing Go vendoring: Script only searched for Cargo.lock files,
never processing go.mod files despite having a case statement for
them. Fixed by adding go.mod to the find command with '-o -name go.mod'.
3. Wrong tar execution directory: Script ran tar from release/ directory
but vendor_dir_list contained paths relative to repo root (./vendor,
./src/agent/vendor, etc.), causing "Cannot stat" errors. Fixed by
moving tar command before final popd.
4. Relative tarball path: Since tar now runs from repo root, converted
tarball path to absolute to ensure it's created in the release
directory.
5. Vendored go.mod pollution: Added '-path ./vendor -prune' to find
command to exclude vendor directory, preventing the script from
finding go.mod files inside vendored Rust dependencies.
The fixes are simple enough they can be squashed into a single
commit.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
While the config knob is being parsed, it is being unused in the
rust shim. This renders the config knob useless. Remove the
file_mem_backend config option as there is no current users for it.
As this option is being usable in the go shim, we leave it intact.
For the rust shim, /dev/shm is still being used in a similar way to
the go shim when filesystem sharing is enabled (virtio-fs). Future
use cases where other file_mem_backends are being utilized are
currently planning to define these backends in a similar manner:
based on the configuration/platform, determine the proper file
memory backend, but do not let end users determine the file memory
backend.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Increase memory request/limit values used by k8s memory and QoS
integration workloads so SNP/TDX static-sized sandboxes boot reliably
under the new sizing defaults.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The kata-monitor negative test creates a non-kata pod and asserts it does
not appear in the kata-monitor cache (built from /run/vc/sbs, where only
kata sandboxes register).
However, the workload was started without a runtime handler, so it used
containerd's default runtime, which in the CI containerd config is set
to kata, so the "runc" pod was actually launched as a kata sandbox,
registered under /run/vc/sbs, and tripped the assertion ("cache: got
runc pod ...").
Start the workload with an explicit runc handler (configurable via
RUNC_RUNTIME) so it is a genuine runc sandbox that never touches
/run/vc/sbs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit is to enable qemu-runtime-rs/clh-runtime-rs and make it
compatiable with qemu-runtime-rs and clh-runtime-rs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add comprehensive documentation for using virtio-fs-nydus shared
filesystem with Kata Containers. This guide covers:
(1) Clarify configuration options for virtio-fs-nydus and nydus image
preparation and usage.
(2) Update daemon configuration and lifecycle management and introduce
standalone, inline nydus architecture.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce `ShareVirtioFsNydus` to enable standalone Nydus rootfs
support. This implementation acts as the bridge between runtime-rs
and the external `nydusd` daemon.
Key Capabilities:
(1) Trait Implementation: Implements `ShareFs` (for VM device/storage) and
`NydusShareFs` (for RAFS lifecycle) traits.
(2) Daemon Lifecycle Management: Handles `nydusd` spawning, supervision,
and graceful shutdown.
(3) Native Overlay Support: Configures `nydusd` with `passthrough_fs`
backend to provide native overlay (upperdir/workdir) support.
(4) API Integration: Utilizes `NydusClient` for granular control over RAFS
mount/umount operations.
(5) QEMU Integration: Enables `virtio-fs-nydus` device support,
facilitating standalone mode execution.
This implementation allows Kata containers to utilize an external `nydusd`
process for Nydus rootfs management, providing a cleaner separation between
the runtime and the Nydus daemon lifecycle.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the `ShareFs` trait to improve modularity and support
standalone Nydus mode:
(1) Added `stop()` method to manage daemon teardown.
(2) Introduced a dedicated trait for Nydus-specific data-plane
operations.
This refactoring cleans up the `ShareFs` trait by consolidating
daemon lifecycle handling and isolating Nydus-specific extensions,
paving the way for cleaner standalone Nydus implementation.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Implement NydusClient to interact with nydusd daemon via Unix
socket:
(1) check_status: query daemon state via GET /api/v1/daemon.
(2) mount/umount: manage filesystem mounts via POST/DELETE
/api/v1/mount.
(3) wait_until_ready: poll daemon until RUNNING state.
This provides a lightweight, stateless HTTP client layer for nydusd
API.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In standalone nydusd mode with virtio-fs passthrough, the guest-side
mkdir may fail with ENOSYS. Update the overlayfs storage handler to
skip directory creation when the directory already exists, logging a
warning instead of failing.
This ensures container rootfs setup succeeds when nydusd's native
overlay manages the directory structure.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When using virtio-fs with nydusd's passthrough_fs, mkdir operations may
return ENOSYS on certain filesystem configurations. This causes mount
destination creation to fail unexpectedly.
Handle ENOSYS errors gracefully alongside AlreadyExists by verifying the
directory exists after the failed mkdir attempt, allowing the mount to
proceed if the directory is already present.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add "virtio-fs-nydus" as a recognized shared filesystem type in the
hypervisor configuration. This enables the standalone nydusd mode where
nydusd runs as a separate process alongside virtiofsd.
The key changes:
(1) Add VIRTIO_FS_NYDUS constant for the new shared fs type.
(2) Register virtio-fs-nydus in adjust() and validate() paths, reusing
the same virtio-fs validation logic since both use vhost-user protocol
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As independent iothreads can work in both virtio-scsi and virtio-blk
devices, this commit aims to enable such feature in virtio-blk-pci
devices.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
1. Determine iothread for virtio-blk devices, only attach iothread
when:
(1) enable_iothreads is true
(2) indep_iothreads > 0
(3) block driver is not virtio-scsi (i.e., it's
virtio-blk)
And for more complex cases, some enhancements will be done in future
2. Add iothread parameter for virtio-blk devices if specified.
If iothreads set and passed, we will have to set it correctly for
virtio-blk devices via qmp with device_add arguments.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To make it work well for independent IO threads for virtio-blk devices.
A new method for independent IO threads for virtio-blk hotplug devices
within qemu command line.
Note that as ObjectIoThread has been done for days, it can be directly
reused in this case.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To make it more flexible when users want to set this feature, one
more way to make it valid is via annotations.
The dedicated annnotation of
"io.katacontainers.config.hypervisor.indep_iothreads" is introduced
within k8s clusters.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
It's useful and helpful to set indep_iothreads with enable_iothreads
for high IO performance. And we need provide an entry for people to
set it if needed.
This commit will introduce two configurable items:
- Makefile: DEFINDEPIOTHREADS when make build.
- configurations: indep_iothreads for people to set.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The 'indep_iothreads' field is introduced in Hypervisor to make it
configurable for number of independent IO threads for virtio-blk
devices. When set to a value greater than 0, creates independent
IO threads that can be attached to virtio-blk devices during hotplug.
Note that it requires 'enable_iothreads' to be true for virtio-blk
devices to use these threads.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add update_guest_filesystem_metrics() that collects disk space usage
(total/used/available) for all read-write mounted filesystems inside
the guest VM. This enables monitoring guest disk usage in kata/coco
pod through the existing GetMetrics RPC.
And its output metrics looks like as below:
- kata_guest_filesystem_bytes{mount="/",device="vda",item="total|used|available"}
- kata_guest_filesystem_inodes{mount="/",device="vda",item="total|used|available"}
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add two new GaugeVec metrics to expose guest filesystem space usage:
(1) kata_guest_filesystem_bytes{mount, device, item}: space in bytes
(total/used/available)
(2) kata_guest_filesystem_inodes{mount, device, item}: inode counts
(total/used/available)
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In #13147, for some reason a test block was added in the middle of code
and the code was stale when merged, which meant that a second
`mod test` section was added, breaking our tests. Merge the two
to fix this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
publish-kata-deploy-payload got renamed in #13107, which broke the CI.
Now, instead of tracking all those intermediate steps, let's make sure
we only track the tests themselves.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Run qemu-coco-dev-runtime-rs k8s test workflow on the zVSI
only during nightly builds.
Changes:
- Modified run-k8s-tests-on-zvsi.yaml to accept vmm as workflow
inputs instead of hardcoded matrix values
- run-k8s-tests-on-zvsi passes a conditional vmm value; 4 vmms
for nightly/dev builds and 3 vmms for all other PRs.
This ensures qemu-coco-dev-runtime-rs is only tested with nydus
snapshotter during nightly CI runs, reducing PR test time while
maintaining comprehensive nightly coverage.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add qemu-coco-dev-runtime-rs to the VMM matrix in the zVSI K8S
test workflow, configured to run only with the nydus snapshotter.
Changes:
- Add qemu-coco-dev-runtime-rs to the vmm matrix
- Exclude overlayfs + qemu-coco-dev-runtime-rs combination
- Exclude devmapper + qemu-coco-dev-runtime-rs combination
- Update CoCo-related conditional steps to include the new VMM:
* KBS environment variable setup
* kbs-client uninstall/install steps
* CoCo KBS deployment
This ensures qemu-coco-dev-runtime-rs is only tested with nydus
snapshotter, while maintaining existing test configurations for
other VMMs.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>