Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
The rust shim parses the file_mem_backend config knob but never uses
it, which renders the knob useless. Remove the file_mem_backend
config option from the rust shim as it has no current users there.
As this option is usable in the go shim, we leave it intact there.
For the rust shim, /dev/shm is still used in a similar way to
the go shim when filesystem sharing is enabled (virtio-fs). Future
use cases that need other file memory backends are currently planned
to define them in a similar manner: determine the proper file memory
backend based on the configuration/platform, but do not let end users
choose it.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Introduce `ShareVirtioFsNydus` to enable standalone Nydus rootfs
support. This implementation acts as the bridge between runtime-rs
and the external `nydusd` daemon.
Key Capabilities:
(1) Trait Implementation: Implements `ShareFs` (for VM device/storage) and
`NydusShareFs` (for RAFS lifecycle) traits.
(2) Daemon Lifecycle Management: Handles `nydusd` spawning, supervision,
and graceful shutdown.
(3) Native Overlay Support: Configures `nydusd` with `passthrough_fs`
backend to provide native overlay (upperdir/workdir) support.
(4) API Integration: Utilizes `NydusClient` for granular control over RAFS
mount/umount operations.
(5) QEMU Integration: Enables `virtio-fs-nydus` device support,
facilitating standalone mode execution.
This implementation allows Kata containers to utilize an external `nydusd`
process for Nydus rootfs management, providing a cleaner separation between
the runtime and the Nydus daemon lifecycle.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Refactor the `ShareFs` trait to improve modularity and support
standalone Nydus mode:
(1) Added `stop()` method to manage daemon teardown.
(2) Introduced a dedicated trait for Nydus-specific data-plane
operations.
This refactoring cleans up the `ShareFs` trait by consolidating
daemon lifecycle handling and isolating Nydus-specific extensions,
paving the way for cleaner standalone Nydus implementation.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Implement NydusClient to interact with the nydusd daemon via a Unix
socket:
(1) check_status: query daemon state via GET /api/v1/daemon.
(2) mount/umount: manage filesystem mounts via POST/DELETE
/api/v1/mount.
(3) wait_until_ready: poll daemon until RUNNING state.
This provides a lightweight, stateless HTTP client layer for the
nydusd API.
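The wait_until_ready polling described above can be sketched as follows.
This is a minimal illustration, not the NydusClient code: the closure
stands in for the GET /api/v1/daemon request, and the DaemonState enum,
the function signature, and the timeout/interval parameters are all
assumptions made for the example.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Daemon states as modelled by this sketch. Only RUNNING is named in the
/// commit message; `Init` is a hypothetical stand-in for "not ready yet".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DaemonState {
    Init,
    Running,
}

/// Poll `check_status` until it reports `Running` or `timeout` elapses.
/// Returns the number of attempts that were needed.
pub fn wait_until_ready<F>(
    mut check_status: F,
    timeout: Duration,
    interval: Duration,
) -> Result<u32, String>
where
    F: FnMut() -> Result<DaemonState, String>,
{
    let deadline = Instant::now() + timeout;
    let mut attempts = 0u32;
    loop {
        attempts += 1;
        // A failed query is treated as "not ready yet": the API socket may
        // not exist until the daemon has finished starting.
        if let Ok(DaemonState::Running) = check_status() {
            return Ok(attempts);
        }
        if Instant::now() >= deadline {
            return Err(format!("daemon not RUNNING after {attempts} attempts"));
        }
        sleep(interval);
    }
}
```

Treating query errors as retryable keeps the caller simple, at the cost
of only reporting a generic timeout when the daemon never comes up.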
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In standalone nydusd mode with virtio-fs passthrough, the guest-side
mkdir may fail with ENOSYS. Update the overlayfs storage handler to
skip directory creation when the directory already exists, logging a
warning instead of failing.
This ensures container rootfs setup succeeds when nydusd's native
overlay manages the directory structure.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When using virtio-fs with nydusd's passthrough_fs, mkdir operations may
return ENOSYS on certain filesystem configurations. This causes mount
destination creation to fail unexpectedly.
Handle ENOSYS errors gracefully alongside AlreadyExists by verifying the
directory exists after the failed mkdir attempt, allowing the mount to
proceed if the directory is already present.
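A minimal sketch of this fallback, using only the Rust standard library.
The function names are hypothetical and the ENOSYS value is hard-coded
to the Linux errno so the example needs no external crate; the actual
handler may be structured differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// ENOSYS on Linux (assumption: hard-coded to avoid depending on libc).
const ENOSYS: i32 = 38;

/// Errors after which it is worth checking whether the directory exists.
pub fn is_tolerable_mkdir_error(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::AlreadyExists || err.raw_os_error() == Some(ENOSYS)
}

/// Create `path`, accepting a tolerable mkdir failure only if the
/// directory is actually present afterwards.
pub fn ensure_dir(path: &Path) -> io::Result<()> {
    match fs::create_dir_all(path) {
        Ok(()) => Ok(()),
        // mkdir failed, but the directory is there: let the mount proceed.
        Err(e) if is_tolerable_mkdir_error(&e) && path.is_dir() => Ok(()),
        // Any other error, or a missing directory, is still a failure.
        Err(e) => Err(e),
    }
}
```

The is_dir() check matters: ENOSYS alone does not prove the directory
exists, so the original error is returned when it is absent.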
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add "virtio-fs-nydus" as a recognized shared filesystem type in the
hypervisor configuration. This enables the standalone nydusd mode where
nydusd runs as a separate process alongside virtiofsd.
The key changes:
(1) Add VIRTIO_FS_NYDUS constant for the new shared fs type.
(2) Register virtio-fs-nydus in adjust() and validate() paths, reusing
the same virtio-fs validation logic since both use the vhost-user protocol.
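As an illustration of how the new type can share the existing
validation path, here is a small sketch. Only the VIRTIO_FS_NYDUS
constant name and its string value come from this commit; the VIRTIO_FS
constant and the helper function are assumptions for the example.

```rust
/// Shared fs type strings. VIRTIO_FS is assumed to already exist.
pub const VIRTIO_FS: &str = "virtio-fs";
pub const VIRTIO_FS_NYDUS: &str = "virtio-fs-nydus";

/// Hypothetical helper: true when the shared fs type should go through
/// the virtio-fs adjust()/validate() logic.
pub fn uses_virtio_fs_validation(shared_fs: &str) -> bool {
    matches!(shared_fs, VIRTIO_FS | VIRTIO_FS_NYDUS)
}
```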
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Independent iothreads work with both virtio-scsi and virtio-blk
devices, so this commit enables the feature for virtio-blk-pci
devices.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
1. Determine the iothread for virtio-blk devices. Only attach an
iothread when:
(1) enable_iothreads is true
(2) indep_iothreads > 0
(3) the block driver is not virtio-scsi (i.e., it is virtio-blk)
More complex cases will be handled by future enhancements.
2. Add the iothread parameter for virtio-blk devices if specified.
If an iothread is set and passed, it must be set on the virtio-blk
device via the QMP device_add arguments.
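The two steps can be sketched as below. This is an illustration under
assumptions, not the runtime-rs code: the BlockDriver enum, the function
names, and the iothread id "indep_iothread_0" are hypothetical, and the
device_add arguments are reduced to a list of key/value pairs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlockDriver {
    VirtioScsi,
    VirtioBlkPci,
}

/// Step 1: pick an iothread only when all three conditions hold.
pub fn iothread_for_blk(
    enable_iothreads: bool,
    indep_iothreads: u32,
    driver: BlockDriver,
) -> Option<String> {
    if enable_iothreads && indep_iothreads > 0 && driver != BlockDriver::VirtioScsi {
        // Hypothetical id; the real naming scheme is not given in the commit.
        Some("indep_iothread_0".to_string())
    } else {
        None
    }
}

/// Step 2: add the `iothread` property to the device_add arguments
/// only when an iothread was selected.
pub fn device_add_args(id: &str, drive: &str, iothread: Option<&str>) -> Vec<(String, String)> {
    let mut args = vec![
        ("driver".to_string(), "virtio-blk-pci".to_string()),
        ("id".to_string(), id.to_string()),
        ("drive".to_string(), drive.to_string()),
    ];
    if let Some(thread) = iothread {
        args.push(("iothread".to_string(), thread.to_string()));
    }
    args
}
```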
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a new method that sets up independent IO threads on the qemu
command line for hotplugged virtio-blk devices.
Note that ObjectIoThread already exists, so it can be reused
directly in this case.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
To give users more flexibility in setting this feature, also allow
it to be set via annotations.
The dedicated annotation
"io.katacontainers.config.hypervisor.indep_iothreads" is introduced
for use within k8s clusters.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Setting indep_iothreads together with enable_iothreads is useful
for high IO performance, so we need to provide a way for people to
set it if needed.
This commit introduces two configurable items:
- Makefile: DEFINDEPIOTHREADS, set at build time.
- configurations: indep_iothreads for people to set.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The 'indep_iothreads' field is introduced in Hypervisor to make the
number of independent IO threads for virtio-blk devices configurable.
When set to a value greater than 0, the hypervisor creates independent
IO threads that can be attached to virtio-blk devices during hotplug.
Note that it requires 'enable_iothreads' to be true for virtio-blk
devices to use these threads.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add update_guest_filesystem_metrics() that collects disk space usage
(total/used/available) for all read-write mounted filesystems inside
the guest VM. This enables monitoring guest disk usage in kata/coco
pods through the existing GetMetrics RPC.
The output metrics look like this:
- kata_guest_filesystem_bytes{mount="/",device="vda",item="total|used|available"}
- kata_guest_filesystem_inodes{mount="/",device="vda",item="total|used|available"}
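A rough sketch of how the three byte items could be derived and
rendered. It assumes statvfs-style counters (block size, total, free,
and available blocks) and assumes that "used" means total minus free;
the commit does not spell out either. The FsStat struct and both
functions are hypothetical.

```rust
/// statvfs-style counters for one mounted filesystem (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct FsStat {
    pub block_size: u64,
    pub blocks: u64,
    pub blocks_free: u64,
    pub blocks_available: u64,
}

/// Compute the total/used/available items in bytes.
/// Assumption: used = (blocks - blocks_free) * block_size.
pub fn space_items(st: &FsStat) -> [(&'static str, u64); 3] {
    let total = st.blocks.saturating_mul(st.block_size);
    let used = st
        .blocks
        .saturating_sub(st.blocks_free)
        .saturating_mul(st.block_size);
    let available = st.blocks_available.saturating_mul(st.block_size);
    [("total", total), ("used", used), ("available", available)]
}

/// Render one text line per item with the label layout shown above.
pub fn render_bytes(mount: &str, device: &str, st: &FsStat) -> Vec<String> {
    space_items(st)
        .iter()
        .map(|(item, value)| {
            format!(
                "kata_guest_filesystem_bytes{{mount=\"{mount}\",device=\"{device}\",item=\"{item}\"}} {value}"
            )
        })
        .collect()
}
```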
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add two new GaugeVec metrics to expose guest filesystem space usage:
(1) kata_guest_filesystem_bytes{mount, device, item}: space in bytes
(total/used/available)
(2) kata_guest_filesystem_inodes{mount, device, item}: inode counts
(total/used/available)
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In #13147, for some reason a test block was added in the middle of code
and the code was stale when merged, which meant that a second
`mod test` section was added, breaking our tests. Merge the two
to fix this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Switch qemu-se config templates to use the TEE/CoCo-specific
static_sandbox_resource_mgmt defaults instead of the generic
QEMU defaults.
qemu-se-runtime-rs config now uses DEFSTATICRESOURCEMGMT_COCO
while runtime qemu-se config now uses DEFSTATICRESOURCEMGMT_TEE.
This aligns static sandbox resource management behavior with confidential
container expectations for qemu-se variants.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The dragonball nerdctl CI job can race when creating and attaching the
runtime process to the sandbox cgroup, surfacing an os error 17
(AlreadyExists) during shim task creation.
Let's retry add_proc once on this pre-existing cgroup condition so
startup remains robust.
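The retry can be sketched as a small generic helper. The helper name is
hypothetical and the closure stands in for the add_proc call; only the
"retry once on AlreadyExists" behaviour is taken from the text above.

```rust
use std::io;

/// Run `op`, and run it exactly one more time if the first attempt
/// fails with AlreadyExists (os error 17 on Linux).
pub fn retry_once_on_exists<T, F>(mut op: F) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    match op() {
        // The cgroup appeared between the check and the create: retry.
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => op(),
        // Success and all other errors are returned unchanged.
        other => other,
    }
}
```

A single retry is enough for this race because the second attempt finds
the cgroup already in place; any other error is not retried.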
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
Low-CPU sandboxes can take longer than a few seconds to complete guest
boot and start the agent.
Let's clamp the reconnect timeout to a safe minimum so sandbox startup
does not fail early with transient vsock ECONNRESET.
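Clamping to a minimum amounts to taking the larger of the configured
value and a floor. In this sketch the 30 second floor and both names
are assumptions; the commit does not state the actual minimum.

```rust
use std::time::Duration;

/// Hypothetical floor; the real minimum is not given in the commit.
pub const MIN_RECONNECT_TIMEOUT: Duration = Duration::from_secs(30);

/// Never go below the floor, but keep larger configured values as-is.
pub fn effective_reconnect_timeout(configured: Duration) -> Duration {
    configured.max(MIN_RECONNECT_TIMEOUT)
}
```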
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
When static sandbox resource management is enabled, CRI CPU/memory
sizing may live only in sandbox annotations and be missing from the OCI
spec.
Let's fill missing sizing fields from annotations before applying static
VM sizing so runtime-rs follows the expected Kubernetes behavior for
constrained pods.
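A sketch of the fill-in step, where values already present in the spec
win and annotations only supply what is missing. The Sizing struct and
function names are hypothetical, and the annotation keys are assumed to
be the containerd CRI sandbox sizing annotations; they should be checked
against the runtime before relying on them.

```rust
use std::collections::HashMap;
use std::str::FromStr;

// Assumed CRI sandbox sizing annotation keys.
pub const KEY_CPU_QUOTA: &str = "io.kubernetes.cri.sandbox-cpu-quota";
pub const KEY_CPU_PERIOD: &str = "io.kubernetes.cri.sandbox-cpu-period";
pub const KEY_MEMORY: &str = "io.kubernetes.cri.sandbox-memory";

/// Sizing fields as they might appear in the OCI spec (hypothetical type).
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Sizing {
    pub cpu_quota: Option<i64>,
    pub cpu_period: Option<u64>,
    pub memory_limit: Option<i64>,
}

/// Parse one annotation value, ignoring missing or malformed entries.
fn parse<T: FromStr>(annotations: &HashMap<String, String>, key: &str) -> Option<T> {
    annotations.get(key).and_then(|v| v.parse().ok())
}

/// Fill only the fields the spec left empty.
pub fn fill_from_annotations(spec: &mut Sizing, annotations: &HashMap<String, String>) {
    if spec.cpu_quota.is_none() {
        spec.cpu_quota = parse(annotations, KEY_CPU_QUOTA);
    }
    if spec.cpu_period.is_none() {
        spec.cpu_period = parse(annotations, KEY_CPU_PERIOD);
    }
    if spec.memory_limit.is_none() {
        spec.memory_limit = parse(annotations, KEY_MEMORY);
    }
}
```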
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Codex <codex@openai.com>
Add top-level runtime-rs Makefile options `DEFSANDBOXCGROUP_ONLY` and
`DEFSTATICRESOURCEMGMT`, both defaulting to true, and use them for the
runtime defaults that previously disabled these paths.
This aligns runtime-rs defaults with static sandbox resource management,
which sizes sandbox memory up front instead of relying on memory hotplug,
helping avoid architecture-specific hotplug limitations.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Migrate trace-forwarder from the deprecated opentelemetry-jaeger
exporter to the modern opentelemetry-otlp exporter.
This change remediates GHSA-2f9f-gq7v-9h6m (CVE-2026-43868), a
medium-severity vulnerability in Apache Thrift. The opentelemetry-jaeger
crate is no longer maintained and depends on vulnerable thrift versions
(0.13.0 and 0.16.0). The opentelemetry-otlp exporter does not use thrift
and is actively maintained.
Changes:
- Replace opentelemetry-jaeger with opentelemetry-otlp in Cargo.toml
- Update tracer.rs to use OTLP exporter instead of Jaeger exporter
- Replace --jaeger-host/--jaeger-port flags with --otlp-endpoint flag
- Update server.rs to use TracerProvider instead of SpanExporter
- Update documentation to reflect OTLP migration
- Add examples for common OTLP-compatible collectors
Breaking change: Users must update their trace-forwarder invocations
to use --otlp-endpoint instead of --jaeger-host and --jaeger-port.
Default endpoint: http://localhost:4317 (OTLP gRPC)
Generated-by: IBM Bob
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
When CreateContainer fails before the runtime instance is registered
(e.g. a hypervisor/cgroup error), no sandbox exists to drive the normal
teardown. containerd's follow-up Shutdown RPC then reaches
get_runtime_instance(), fails with "runtime not ready", and returns
before the service loop is ever told to stop. Because the shim ignores
SIGTERM, the containerd-shim-kata-v2 daemon is left running and orphaned.
Make the Shutdown RPC force the daemon to exit when there is no runtime
instance, emitting the same Action::Shutdown that sandbox.shutdown()
sends on the normal path. This guarantees the shim process is reaped
after a failed create instead of leaking.
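The control flow can be modelled with a channel standing in for the
service loop. Everything here is a simplified stand-in: the Sandbox
type, the shutdown_rpc function, and the use of std::sync::mpsc are
assumptions for illustration; only Action::Shutdown and the "no runtime
instance" case come from the description above.

```rust
use std::sync::mpsc::{SendError, Sender};

#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Shutdown,
}

/// Stand-in for a registered sandbox.
pub struct Sandbox;

impl Sandbox {
    /// Normal path: sandbox shutdown tells the service loop to stop.
    pub fn shutdown(&self, service_loop: &Sender<Action>) -> Result<(), SendError<Action>> {
        service_loop.send(Action::Shutdown)
    }
}

/// Shutdown RPC: with no runtime instance, emit the same action directly
/// instead of returning a "runtime not ready" error.
pub fn shutdown_rpc(
    instance: Option<&Sandbox>,
    service_loop: &Sender<Action>,
) -> Result<(), SendError<Action>> {
    match instance {
        Some(sandbox) => sandbox.shutdown(service_loop),
        None => service_loop.send(Action::Shutdown),
    }
}
```

Either way the service loop receives exactly one Action::Shutdown, so
the daemon exits whether or not a sandbox was ever registered.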
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <noreply@cursor.com>
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.
Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.
Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.
Change runtime-rs code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.
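As a simplified illustration, the value passed to QEMU's -smp option
could be built like this. The function name and parameters are
hypothetical, and the real command line carries more topology fields
than shown; the sketch only demonstrates dropping maxcpus when a
confidential guest makes CPU hotplug unnecessary.

```rust
/// Build the value for `-smp`. `maxcpus` is only emitted when CPU
/// hotplug may be used, i.e. when the guest is not confidential.
pub fn smp_arg(vcpus: u32, max_vcpus: u32, confidential_guest: bool) -> String {
    if confidential_guest {
        format!("{vcpus}")
    } else {
        format!("{vcpus},maxcpus={max_vcpus}")
    }
}
```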
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Make enable_debug promote the effective component log level from the
default info level to debug for runtime, agent, and hypervisor logs.
Keep an explicit log_level value authoritative so users can still choose
trace, warn, or another level.
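The precedence can be sketched as below. This assumes an unset
log_level is represented as None so that it can be told apart from an
explicit "info"; the function name and that representation are
assumptions, not the actual config code.

```rust
/// Resolve the effective log level for one component.
/// An explicit `log_level` always wins; otherwise `enable_debug`
/// promotes the default from "info" to "debug".
pub fn effective_log_level(log_level: Option<&str>, enable_debug: bool) -> String {
    match log_level {
        Some(level) => level.to_string(),
        None if enable_debug => "debug".to_string(),
        None => "info".to_string(),
    }
}
```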
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
Run cargo fmt on runtime-rs to ensure consistent formatting
with Rust 1.94 toolchain.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
Rust 1.94 now warns about unnecessary unsafe blocks around
__get_cpuid_max(), __cpuid_count(), and host_cpuid() calls.
Remove the unsafe blocks as they are no longer needed.
This fixes the following clippy warnings in dbs-arch:
- warning: unnecessary `unsafe` block at brand_string.rs:106
- warning: unnecessary `unsafe` block at brand_string.rs:114
- warning: unnecessary `unsafe` block at common.rs:28
- warning: unnecessary `unsafe` block at common.rs:36
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob