Add handling for multi-layer EROFS rootfs in RootFsResource
handler_rootfs method. It will correctly handle the multi-layers
erofs rootfs.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add erofs_rootfs.rs implementing ErofsMultiLayerRootfs for
multi-layer EROFS rootfs with VMDK descriptor generation.
It's the core implementation of Erofs rootfs within runtime.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Change Rootfs::get_storage to return Option<Vec<Storage>>
to support multi-layer rootfs with multiple storages.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add format argument to hotplug_block_device for flexibly specifying
different block formats.
With this, we can support kinds of formats, currently raw and vmdk are
supported, and some other formats will be supported in future.
Aside the formats, the corresponding handling logics are also required
to properly handle its options needed in QMP blockdev-add.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
In practice, we need more kinds of block formats, not limited to `Raw`.
This commit aims to add BlockDeviceFormat enum for kinds of block device
formats support, like RAW, VMDK, etc. And it will do some following actions
to make this changes work well, including format field in BlockConfig.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add RUNTIME_ALLOW_MOUNTS annotation to RuntimeInfo to specify
custom mount types allowed by the runtime.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The Go runtime's CoCo dev config uses dial_timeout = 45s, but all
runtime-rs confidential VM configs had reconnect_timeout_ms set to
3000ms (3s) or 5000ms (SE). This is too short for confidential VMs,
especially on arm64 where UEFI firmware (AAVMF) adds significant
boot time on top of the measured boot process, causing ECONNRESET
errors on the vsock connection before the agent is ready.
Bump reconnect_timeout_ms to 45000ms across all confidential VM
configs (coco-dev, SNP, TDX, SE) to match the Go runtime.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add kernel_verity_params to the qemu-coco-dev-runtime-rs configuration
so the runtime can assemble dm-verity kernel parameters, and remove the
test skip that was disabling measured rootfs tests for this hypervisor.
Fixes: #12851
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add runtime-rs support for the GetDiagnosticData RPC. This extends
the Agent trait, types, and protocol translation layer with the new
request/response types.
During container stop, when shared_fs is "none" and the
terminationMessagePolicy annotation is "File", the runtime copies
the termination log from the guest via GetDiagnosticData. The call
is best-effort to avoid blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:
block_device_logical_sector_size (config file)
block_device_physical_sector_size (config file)
io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation)
io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)
The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.
Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.
Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.
Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Follow-on to kata-containers/kata-containers#12396
Switch SNP config from initrd-based to image-based rootfs with
dm-verity. The runtime assembles the dm-mod.create kernel cmdline
from kernel_verity_params, and with kernel-hashes=on the root hash
is included in the SNP launch measurement.
Also add qemu-snp to the measured rootfs integration test.
Signed-off-by: Amanda Liem <aliem@amd.com>
runtime-rs memory hotplug hard-codes the `pc-dimm` device driver, which
is an x86-only QEMU device model. On s390x, the `s390-ccw-virtio`
machine type does not support `pc-dimm` at all — the Go runtime handles
this by using `virtio-mem-ccw` instead (controlled by the
`enable_virtio_mem` config knob, defaulting to true on s390x).
runtime-rs has no virtio-mem support, so any attempt to dynamically
hotplug memory on s390x fails with:
'pc-dimm' is not a valid device model name
This is a pre-existing limitation on main — it has never worked. It is
now visible because commit 45dfb6ff25 ("runtime-rs: Fix initial vCPU /
memory with static_sandbox_resource_mgmt") expanded runtime-rs test
coverage, causing k8s-memory.bats and k8s-oom.bats to actually exercise
this code path on s390x.
Let's enforce using static_sandbox_resources_mgmt also for s390x so the
VM is sized upfront at creation time, bypassing the broken dynamic
hotplug path entirely.
If someone decides to implement hotplug support for s390x, the work
would basically be an implemntation of virtio-mem-ccw support in the
runtime-rs QEMU backend (boot-time device creation, qom-set based
resize, and virtio-mem aware memory accounting), mirroring what the Go
runtime already does, but I'm not game for this (sorry).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
runtime-rs lacks several features needed for CPU hotplug on ARM:
pflash/UEFI firmware passthrough, SMP topology in -smp, nr_cpus
kernel parameter, and QMP vCPU add handling for the virt machine
type (which requires core-id only placement with socket/thread/die
set to -1).
Without static sandbox resource management, these gaps cause
failures in tests like k8s-memory.bats where the VM is not correctly
sized for the workload.
Enable static_sandbox_resource_mgmt for aarch64 in the QEMU
runtime-rs configuration so the VM is pre-sized at creation time,
sidestepping the need for hotplug entirely.
Together with this we're aligning the go runtime to the very same
behaviour.
Fixes: #10928
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
InitialSizeManager::setup_config() is responsible for applying the
sandbox workload sizing (computed from containerd/CRI-O sandbox
annotations) to the hypervisor configuration before VM creation.
Previously, the workload vCPU count was only logged but never actually
added to default_vcpus, so the VM was always created with only the base
vCPUs from the configuration/annotations. This caused the
k8s-sandbox-vcpus-allocation test to fail with qemu-snp-runtime-rs:
a pod with default_vcpus=0.75 and a container CPU limit of 1.2 should
see ceil(0.75 + 1.2) = 2 vCPUs, but only got 1.
Additionally, the workload memory was being added to default_memory
unconditionally, diverging from the Go runtime which only applies both
CPU and memory additions when static_sandbox_resource_mgmt is enabled.
In the non-static path, adding workload resources here would cause
double-counting: once from setup_config() at sandbox creation, and
again from update_cpu_resources()/update_mem_resources() when
individual containers are added.
Guard both additions behind static_sandbox_resource_mgmt, matching the
Go runtime's behavior in src/runtime/pkg/oci/utils.go:
if sandboxConfig.StaticResourceMgmt {
sandboxConfig.HypervisorConfig.NumVCPUsF += sandboxConfig.SandboxResources.WorkloadCPUs
sandboxConfig.HypervisorConfig.MemorySize += sandboxConfig.SandboxResources.WorkloadMemMB
}
Fixes: k8s-sandbox-vcpus-allocation test failure on qemu-snp-runtime-rs
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
As virtio-9p is never supported in runtime-rs, we have more choices to
replace it with blockfile snapshotter or erofs snapshotter(in future).
It's time to remove its documents and reduce misleading guidance.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The create_container_timeout key was placed after the
[agent.@PROJECT_TYPE@.mem_agent] TOML section header, which meant
TOML parsed it as a field of mem_agent rather than of the parent
agent table. This was silently ignored before, but now that
MemAgent has #[serde(deny_unknown_fields)] it causes a parse error.
Move the key above the [mem_agent] section so it belongs to the
correct [agent.@PROJECT_TYPE@] table.
Also fix configuration-qemu-coco-dev which had a duplicate entry:
keep only the correctly placed one with the COCO timeout value.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
..where possible. Failing on unknown fields makes migration easier,
as we do not silently ignore configuration options that previously
worked in runtime-go. However, serde can't deny unknown fields
where flatten is used, so this can't be used everywhere sadly.
There were also errors in test fixtures that were unnoticed.
These are fixed here, too.
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit aims to add support SEV-SNP host data within
cloud-hypervisor, which will help pass down the initdata coming
from upper layer.
In SEV-SNP cases, with it enabled, the Cloud-hypervisor will receive
host-data string, otherwise it's set default None.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This commit completes the integration of the protection device
configuration into the Cloud Hypervisor VM boot process.
Previously, the `protection_device` returned by `get_shared_devices()`
was ignored. This change now correctly extracts the `protection_device`
and passes it to the `VmConfig` creation during `boot_vm`.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This commit extends the Cloud Hypervisor integration within `runtime-rs`
to correctly process and extract CoCo related device configurations.
Specifically, it includes:
- Adds handling for `DeviceType::Protection` to `add_device`, allowing
these devices to be queued for processing.
- Retrieve the pending device list, identify and extract `ProtectionDevConfig`
and implements the logic to handle the `mrconfigid` from protection device.
This ensures that initdata related prameters are properly consumed by the
runtime and made it available for Cloud Hypervisor to utilize during the setup
of CVM.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This commit implements the conversion logic for the newly introduced
CoCo/Initdata configurations, ensuring that `ProtectionDevConfig` data,
particularly the `mrconfigid`, is correctly passed. And the main Changes
are as below:
- Importing `ProtectionDevConfig` for use in conversion logic.
- Extracting `protection_device` from `NamedHypervisorConfig` and
passing it during the conversion to `VmConfig`.
- Updating the `TryFrom` implementation for `PayloadConfig` to accept
the `ProtectionDevConfig` and use its `mrconfigid` to populate the
`PayloadConfig`'s `mrconfigid` field.
- Adjusting unit tests to reflect the updated `PayloadConfig` conversion
signature and `mrconfigid` handling.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduces `ProtectionDevConfig` and `mrconfigid` fields to support
CoCo/InitData feature in Cloud Hypervisor.
This change enables the runtime-rs/cloud-hypervisor to:
- Configure and manage 'protection devices' associated with hypervisor
instances (Specially when the CVM starts).
- Include 'MRCONFIGID' in the payload configuration, allowing integrity
verification via initdata of the VM's payload within TEEs.
This is a foundational step towards robust CVM support, enhancing the
trustworthiness and integrity verification with initdata inside TEE platforms.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When TDX confidential guest support is enabled, set `kernel_irqchip=split`
for TDX CVM:
...
-machine \
q35,accel=kvm,kernel_irqchip=split,confidential-guest-support=tdx \
...
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
There's a typo in the error message which gets prompted when an
unsupported share_fs was configured. Fixed shred -> shared.
Signed-off-by: Yuting Nie <yuting.nie@spacemit.com>
This fix applies the config file value as a fallback when block_device_cache_direct annotation is not explicitly set on the pod.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
A FC update caused bad requests for the runtime-rs runtime when
specifying the vcpu count and block rate limiter fields.
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
While attaching the tap device, it fails on ppc64le with EBADF
"cannot create tap device. File descriptor in bad state (os error 77)\"): unknown”
Refactor the ioctl call to use the standard libc::TUNSETIFF constant.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
After the qemu VM is booted, while storing the guest details,
it fails to set capabilities as it is not yet implemented
for QEMU, this change adds a default implementation for it.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
Some fields were misspelled, misplaced at an outdated path or copied
over from runtime-go but aren't supported in runtime-rs.
Cleaning them up to avoid confusion and ease migration.
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Add comprehensive hypervisor support table (Dragonball, QEMU,
Cloud Hypervisor, Firecracker, Remote). Document all runtime handlers
(VirtContainer, LinuxContainer, WasmContainer) and resource types.
List all configuration files including CoCo variants (TDX, SNP, SE).
Add shim-ctl crate to crates table for development tooling reference.
Add Feature Flags section documenting dragonball and cloud-hypervisor
options.
Simplify and restructure content for clarity while preserving technical
accuracy.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Corrects the typo 'BUILDIN' to the standard 'BUILTIN' across the
codebase to improve code quality and documentation consistency.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>