Follow-on to kata-containers/kata-containers#12396
Switch SNP config from initrd-based to image-based rootfs with
dm-verity. The runtime assembles the dm-mod.create kernel cmdline
from kernel_verity_params, and with kernel-hashes=on the root hash
is included in the SNP launch measurement.
Also add qemu-snp to the measured rootfs integration test.
Signed-off-by: Amanda Liem <aliem@amd.com>
runtime-rs memory hotplug hard-codes the `pc-dimm` device driver, which
is an x86-only QEMU device model. On s390x, the `s390-ccw-virtio`
machine type does not support `pc-dimm` at all — the Go runtime handles
this by using `virtio-mem-ccw` instead (controlled by the
`enable_virtio_mem` config knob, defaulting to true on s390x).
runtime-rs has no virtio-mem support, so any attempt to dynamically
hotplug memory on s390x fails with:
'pc-dimm' is not a valid device model name
This is a pre-existing limitation on main — it has never worked. It is
now visible because commit 45dfb6ff25 ("runtime-rs: Fix initial vCPU /
memory with static_sandbox_resource_mgmt") expanded runtime-rs test
coverage, causing k8s-memory.bats and k8s-oom.bats to actually exercise
this code path on s390x.
Let's enforce using static_sandbox_resources_mgmt also for s390x so the
VM is sized upfront at creation time, bypassing the broken dynamic
hotplug path entirely.
If someone decides to implement hotplug support for s390x, the work
would basically be an implemntation of virtio-mem-ccw support in the
runtime-rs QEMU backend (boot-time device creation, qom-set based
resize, and virtio-mem aware memory accounting), mirroring what the Go
runtime already does, but I'm not game for this (sorry).
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
runtime-rs lacks several features needed for CPU hotplug on ARM:
pflash/UEFI firmware passthrough, SMP topology in -smp, nr_cpus
kernel parameter, and QMP vCPU add handling for the virt machine
type (which requires core-id only placement with socket/thread/die
set to -1).
Without static sandbox resource management, these gaps cause
failures in tests like k8s-memory.bats where the VM is not correctly
sized for the workload.
Enable static_sandbox_resource_mgmt for aarch64 in the QEMU
runtime-rs configuration so the VM is pre-sized at creation time,
sidestepping the need for hotplug entirely.
Together with this we're aligning the go runtime to the very same
behaviour.
Fixes: #10928
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
InitialSizeManager::setup_config() is responsible for applying the
sandbox workload sizing (computed from containerd/CRI-O sandbox
annotations) to the hypervisor configuration before VM creation.
Previously, the workload vCPU count was only logged but never actually
added to default_vcpus, so the VM was always created with only the base
vCPUs from the configuration/annotations. This caused the
k8s-sandbox-vcpus-allocation test to fail with qemu-snp-runtime-rs:
a pod with default_vcpus=0.75 and a container CPU limit of 1.2 should
see ceil(0.75 + 1.2) = 2 vCPUs, but only got 1.
Additionally, the workload memory was being added to default_memory
unconditionally, diverging from the Go runtime which only applies both
CPU and memory additions when static_sandbox_resource_mgmt is enabled.
In the non-static path, adding workload resources here would cause
double-counting: once from setup_config() at sandbox creation, and
again from update_cpu_resources()/update_mem_resources() when
individual containers are added.
Guard both additions behind static_sandbox_resource_mgmt, matching the
Go runtime's behavior in src/runtime/pkg/oci/utils.go:
if sandboxConfig.StaticResourceMgmt {
sandboxConfig.HypervisorConfig.NumVCPUsF += sandboxConfig.SandboxResources.WorkloadCPUs
sandboxConfig.HypervisorConfig.MemorySize += sandboxConfig.SandboxResources.WorkloadMemMB
}
Fixes: k8s-sandbox-vcpus-allocation test failure on qemu-snp-runtime-rs
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
As virtio-9p is never supported in runtime-rs, we have more choices to
replace it with blockfile snapshotter or erofs snapshotter(in future).
It's time to remove its documents and reduce misleading guidance.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The create_container_timeout key was placed after the
[agent.@PROJECT_TYPE@.mem_agent] TOML section header, which meant
TOML parsed it as a field of mem_agent rather than of the parent
agent table. This was silently ignored before, but now that
MemAgent has #[serde(deny_unknown_fields)] it causes a parse error.
Move the key above the [mem_agent] section so it belongs to the
correct [agent.@PROJECT_TYPE@] table.
Also fix configuration-qemu-coco-dev which had a duplicate entry:
keep only the correctly placed one with the COCO timeout value.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
..where possible. Failing on unknown fields makes migration easier,
as we do not silently ignore configuration options that previously
worked in runtime-go. However, serde can't deny unknown fields
where flatten is used, so this can't be used everywhere sadly.
There were also errors in test fixtures that were unnoticed.
These are fixed here, too.
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When TDX confidential guest support is enabled, set `kernel_irqchip=split`
for TDX CVM:
...
-machine \
q35,accel=kvm,kernel_irqchip=split,confidential-guest-support=tdx \
...
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
There's a typo in the error message which gets prompted when an
unsupported share_fs was configured. Fixed shred -> shared.
Signed-off-by: Yuting Nie <yuting.nie@spacemit.com>
This fix applies the config file value as a fallback when block_device_cache_direct annotation is not explicitly set on the pod.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
A FC update caused bad requests for the runtime-rs runtime when
specifying the vcpu count and block rate limiter fields.
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
While attaching the tap device, it fails on ppc64le with EBADF
"cannot create tap device. File descriptor in bad state (os error 77)\"): unknown”
Refactor the ioctl call to use the standard libc::TUNSETIFF constant.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
After the qemu VM is booted, while storing the guest details,
it fails to set capabilities as it is not yet implemented
for QEMU, this change adds a default implementation for it.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
Some fields were misspelled, misplaced at an outdated path or copied
over from runtime-go but aren't supported in runtime-rs.
Cleaning them up to avoid confusion and ease migration.
Signed-off-by: Paul Meyer <katexochen0@gmail.com>
Add comprehensive hypervisor support table (Dragonball, QEMU,
Cloud Hypervisor, Firecracker, Remote). Document all runtime handlers
(VirtContainer, LinuxContainer, WasmContainer) and resource types.
List all configuration files including CoCo variants (TDX, SNP, SE).
Add shim-ctl crate to crates table for development tooling reference.
Add Feature Flags section documenting dragonball and cloud-hypervisor
options.
Simplify and restructure content for clarity while preserving technical
accuracy.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Corrects the typo 'BUILDIN' to the standard 'BUILTIN' across the
codebase to improve code quality and documentation consistency.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Before this change, `make test` for runtime-rs used to test all crates
in the root workspace (due to the `--all` flag). This was not intended
but happened to be mostly working. However, genpolicy needs additional
steps before it can build, so this behavior blocks adding genpolicy to
the root workspace.
The solution here is to only build the inteded packages. For the build
and run commands, this is the runtime-rs crate itself. For testing, we
need to include the sub-crates, too, which needs a bit of cargo metadata
scraping.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
- Remove unused crates to reduce our size and the work needed
to do updates
- Also update package.metadata.cargo-machete with some crates
that are incorrectly coming up as unused
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Specify raw image format for all guest block devices.
- Attempting to auto-detect the image format from CLH would be riskier
for the Host.
- Creating a new raw image file, auto-detecting its format, and then
creating a filesystem from the Guest onto the block device is no
longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats
would fail without specifying the raw format when hot plugging its block
device.
- See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This commit enables the SEV-SNP guest policy to be explicitly
configured via the runtime configuration in runtime-rs.
To provide both ease of use and maximum flexibility, the following
logic is implemented:
1. If the user provides a custom `snp_guest_policy` in the
configuration, this value is passed directly to the QEMU SEV-SNP
guest object.
2. If the user does not specify a policy, the driver defaults to
`0x30000`, matching QEMU's standard default for SEV-SNP guests.
This enhancement allows users to fine-tune security constraints through
the policy bitmask, while ensuring a sensible and functional default
for standard SNP deployments.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
A bitmask for the SNP guest policy is introduced in ObjectSevSnpGuest
to help pass to Qemu cmdline.
And defaults to 0x30000 (QEMU's default) to maintain standard behavior
it just looks like as: "policy=0x30000"
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
As the memory related information has been serialized at the sandbox
initalization specially at the moment of parsing configuration toml.
This commit aims to refactor MemoryInfo initialization logics:
(1) Remove memory sizing/host-memory adjustment logic from QEMU cmdline
Memory::new()
(2) Initialize/adjust memory values via kata-types MemoryInfo (single
source of truth)
(3) Replace sysinfo::System::new_with_specifics with
nix::sys::sysinfo::sysinfo() to get host RAM
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Some constants are duplicated in runtime-rs even though they
are already defined in kata-types. Use the definitions from
kata-types as the single source of truth to avoid inconsistencies
between components (e.g. agent and runtime).
This change makes runtime-rs use the constants defined in
kata-types.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`DRIVER_BLK_CCW_TYPE` is defined as `blk-ccw`
in src/libs/kata-types/src/device.rs, so set
the variable in runtime-rs accordingly.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The VcpuThreadIds struct expects a mapping from vcpu_id to thread_id,
but get_ch_vcpu_tids() was inserting (tid, vcpu_id) instead of
(vcpu_id, tid).
This caused move_vcpus_to_sandbox_cgroup() to interpret vcpu IDs
(0, 1, 2...) as process IDs when sandbox_cgroup_only=false, leading
to failed attempts to read /proc/0/status.
Fixes: #12479
Signed-off-by: Chiranjeevi Uddanti <244287281+chiranjeevi-max@users.noreply.github.com>