Modern container runtimes (Docker 29+) no longer advertise their
identity through OCI hooks or annotations. Rather than attempting
fragile runtime detection, check for the symptom: no network
endpoints after sandbox creation.
- Remove IsDockerContainer guard from RescanNetwork goroutine
- Remove container kill on timeout (too aggressive without reliable
runtime detection, breaks CNI on slow architectures)
- Restore original startVM endpoint scan condition (fixes CNI
regression on s390x)
- RescanNetwork returns nil on timeout with warning instead of error
Signed-off-by: llink5 <llink5@users.noreply.github.com>
Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).
Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.
Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
field index, fixing parsing with optional mount tags (shared:,
master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start
Fixes: #9340
Signed-off-by: llink5 <llink5@users.noreply.github.com>
The govmm workflow isn't run by us and it and the other CI files
are just legacy from when it was a separate repo, so let's clean up
this debt rather than having to update it frequently.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update the action to resolve the following warning in GHA:
> Node.js 20 actions are deprecated. The following actions are running
> on Node.js 20 and may not work as expected:
> actions/checkout@11bd71901b.
> Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:
block_device_logical_sector_size (config file)
block_device_physical_sector_size (config file)
io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation)
io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)
The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.
Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.
Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.
Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
The enablement of the trusted ephemeral storage for IBM SEL was
missed in #10559. Set the emptydir_mode properly for the TEE.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Enable VFIO device pass-through at VM creation time on Cloud Hypervisor,
in addition to the existing hot-plug path.
Signed-off-by: Roaa Sakr <romoh@microsoft.com>
Use the container data storage feature for the k8s-nvidia-nim.bats
test pod manifests. This reduces the pods' memory requirements.
For this, enable the block-encrypted emptydir_mode for the NVIDIA
GPU TEE handlers.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
As the NVIDIA stack has shifted to using an image for both the
confidential and non-confidential variants, we retire the initrd
build.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.
Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.
Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.
Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
For cold-plug when running with nerdctl the timeouts in the config
are being used, increase the dial_timeout (e.g. for CreateSandbox) to match
create_container_timeout.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
* Introduces the `emptydir_mode` config flag to allow instructing the runtime
to create a block device for emptyDir volumes.
* The block device is created in the original emptyDir folder on the host
so that Kubelet can monitors its disk usage and evict the pod if it exceeds
its sizeLimit. This matches runc and virtio-fs.
* The block device's disk image file is sparse to minimize host disk
footprint.
Fixes: #10560
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Some VMMs support plugging a disk as an image file instead of a block device,
so we adapt the runtime to support that.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Co-authored-by: Aurélien Bombo <abombo@microsoft.com>
Virtio-mmio transport is not hardened for confidential computing (unlike
virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block
when confidential_guest is set, so CoCo guests only use virtio-blk-pci.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The remote hypervisor delegates VM creation to a remote service.
The VM runs on cloud infrastructure, not the local host kernel.
So requiring a KVM/MSHV device is semantically wrong and would
cause a hard failure on any host where these devices are absent
(e.g., a VM that doesn't expose nested virtualization).
Skip sandboxDevices() entirely when the configured hypervisor type
is remoteHypervisor{}.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Sometimes OVFM provides incorrect values to the kernel
we override it by telling the kernel to handle the PCI space setup
like allocating the proper window sizes and assigning the proper busses
to each device.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Stop computing and setting mem-reserve and pref64-reserve on PCIe root
ports and switch ports. Remove getBARsMaxAddressableMemory() which
scanned host GPU BARs to pre-calculate these values.
The previous approach only considered GPU devices (IsGPU(), class
0x0302) when scanning for BAR sizes, so devices like NVSwitches (class
0x0680) with their 32MB non-prefetchable BAR0 were not accounted for
and received the 4MB default. Additionally, GetTotalAddressableMemory()
classifies BARs by 32/64-bit address width rather than by the
prefetchable flag that QEMU's mem-reserve vs pref64-reserve maps to.
Modern QEMU introspects VFIO device BARs when they are attached to
root ports and sizes the MMIO windows accordingly. Modern OVMF
(edk2-stable202502+) automatically calculates the 64-bit PCI MMIO
aperture based on the BARs of actually present devices during PCI
enumeration. Omitting the reserve parameters lets QEMU and OVMF
handle MMIO window sizing correctly for all device types including
GPUs, NVSwitches, and NICs without requiring host-side BAR scanning.
This also removes the nvpci dependency from qemu_arch_base.go.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Modern OVMF (edk2-stable202502 and later) automatically sizes the
64-bit PCI MMIO aperture based on the BARs of actually attached
devices during PCI enumeration. The opt/ovmf/X-PciMmio64Mb fw_cfg
hint is no longer needed to ensure large-BAR devices like NVIDIA
GPUs receive adequate MMIO space.
The previous approach was fragile: the runtime scanned host PCI
devices to estimate the required aperture size, but only considered
GPU devices (class 0x0302), missing NVSwitches and other devices
with large BARs. Removing this code avoids confusion about MMIO
sizing responsibility.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Different kubernetes distributions, such as k0s, use a different kubelet
root dir location instead of the default /var/lib/kubelet, so ConfigMap
and Secret volume propagation were failing.
This adds a kubelet_root_dir config option that the go runtime uses when
matching volume paths and kata-deploy now sets it automatically for k0s
via a drop-in file.
runtime-rs does not need this option: it identifies ConfigMap/Secret,
projected, and downward-api volumes by volume-type path segment
(kubernetes.io~configmap, etc.), not by kubelet root prefix.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit adds logic to properly handle memory hotplug
for QemuCCWVirtio in the ExecMemdevAdd() path.
The new logic is triggered only when virtio-mem is enabled.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
ResizeMemory() already contains the virtio-mem resize logic.
However, hotplugAddMemory(), which is invoked via a different
path, lacked this handling and always fell back to the pc-dimm
path, even when virtio-mem was configured.
This commit adds virtio-mem resize handling to hotplugAddMemory().
It also adds corresponding unit tests.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Specify raw image format for all guest block devices.
- Attempting to auto-detect the image format from CLH would be riskier
for the Host.
- Creating a new raw image file, auto-detecting its format, and then
creating a filesystem from the Guest onto the block device is no
longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats
would fail without specifying the raw format when hot plugging its block
device.
- See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
```
v51.1
=====
This is a bug fix release. The following issues have been addressed:
* Fix image_type in OpenAPI definition (#7734)
v51.0
=====
This release has been tracked in v51.0 group of our roadmap project.
Security Fixes
This release fixes a security vulnerability in disk image handling.
Details can be found in GHSA-jmr4-g2hv-mjj6.
* A new `backing_files=on|off` option has been added to `--disk` to
explicitly control whether QCOW2 backing files are permitted. This
defaults to `off` to prevent the loading of backing files entirely.
(#7685)
* Explicit image type specification via the user interface, removing
reliance on format autodetection (#7728).
* Prevent sector-zero writes for autodetected raw images (#7728).
Significant QCOW2 v3 Improvements
A large number of QCOW2 v3 specification features have been implemented:
* RAW backing file support for QCOW2 overlays (#7570)
* Zero bit in L2 entries (#7627)
* Incompatible feature bit validation (#7612)
* Dirty bit support (#7636)
* Variable refcount widths (1 to 64-bit) (#7633)
* Corrupt bit detection and marking (#7639)
* Autoclear feature bits handling (#7648)
* Thread safety fix for multiple virtio queues (`num_queues > 1`)
(#7661)
* Correct zero-fill for reads beyond backing file size (#7678)
* Live disk resize support (#7687)
ACPI Generic Initiator Support
ACPI Generic Initiator Affinity (SRAT Type 5) support has been added
to associate VFIO-PCI devices with dedicated memory/CPU-less NUMA
nodes. This enables the guest OS to make NUMA-aware memory allocation
decisions for device workloads. A new `device_id` parameter has been
added to `--numa` for specifying VFIO devices. (#7626)
Block Device DISCARD and WRITE_ZEROES Support
The `virtio-blk` device now supports `DISCARD` and `WRITE_ZEROES`
operations for QCOW2 and RAW image formats. This enables thin
provisioning and efficient space reclamation when guests trim
filesystems. A new `sparse=on|off` option has been added to `--disk` to
control disk space management: `sparse=on` (default) enables thin
provisioning with space reclamation, while `sparse=off` provides thick
provisioning with consistent I/O latency. (#7666)
Notable Performance Improvements
* Transparent Huge Pages (THP) support has been extended to cover
anonymous shared memory (`shared=on`) via `madvise`. Previously, THP
was only used for non-shared memory. (#7646)
* The `vhost-user-net` device now uses the default set of vhost-user
virtio features, including `VIRTIO_F_RING_INDIRECT_DESC`, which
provides a performance improvement. (#7653)
MSHV Support Improvements
* Optimize CPU state update after emulation by only updating special
registers when changed (#7603)
* Enable SMT for guests with `threads_per_core > 1` (#7668)
* Stub `save_data_tables()` to unblock VM pause/resume (#7692)
* Handle `GHCB_INFO_SPECIAL_DBGPRINT` VMG exit in SEV-SNP guest exit
handler (#7703)
* Fix CVM boot failure on MSHV (#7548)
* Fix CPU topology detection for multithreaded configurations (#7576)
Notable Bug Fixes
* Fix VFIO device hot-remove leaving group and container file
descriptors open, preventing re-add (#7676)
* Fix snapshot restore when backing file is on read-only storage with
`shared=false` (#7674)
* Enforce `VIRTIO_BLK_F_RO` even if guest does not negotiate it
(#7705)
* Fix read-only block device FLUSH requests from OVMF preventing VMs
from booting (#7706)
* Fix vhost-user device not properly dropping unowned file descriptors
(#7679)
* Fix `vhost-user-block` `get_config` interoperability (#7617)
* Fix vsock TOCTOU race condition by copying packet header from guest
memory before processing (#7530)
* Fix vsock handling of large TX packets spanning multiple data
descriptors (#7680)
* Add `gettid()` to all seccomp filters (#7596)
* Fix MAC address parsing that wrongly allowed `+` instead of hex
characters (#7579)
* Improve UUID parse error message and `--net` fd help text (#7702)
* Fix various inconsistencies in our OpenAPI specification file
(#7716, #7726)
* Various documentation fixes (#7602, #7606)
```
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
I think that c727332b0e
broke the arm unit test by removing the arm specific overrides,
so update the expected output
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
QF1001: Distributing negation across terms and flipping operators, makes it
easy for humans to process expressions at a time, vs evaluating a whole block
and then flipping it and can allow for earlier exit
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
fixup: demorgans