kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-03-17 10:12:24 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	83dd7dcc75	runtimes: reject virtio-blk-mmio when confidential_guest is true Virtio-mmio transport is not hardened for confidential computing (unlike virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block when confidential_guest is set, so CoCo guests only use virtio-blk-pci. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-03-04 21:41:27 +01:00
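The check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual code: the `hypervisorConfig` struct, its field names, and `validateBlockDriver` are hypothetical; only the `confidential_guest` setting and the `virtio-blk-mmio` / `virtio-blk-pci` driver names come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// hypervisorConfig is a hypothetical, trimmed-down config that holds only the
// two settings the commit message talks about.
type hypervisorConfig struct {
	ConfidentialGuest bool
	BlockDeviceDriver string
}

const virtioBlkMMIO = "virtio-blk-mmio"

var errMMIOWithConfidentialGuest = errors.New(
	"virtio-blk-mmio cannot be used when confidential_guest is true")

// validateBlockDriver rejects the MMIO block transport for confidential
// guests and accepts every other combination.
func validateBlockDriver(cfg hypervisorConfig) error {
	if cfg.ConfidentialGuest && cfg.BlockDeviceDriver == virtioBlkMMIO {
		return errMMIOWithConfidentialGuest
	}
	return nil
}

func main() {
	for _, cfg := range []hypervisorConfig{
		{ConfidentialGuest: true, BlockDeviceDriver: "virtio-blk-mmio"},
		{ConfidentialGuest: true, BlockDeviceDriver: "virtio-blk-pci"},
		{ConfidentialGuest: false, BlockDeviceDriver: "virtio-blk-mmio"},
	} {
		fmt.Printf("%+v -> %v\n", cfg, validateBlockDriver(cfg))
	}
}
```

Failing at configuration time keeps the error close to its cause instead of surfacing later as a guest boot problem.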
Hyounggyu Choi	347ce5e3bc	runtime: Skip to call sandboxDevices() for remote hypervisor The remote hypervisor delegates VM creation to a remote service. The VM runs on cloud infrastructure, not the local host kernel. So requiring a KVM/MSHV device is semantically wrong and would cause a hard failure on any host where these devices are absent (e.g., a VM that doesn't expose nested virtualization). Skip sandboxDevices() entirely when the configured hypervisor type is remoteHypervisor{}. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-03-03 13:44:12 +01:00
Steve Horsman	b147cb1319	Merge pull request #12587 from fidencio/topic/runtime-add-configurable-kubelet-root-dir runtimes: add configurable kubelet root dir	2026-02-28 19:06:14 +00:00
Zvonko Kaiser	eec397ac08	qemu: Remove PCIe root port BAR reserve sizing Stop computing and setting mem-reserve and pref64-reserve on PCIe root ports and switch ports. Remove getBARsMaxAddressableMemory() which scanned host GPU BARs to pre-calculate these values. The previous approach only considered GPU devices (IsGPU(), class 0x0302) when scanning for BAR sizes, so devices like NVSwitches (class 0x0680) with their 32MB non-prefetchable BAR0 were not accounted for and received the 4MB default. Additionally, GetTotalAddressableMemory() classifies BARs by 32/64-bit address width rather than by the prefetchable flag that QEMU's mem-reserve vs pref64-reserve maps to. Modern QEMU introspects VFIO device BARs when they are attached to root ports and sizes the MMIO windows accordingly. Modern OVMF (edk2-stable202502+) automatically calculates the 64-bit PCI MMIO aperture based on the BARs of actually present devices during PCI enumeration. Omitting the reserve parameters lets QEMU and OVMF handle MMIO window sizing correctly for all device types including GPUs, NVSwitches, and NICs without requiring host-side BAR scanning. This also removes the nvpci dependency from qemu_arch_base.go. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-02-27 22:54:31 +01:00
Zvonko Kaiser	bb7fd335f3	qemu: Remove OVMF X-PciMmio64Mb fw_cfg hint Modern OVMF (edk2-stable202502 and later) automatically sizes the 64-bit PCI MMIO aperture based on the BARs of actually attached devices during PCI enumeration. The opt/ovmf/X-PciMmio64Mb fw_cfg hint is no longer needed to ensure large-BAR devices like NVIDIA GPUs receive adequate MMIO space. The previous approach was fragile: the runtime scanned host PCI devices to estimate the required aperture size, but only considered GPU devices (class 0x0302), missing NVSwitches and other devices with large BARs. Removing this code avoids confusion about MMIO sizing responsibility. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-02-27 22:54:31 +01:00
Fabiano Fidêncio	0a73638744	runtime: add configurable kubelet root dir Different kubernetes distributions, such as k0s, use a different kubelet root dir location instead of the default /var/lib/kubelet, so ConfigMap and Secret volume propagation were failing. This adds a kubelet_root_dir config option that the go runtime uses when matching volume paths and kata-deploy now sets it automatically for k0s via a drop-in file. runtime-rs does not need this option: it identifies ConfigMap/Secret, projected, and downward-api volumes by volume-type path segment (kubernetes.io~configmap, etc.), not by kubelet root prefix. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-27 14:10:57 +01:00
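The two matching strategies mentioned in this commit (kubelet root prefix versus volume-type path segment) can be contrasted with a small sketch. The function names and the example paths are assumptions made for illustration; only `/var/lib/kubelet`, the `kubelet_root_dir` option, and the `kubernetes.io~configmap` segment appear in the commit message.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// underKubeletRoot reports whether path lives below the configured kubelet
// root directory (the prefix-based approach, which needs kubelet_root_dir).
func underKubeletRoot(path, kubeletRootDir string) bool {
	root := filepath.Clean(kubeletRootDir)
	p := filepath.Clean(path)
	return p == root || strings.HasPrefix(p, root+string(filepath.Separator))
}

// hasVolumeTypeSegment reports whether any path segment equals
// "kubernetes.io~<volType>" (the segment-based approach, which does not
// depend on where the kubelet root is located).
func hasVolumeTypeSegment(path, volType string) bool {
	want := "kubernetes.io~" + volType
	for _, seg := range strings.Split(filepath.ToSlash(path), "/") {
		if seg == want {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical volume path under a non-default kubelet root.
	p := "/var/lib/k0s/kubelet/pods/1234/volumes/kubernetes.io~configmap/cfg"

	fmt.Println(underKubeletRoot(p, "/var/lib/kubelet"))     // prefix match with the default root
	fmt.Println(underKubeletRoot(p, "/var/lib/k0s/kubelet")) // prefix match with a configured root
	fmt.Println(hasVolumeTypeSegment(p, "configmap"))        // segment match, root-independent
}
```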
Hyounggyu Choi	be5ae7d1e1	Merge pull request #12573 from BbolroC/support-memory-hotplug-go-runtime-s390x runtime: Support memory hotplug via virtio-mem on s390x	2026-02-27 09:59:40 +01:00
Hyounggyu Choi	b9f3d5aa67	runtime: Support memory hotplug with virtio-mem on s390x This commit adds logic to properly handle memory hotplug for QemuCCWVirtio in the ExecMemdevAdd() path. The new logic is triggered only when virtio-mem is enabled. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-02-26 14:21:34 +01:00
Hyounggyu Choi	19771671c2	runtime: Handle virtio-mem resize in hotplugAddMemory() ResizeMemory() already contains the virtio-mem resize logic. However, hotplugAddMemory(), which is invoked via a different path, lacked this handling and always fell back to the pc-dimm path, even when virtio-mem was configured. This commit adds virtio-mem resize handling to hotplugAddMemory(). It also adds corresponding unit tests. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2026-02-26 14:21:34 +01:00
Dan Mihai	7973e4e2a8	runtime: clh: disable nested vCPUs on MSHV The recently-added nested property is true by default, but is not supported yet on MSHV. See cloud-hypervisor/cloud-hypervisor#7408 for additional information. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
Dan Mihai	dc398e801c	runtime: clh: specify raw image format Specify raw image format for all guest block devices. - Attempting to auto-detect the image format from CLH would be riskier for the Host. - Creating a new raw image file, auto-detecting its format, and then creating a filesystem from the Guest onto the block device is no longer supported by CLH v51. Therefore, Kata CI's k8s-block-volume.bats would fail without specifying the raw format when hot plugging its block device. - See cloud-hypervisor/cloud-hypervisor@b3e8e2a for additional information. Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
Dan Mihai	0629354ca0	versions: update cloud hypervisor to v51.1 ``` v51.1 ===== This is a bug fix release. The following issues have been addressed: * Fix image_type in OpenAPI definition (#7734) v51.0 ===== This release has been tracked in v51.0 group of our roadmap project. Security Fixes This release fixes a security vulnerability in disk image handling. Details can be found in GHSA-jmr4-g2hv-mjj6. * A new `backing_files=on\|off` option has been added to `--disk` to explicitly control whether QCOW2 backing files are permitted. This defaults to `off` to prevent the loading of backing files entirely. (#7685) * Explicit image type specification via the user interface, removing reliance on format autodetection (#7728). * Prevent sector-zero writes for autodetected raw images (#7728). Significant QCOW2 v3 Improvements A large number of QCOW2 v3 specification features have been implemented: * RAW backing file support for QCOW2 overlays (#7570) * Zero bit in L2 entries (#7627) * Incompatible feature bit validation (#7612) * Dirty bit support (#7636) * Variable refcount widths (1 to 64-bit) (#7633) * Corrupt bit detection and marking (#7639) * Autoclear feature bits handling (#7648) * Thread safety fix for multiple virtio queues (`num_queues > 1`) (#7661) * Correct zero-fill for reads beyond backing file size (#7678) * Live disk resize support (#7687) ACPI Generic Initiator Support ACPI Generic Initiator Affinity (SRAT Type 5) support has been added to associate VFIO-PCI devices with dedicated memory/CPU-less NUMA nodes. This enables the guest OS to make NUMA-aware memory allocation decisions for device workloads. A new `device_id` parameter has been added to `--numa` for specifying VFIO devices. (#7626) Block Device DISCARD and WRITE_ZEROES Support The `virtio-blk` device now supports `DISCARD` and `WRITE_ZEROES` operations for QCOW2 and RAW image formats. This enables thin provisioning and efficient space reclamation when guests trim filesystems. 
A new `sparse=on\|off` option has been added to `--disk` to control disk space management: `sparse=on` (default) enables thin provisioning with space reclamation, while `sparse=off` provides thick provisioning with consistent I/O latency. (#7666) Notable Performance Improvements * Transparent Huge Pages (THP) support has been extended to cover anonymous shared memory (`shared=on`) via `madvise`. Previously, THP was only used for non-shared memory. (#7646) * The `vhost-user-net` device now uses the default set of vhost-user virtio features, including `VIRTIO_F_RING_INDIRECT_DESC`, which provides a performance improvement. (#7653) MSHV Support Improvements * Optimize CPU state update after emulation by only updating special registers when changed (#7603) * Enable SMT for guests with `threads_per_core > 1` (#7668) * Stub `save_data_tables()` to unblock VM pause/resume (#7692) * Handle `GHCB_INFO_SPECIAL_DBGPRINT` VMG exit in SEV-SNP guest exit handler (#7703) * Fix CVM boot failure on MSHV (#7548) * Fix CPU topology detection for multithreaded configurations (#7576) Notable Bug Fixes * Fix VFIO device hot-remove leaving group and container file descriptors open, preventing re-add (#7676) * Fix snapshot restore when backing file is on read-only storage with `shared=false` (#7674) * Enforce `VIRTIO_BLK_F_RO` even if guest does not negotiate it (#7705) * Fix read-only block device FLUSH requests from OVMF preventing VMs from booting (#7706) * Fix vhost-user device not properly dropping unowned file descriptors (#7679) * Fix `vhost-user-block` `get_config` interoperability (#7617) * Fix vsock TOCTOU race condition by copying packet header from guest memory before processing (#7530) * Fix vsock handling of large TX packets spanning multiple data descriptors (#7680) * Add `gettid()` to all seccomp filters (#7596) * Fix MAC address parsing that wrongly allowed `+` instead of hex characters (#7579) * Improve UUID parse error message and `--net` fd help text (#7702) * Fix 
various inconsistencies in our OpenAPI specification file (#7716, #7726) * Various documentation fixes (#7602, #7606) ``` Signed-off-by: Dan Mihai <dmihai@microsoft.com>	2026-02-25 21:01:25 +00:00
stevenhorsman	ef1b0b2913	runtime: Fix mismatch in receiver names Fix: `ST1016: methods on the same type should have the same receiver name` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	1b2ca678e5	runtime: Fix identifier names Fix identifiers that are non-compliant with Go's conventions, e.g. not capitalising initialisms Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	69fea195f9	runtime: Fix arm unit test I think that `c727332b0e` broke the arm unit test by removing the arm specific overrides, so update the expected output Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	8f7a2b3d5d	runtime: Add copyright & licenses Add missing headers Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	f84b462b95	runtime: Fix typo in comment Fix `requiered` is a misspelling of `required` (misspell) Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	15813564f7	runtime: Avoid using fmt.Sprintf("%s", x) It's more efficient and concise to just call .String() Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
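As an illustration of the pattern this commit removes, here is a minimal sketch with a hypothetical `driver` type that implements `fmt.Stringer`; the type and its values are not taken from the runtime.

```go
package main

import "fmt"

// driver is a hypothetical type that implements fmt.Stringer.
type driver int

func (d driver) String() string {
	if d == 0 {
		return "virtio-blk"
	}
	return "virtio-scsi"
}

// viaSprintf goes through the formatting machinery only to call String().
func viaSprintf(d driver) string {
	return fmt.Sprintf("%s", d)
}

// viaString calls the method directly, which is shorter and avoids the
// formatting overhead.
func viaString(d driver) string {
	return d.String()
}

func main() {
	fmt.Println(viaSprintf(driver(1)), viaString(driver(1)))
}
```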
stevenhorsman	a577685a8a	runtime: Apply De Morgan's law QF1001: Distributing negation across terms and flipping operators, makes it easy for humans to process expressions at a time, vs evaluating a whole block and then flipping it and can allow for earlier exit Signed-off-by: stevenhorsman <steven@uk.ibm.com> fixup: demorgans	2026-02-24 14:33:04 +00:00
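The rewrite that QF1001 suggests can be shown on a toy predicate. The function and parameter names below are hypothetical; the sketch only demonstrates that the two forms are equivalent.

```go
package main

import "fmt"

// needsFallbackBefore negates a whole conjunction.
func needsFallbackBefore(hotplugSupported, virtioMemEnabled bool) bool {
	return !(hotplugSupported && virtioMemEnabled)
}

// needsFallbackAfter distributes the negation (De Morgan's law), so each
// term can be read, and short-circuited, on its own.
func needsFallbackAfter(hotplugSupported, virtioMemEnabled bool) bool {
	return !hotplugSupported || !virtioMemEnabled
}

func main() {
	for _, a := range []bool{false, true} {
		for _, b := range []bool{false, true} {
			fmt.Println(a, b, needsFallbackBefore(a, b), needsFallbackAfter(a, b))
		}
	}
}
```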
stevenhorsman	e86338c9c0	runtime: Remove explicit types in variable declarations QF1011 - use the short declaration as the type can be inferred Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	f60ee411f0	runtime: Update poorly chosen Duration names ST1011 - having time.Duration values with variable names of MS/Secs is misleading Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	6562ec5b61	runtime: Merge conditional assignment Fix `QF1007: could merge conditional assignment into variable declaration` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	8cd3aa8c84	runtime: Remove embedded field from selector GenericDevice is an embedded (anonymous) field in the device struct, so its fields and methods are "promoted" to the outer struct, so we go straight to it. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
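Field and method promotion through an embedded struct can be sketched as below. The struct layout is a simplified assumption; only the name `GenericDevice` comes from the commit message, and `blockDevice` with its fields is hypothetical.

```go
package main

import "fmt"

// GenericDevice stands in for the embedded type named in the commit message;
// its fields here are hypothetical.
type GenericDevice struct {
	ID string
}

// DeviceID is promoted to any struct that embeds *GenericDevice.
func (g *GenericDevice) DeviceID() string {
	return g.ID
}

// blockDevice embeds *GenericDevice anonymously.
type blockDevice struct {
	*GenericDevice
	Path string
}

func main() {
	d := blockDevice{GenericDevice: &GenericDevice{ID: "blk0"}, Path: "/dev/vda"}

	// Explicit selector through the embedded field (the form being removed).
	fmt.Println(d.GenericDevice.ID, d.GenericDevice.DeviceID())

	// Promoted field and method (the shorter, equivalent form).
	fmt.Println(d.ID, d.DeviceID())
}
```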
stevenhorsman	312567a137	runtime: Fix double imports Remove one of the double imports to tidy up the code Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	93c77a7d4e	runtime: Improve print statement fix `QF1012: Use fmt.Fprintf(...) instead of Write([]byte(fmt.Sprintf(...))) (staticcheck)` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
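The QF1012 rewrite can be illustrated with two equivalent helpers. The helper names and the `vcpus` key are made up for the example; `fmt.Fprintf`, `fmt.Sprintf`, and `io.Writer` are standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// writeBefore formats into a temporary string, converts it to a byte slice,
// and then writes it.
func writeBefore(w io.Writer, key string, value int) error {
	_, err := w.Write([]byte(fmt.Sprintf("%s=%d\n", key, value)))
	return err
}

// writeAfter formats straight into the writer.
func writeAfter(w io.Writer, key string, value int) error {
	_, err := fmt.Fprintf(w, "%s=%d\n", key, value)
	return err
}

// render runs one of the helpers against an in-memory buffer.
func render(write func(io.Writer, string, int) error) string {
	var buf bytes.Buffer
	if err := write(&buf, "vcpus", 4); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	fmt.Print(render(writeBefore))
	fmt.Print(render(writeAfter))
}
```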
stevenhorsman	cff8994336	runtime: Switch to switch statements Resolve: `QF1003: could use tagged switch on major (staticcheck)` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:22:10 +00:00
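A tagged switch of the kind QF1003 asks for might look like this. The version numbers and labels are invented for the example; only the variable name `major` is taken from the lint message quoted in the commit.

```go
package main

import "fmt"

// labelBefore compares the same variable in an if/else-if chain.
func labelBefore(major int) string {
	if major == 1 {
		return "legacy"
	} else if major == 2 {
		return "current"
	} else {
		return "unknown"
	}
}

// labelAfter uses a switch tagged on major, which states the compared
// variable once.
func labelAfter(major int) string {
	switch major {
	case 1:
		return "legacy"
	case 2:
		return "current"
	default:
		return "unknown"
	}
}

func main() {
	for _, m := range []int{1, 2, 3} {
		fmt.Println(m, labelBefore(m), labelAfter(m))
	}
}
```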
Fabiano Fidêncio	cadbf51015	versions: Update Cloud Hypervisor to v50.0 ``` This release has been tracked in v50.0 group of our roadmap project. Configurable Nested Virtualization Option on x86_64 The nested=on\|off option has been added to --cpu to allow users to configure nested virtualization support in the guest on x86_64 hosts (for both KVM and MSHV). The default value is on to maintain consistency with existing behavior. (#7408) Compression Support for QCOW2 QCOW2 support has been extended to handle compression clusters based on zlib and zstd. (#7462) Notable Performance Improvements Performance of live migration has been improved via an optimized implementation of dirty bitmap maintenance. (#7468) Live Disk Resizing Support for Raw Images The /vm.resize-disk API has been introduced to allow users to resize block devices backed by raw images while a guest is running. (#7476) Developer Experience Improvements Significant improvements have been made to developer experience and productivity. These include a simplified root manifest, codified and tightened Clippy lints, and streamlined workflows for cargo clippy and cargo test. (#7489) Improved File-level Locking Support Block devices now use byte-range advisory locks instead of whole-file locks. While both approaches prevent multiple Cloud Hypervisor instances from simultaneously accessing the same disk image with write permissions, byte-range locks provide better compatibility with network storage backends. (#7494) Logging Improvements Logs now include event information generated by the event-monitor module. 
(#7512) Notable Bug Fixes * Fix several issues around CPUID in the guest (#7485, #7495, #7508) * Fix snapshot/restore for Windows Guest (#7492) * Respect queue size in block performance tests (#7515) * Fix several Serial Manager issues (#7502) * Fix several seccomp violation issues (#7477, #7497, #7518) * Fix various issues around block and qcow (#7526, #7528, #7537, #7546, #7549) * Retrieve MSRs list correctly on MSHV (#7543) * Fix live migration (and snapshot/restore) with AMX state (#7534) ``` Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-02-19 20:42:50 +01:00
Aurélien Bombo	48aa077e8c	runtime{,-rs}/qemu/arm64: Disable DAX Enabling full-featured QEMU NVDIMM support on ARM with DAX enabled causes a kernel panic in caches_clean_inval_pou (see below, different issue from `33b1f07`), so we disable DAX in that environment. [ 1.222529] EXT4-fs (pmem0p1): mounted filesystem e5a4892c-dac8-42ee-ba55-27d4ff2f38c3 ro with ordered data mode. Quota mode: disabled. [ 1.222695] VFS: Mounted root (ext4 filesystem) readonly on device 259:1. [ 1.224890] devtmpfs: mounted [ 1.225175] Freeing unused kernel memory: 1920K [ 1.226102] Run /sbin/init as init process [ 1.226164] with arguments: [ 1.226204] /sbin/init [ 1.226235] with environment: [ 1.226268] HOME=/ [ 1.226295] TERM=linux [ 1.230974] Internal error: synchronous external abort: 0000000096000010 [#1] SMP [ 1.231963] CPU: 0 UID: 0 PID: 1 Comm: init Tainted: G M 6.18.5 #1 NONE [ 1.232965] Tainted: [M]=MACHINE_CHECK [ 1.233428] pstate: 43400005 (nZcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) [ 1.234273] pc : caches_clean_inval_pou+0x68/0x84 [ 1.234862] lr : sync_icache_aliases+0x30/0x38 [ 1.235412] sp : ffff80008000b9a0 [ 1.235842] x29: ffff80008000b9a0 x28: 0000000000000000 x27: 00000000019a00e1 [ 1.236912] x26: ffff80008000bc08 x25: ffff80008000baf0 x24: fffffdffc0000000 [ 1.238064] x23: ffff000001671ab0 x22: ffff000001663480 x21: fffffdffc23401c0 [ 1.239356] x20: fffffdffc23401c0 x19: fffffdffc23401c0 x18: 0000000000000000 [ 1.240626] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 1.241762] x14: ffffaae8f021b3b0 x13: 0000000000000000 x12: ffffaae8f021b3b0 [ 1.242874] x11: ffffffffffffffff x10: 0000000000000000 x9 : 0000ffffbb53c000 [ 1.244022] x8 : 0000000000000000 x7 : 0000000000000012 x6 : ffff55178f5e5000 [ 1.245157] x5 : ffff80008000b970 x4 : ffff00007fa4f680 x3 : ffff00008d007000 [ 1.246257] x2 : 0000000000000040 x1 : ffff00008d008000 x0 : ffff00008d007000 [ 1.247387] Call trace: [ 1.248056] caches_clean_inval_pou+0x68/0x84 (P) [ 1.248923] 
__sync_icache_dcache+0x7c/0x9c [ 1.249578] insert_page_into_pte_locked+0x1e4/0x284 [ 1.250432] insert_page+0xa8/0xc0 [ 1.251080] vmf_insert_page_mkwrite+0x40/0x7c [ 1.251832] dax_iomap_pte_fault+0x598/0x804 [ 1.252646] dax_iomap_fault+0x28/0x30 [ 1.253293] ext4_dax_huge_fault+0x80/0x2dc [ 1.253988] ext4_dax_fault+0x10/0x3c [ 1.254679] __do_fault+0x38/0x12c [ 1.255293] __handle_mm_fault+0x530/0xcf0 [ 1.255990] handle_mm_fault+0xe4/0x230 [ 1.256697] do_page_fault+0x17c/0x4dc [ 1.257487] do_translation_fault+0x30/0x38 [ 1.258184] do_mem_abort+0x40/0x8c [ 1.258895] el0_ia+0x4c/0x170 [ 1.259420] el0t_64_sync_handler+0xd8/0xdc [ 1.260154] el0t_64_sync+0x168/0x16c [ 1.260795] Code: d2800082 9ac32042 d1000443 8a230003 (d50b7523) [ 1.261756] ---[ end trace 0000000000000000 ]--- Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-02-18 11:52:43 -06:00
Aurélien Bombo	c727332b0e	runtime/qemu/arm64: Align NVDIMM usage on amd64 Nowadays on arm64 we use a modern QEMU version which supports the features we require for NVDIMM, so we remove the arm64-specific code and use the generic implementation. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-02-18 11:47:53 -06:00
Aurélien Bombo	e17f96251d	runtime{,-rs}/clh: Disable virtio-pmem This disables virtio-pmem support for Cloud Hypervisor by changing Kata config defaults and removing the relevant code paths. Signed-off-by: Aurélien Bombo <abombo@microsoft.com>	2026-02-18 11:47:53 -06:00
Markus Rudy	8365afa336	qemu: log exit code after failure When qemu exits prematurely, we usually see a message like msg="Cannot start VM" error="exiting QMP loop, command cancelled" This is an indirect hint, caused by the QMP server shutting down. It takes experience to understand what it even means, and it still does not show what's actually the problem. With this commit, we're taking the error return from the qemu subprocess and surface it in the logs, if it's not nil. This means we automatically capture any non-zero exit codes in the logs. Signed-off-by: Markus Rudy <mr@edgeless.systems>	2026-02-17 21:03:13 +01:00
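Surfacing a subprocess exit code, as this commit describes, can be sketched with the standard `os/exec` package. The helper names and message wording are assumptions, and the example runs `sh` rather than QEMU, so it presumes a POSIX shell is available.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// describeExit turns the error returned by waiting on a subprocess into a
// log-friendly message. A nil error yields an empty string, so nothing is
// logged for a clean exit.
func describeExit(err error) string {
	if err == nil {
		return ""
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return fmt.Sprintf("hypervisor exited with code %d", exitErr.ExitCode())
	}
	return fmt.Sprintf("hypervisor wait failed: %v", err)
}

// runAndDescribe starts a command, waits for it, and describes the outcome.
func runAndDescribe(name string, args ...string) string {
	return describeExit(exec.Command(name, args...).Run())
}

func main() {
	// Assumes a POSIX shell is on PATH; stands in for a hypervisor that
	// exits prematurely with a non-zero status.
	fmt.Println(runAndDescribe("sh", "-c", "exit 3"))
}
```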
Aurélien Bombo	981f693a88	Merge pull request #11140 from balintTobik/hyperv_warning runtime: refactor hypervisor devices cgroup creation	2026-02-13 15:16:09 -06:00
stevenhorsman	55a89f6836	runtime: doc: Remove usage of golang.org/x/net/context This package is deprecated and we aren't using it any more Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-13 17:55:23 +01:00
Balint Tobik	295a6a81d0	runtime: refactor hypervisor devices cgroup creation Add hypervisor devices to the cgroup separately to omit irrelevant warnings, and fail if none of them is available. Also fix a test case so that it reloads removed kernel modules for later test cases, and skip some tests on ARM because of the lack of virtualization support Fixes #6656 Signed-off-by: Balint Tobik <btobik@redhat.com>	2026-02-13 09:23:08 +01:00
stevenhorsman	e84d234721	doc: Update broken/slow URLs Update the URLs to better/existing links Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-10 21:58:28 +01:00
Fabiano Fidêncio	5c0269881e	tests: Make editorconfig-checker happy - Trim trailing whitespace and ensure final newline in non-vendor files - Add .editorconfig-checker.json excluding vendor dirs, .patch, .img, .dtb, .drawio, *.svg, and pkg/cloud-hypervisor/client so CI only checks project code - Leave generated and binary assets unchanged (excluded from checker) Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-02-10 21:58:28 +01:00
Konstantin Khlebnikov	5d99a141d9	runtime: add hypervisor options for NUMA topology With enable_numa=true, the hypervisor exposes the host NUMA topology as is: VM NUMA nodes are mapped 1:1 to host nodes and vCPUs are bound to the related CPUs. The "numa_mapping" option allows redefining the NUMA node mapping: - map each VM node to a particular host node or to several NUMA nodes - emulate NUMA on a host without NUMA (useful for tests) Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Co-authored-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-02-09 20:09:25 +01:00
Alex Lyn	41e8acbc5e	runtime: Map empty ReadStdout/ReadStderr response to io.EOF After the kata-agent "drain-after-exit" change, stdout/stderr EOF is signaled by a successful ReadStdout/ReadStderr reply with empty Data (len==0), instead of an RPC error. However, runtime-go currently returns (0, nil) to io.CopyBuffer() when resp.Data is empty, which violates Go io.Reader semantics and can cause `kubectl exec` to hang after the command output is already printed. To avoid exec hang: In readProcessStream(), map an empty response (len(resp.Data)==0) into (0, io.EOF). This allows the stdout/stderr copy goroutines to terminate, closes exitIOch, and unblocks the wait path so exec can complete normally. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-02-09 15:56:13 +01:00
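The `io.Reader` behaviour this commit relies on can be sketched as follows. The `streamResponse` and `agentStream` types are hypothetical stand-ins for the agent reply and the runtime's stream reader, and the sketch uses `io.Copy` where the commit mentions `io.CopyBuffer`; the point is only that an empty reply is turned into `(0, io.EOF)` so the copy loop terminates.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// streamResponse is a hypothetical stand-in for a ReadStdout/ReadStderr reply.
type streamResponse struct {
	Data []byte
}

// agentStream adapts a queue of replies to io.Reader.
type agentStream struct {
	replies []streamResponse
	pending []byte
	done    bool
}

func (s *agentStream) Read(p []byte) (int, error) {
	if s.done {
		return 0, io.EOF
	}
	if len(s.pending) == 0 {
		if len(s.replies) == 0 {
			s.done = true
			return 0, io.EOF
		}
		resp := s.replies[0]
		s.replies = s.replies[1:]
		// An empty reply signals end of stream: report io.EOF instead of
		// (0, nil), otherwise the copy loop keeps calling Read.
		if len(resp.Data) == 0 {
			s.done = true
			return 0, io.EOF
		}
		s.pending = resp.Data
	}
	n := copy(p, s.pending)
	s.pending = s.pending[n:]
	return n, nil
}

// drain copies the whole stream into a string.
func drain(r io.Reader) (string, error) {
	var sb strings.Builder
	_, err := io.Copy(&sb, r)
	return sb.String(), err
}

func main() {
	s := &agentStream{replies: []streamResponse{
		{Data: []byte("hello\n")},
		{Data: nil}, // empty reply: process output is finished
	}}
	out, err := drain(s)
	fmt.Printf("%q %v\n", out, err)
}
```

`io.Copy` treats `io.EOF` from the source as a normal end of input and returns a nil error, which is what lets the copy goroutines finish.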
Manuel Huber	a786582d0b	rootfs: deprecate initramfs dm-verity mode Remove the initramfs folder, its build steps, and use the kernel based dm-verity enforcement for the handlers which used the initramfs mode. Also, remove the initramfs verity mode capability from the shims and their configs. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-02-05 23:04:35 +01:00
Manuel Huber	7958be8634	runtime: Make kernel_verity_params overwritable Similar to the kernel_params annotation, add a kernel_verity_params annotation and add logic to make these parameters overwritable. For instance, this can be used in test logic to provide bogus dm-verity hashes for negative tests. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-02-05 23:04:35 +01:00
Manuel Huber	f639c3fa17	runtime: Enable kernelinit dm-verity variant This change introduces the kernel_verity_parameters knob to the Go based shim, picking up dm-verity information in a new config field (the corresponding build variable is already produced by the shim build). The change extends the shim to parse dm-verity information from this parameter and to construct the kernel command line appropriately, based on the indicated initramfs or kernelinit build variant. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-02-05 23:04:35 +01:00
Joji Mekkattuparamban	1440dd7468	shim: enforce iommufd for confidential guest vfio Confidential guests cannot use traditional IOMMU Group based VFIO. Instead, they need to use IOMMUFD. This is mainly because the group abstraction is incompatible with a confidential device model. If traditional VFIO is specified for a confidential guest, detect the error and bail out early. Fixes #12393 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2026-01-28 00:11:38 +01:00
XanderC	93beb58c5d	runtime: fix network initialization for non-hotplug VMMs In startVM(), for VMMs without hotplug support (e.g., Firecracker or QEMU microvm), the runtime runs prestart hooks but misses rescanning the network namespace. This causes VMs to boot with uninitialized network configs, as updates from CNI plugins are not captured. This patch adds a network rescan via AddEndpoints after prestart hooks for the non-hotplug path, ensuring correct network info is passed to the VMM configuration before the VM starts. Fixes #11500 Signed-off-by: XanderC <xanderc@qq.com>	2026-01-17 23:56:59 +01:00
Fabiano Fidêncio	33b1f0786e	Revert "arm64: Do not use DAX with the rootfs image" This reverts commit `2acb94ef2d`, as we have a kernel patch approved fixing the issue. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-17 19:15:53 +01:00
Manuel Huber	956f43c6c6	runtime: skip MoveTo for systemd cgroups Systemd-managed cgroups use the slice:prefix:name format, which is not a filesystem path. Calling MoveTo() on such paths fails with "invalid group path" and can abort cleanup before Delete() runs. In some cases, this causes pod teardown delays. Skip MoveTo for systemd-formatted sandbox/overhead cgroup paths when sandbox_cgroup_only is true; systemd moves tasks on unit deletion. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-01-16 16:41:38 +01:00
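The decision described in this commit can be sketched as a small predicate. Both function names and the exact detection rule (three colon-separated fields with a first field ending in `.slice`) are assumptions for illustration; the commit message only states the `slice:prefix:name` format and the `sandbox_cgroup_only` condition.

```go
package main

import (
	"fmt"
	"strings"
)

// isSystemdCgroupPath reports whether path looks like "slice:prefix:name".
// The ".slice" suffix check is an assumption of this sketch.
func isSystemdCgroupPath(path string) bool {
	parts := strings.Split(path, ":")
	if len(parts) != 3 {
		return false
	}
	return strings.HasSuffix(parts[0], ".slice") && parts[1] != "" && parts[2] != ""
}

// shouldMoveTo reports whether tasks should be moved explicitly before the
// cgroup is deleted. For systemd-formatted paths with sandbox_cgroup_only
// set, the move is skipped.
func shouldMoveTo(path string, sandboxCgroupOnly bool) bool {
	if sandboxCgroupOnly && isSystemdCgroupPath(path) {
		return false
	}
	return true
}

func main() {
	for _, p := range []string{
		"kubepods-besteffort.slice:cri-containerd:abc123", // hypothetical systemd form
		"/kubepods/besteffort/pod42/abc123",               // hypothetical filesystem form
	} {
		fmt.Println(p, shouldMoveTo(p, true))
	}
}
```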
Fabiano Fidêncio	2acb94ef2d	arm64: Do not use DAX with the rootfs image Kernel 6.18.x has an issue with DAX, which is not yet fixed upstream: ``` [ 0.737679] EXT4-fs (pmem0p1): mounted filesystem 79676804-7c8b-491a-b2a6-9bae3c72af70 ro with ordered data mode. Quota mode: disabled. [ 0.737891] VFS: Mounted root (ext4 filesystem) readonly on device 259:1. [ 0.739119] devtmpfs: mounted [ 0.739476] Freeing unused kernel memory: 1920K [ 0.740156] Run /sbin/init as init process [ 0.740229] with arguments: [ 0.740286] /sbin/init [ 0.740321] with environment: [ 0.740369] HOME=/ [ 0.740400] TERM=linux [ 0.743162] Unable to handle kernel paging request at virtual address fffffdffbf000008 [ 0.743285] Mem abort info: [ 0.743316] ESR = 0x0000000096000006 [ 0.743371] EC = 0x25: DABT (current EL), IL = 32 bits [ 0.743444] SET = 0, FnV = 0 [ 0.743489] EA = 0, S1PTW = 0 [ 0.743545] FSC = 0x06: level 2 translation fault [ 0.743610] Data abort info: [ 0.743656] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000 [ 0.743720] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 0.743785] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 0.743848] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000b9d17000 [ 0.743931] [fffffdffbf000008] pgd=10000000bfa3d403, p4d=10000000bfa3d403, pud=1000000040bfe403, pmd=0000000000000000 [ 0.744070] Internal error: Oops: 0000000096000006 [#1] SMP [ 0.748888] CPU: 0 UID: 0 PID: 1 Comm: init Not tainted 6.18.4 #1 NONE [ 0.749421] pstate: 004000c5 (nzcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 0.749969] pc : dax_disassociate_entry.constprop.0+0x20/0x50 [ 0.750444] lr : dax_insert_entry+0xcc/0x408 [ 0.750802] sp : ffff80008000b9e0 [ 0.751083] x29: ffff80008000b9e0 x28: 0000000000000000 x27: 0000000000000000 [ 0.751682] x26: 0000000001963d01 x25: ffff0000004f7d90 x24: 0000000000000000 [ 0.752264] x23: 0000000000000000 x22: ffff80008000bcc8 x21: 0000000000000011 [ 0.752836] x20: ffff80008000ba90 x19: 0000000001963d01 x18: 0000000000000000 [ 0.753407] x17: 0000000000000000 x16: 
0000000000000000 x15: 0000000000000000 [ 0.753970] x14: ffffbf3154b9ae70 x13: 0000000000000000 x12: ffffbf3154b9ae70 [ 0.754548] x11: ffffffffffffffff x10: 0000000000000000 x9 : 0000000000000000 [ 0.755122] x8 : 000000000000000d x7 : 000000000000001f x6 : 0000000000000000 [ 0.755707] x5 : 0000000000000000 x4 : 0000000000000000 x3 : fffffdffc0000000 [ 0.756287] x2 : 0000000000000008 x1 : 0000000040000000 x0 : fffffdffbf000000 [ 0.756871] Call trace: [ 0.757107] dax_disassociate_entry.constprop.0+0x20/0x50 (P) [ 0.757592] dax_iomap_pte_fault+0x4fc/0x808 [ 0.757951] dax_iomap_fault+0x28/0x30 [ 0.758258] ext4_dax_huge_fault+0x80/0x2dc [ 0.758594] ext4_dax_fault+0x10/0x3c [ 0.758892] __do_fault+0x38/0x12c [ 0.759175] __handle_mm_fault+0x530/0xcf0 [ 0.759518] handle_mm_fault+0xe4/0x230 [ 0.759833] do_page_fault+0x17c/0x4dc [ 0.760144] do_translation_fault+0x30/0x38 [ 0.760483] do_mem_abort+0x40/0x8c [ 0.760771] el0_ia+0x4c/0x170 [ 0.761032] el0t_64_sync_handler+0xd8/0xdc [ 0.761371] el0t_64_sync+0x168/0x16c [ 0.761677] Code: f9453021 f2dfbfe3 cb813080 8b001860 (f9400401) [ 0.762168] ---[ end trace 0000000000000000 ]--- [ 0.762550] note: init[1] exited with irqs disabled [ 0.762631] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ``` For now, we limit the rootfs that we ship to ARM64 to not use DAX, in the future we'll re-enable it as soon as the patch lands on mainstream kernel. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-01-14 11:46:40 +01:00
LandonTClipp	137866f793	runtime: Allow QMP commands to be logged in debug level Logging the QMP commands gives us a lot of flexibility to troubleshoot issues with what is being sent to QEMU. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
LandonTClipp	a3b5764f67	runtime: Fix import cycle and add unit test for IOMMUFDID() An import cycle was introduced because of a mutual need for the constant that describes the prefix of IOMMUFD files. We need to extract this out into a higher-level package. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
LandonTClipp	09438fd54f	runtime: Add IOMMUFD Object Creation for QEMU QMP Commands The QMP commands sent to QEMU did not properly set up IOMMUFD objects in the codepath that handles VFIO device hot-plugging. This is mainly relevant in the Kubernetes use-case where the VFIO devices are not available when QEMU is first launched. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
Manuel Huber	cb8fd2e3b1	runtime: gpu: Skip CDI annos for pause container The pause container does not need CDI annotations, these are only intended for workload containers. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2025-12-10 13:26:04 +01:00


1317 Commits