kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-07-01 22:50:54 +00:00

Author	SHA1	Message	Date
Fabiano Fidêncio	bddf1ecab4	build: stop producing cloud-hypervisor-glibc artifacts Drop cloud-hypervisor-glibc from local and CI kata-deploy build targets now that Azure CLH uses the standard cloud-hypervisor artifact set. This removes obsolete build matrix entries and installer target handling. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	81ce51a9aa	ci: target Azure CLH runtimes directly in AKS tests Switch AKS Mariner matrix entries to clh-azure handlers and remove the temporary host-OS based helm value overrides. Update integration test wiring and required test labels so CI tracks the new runtime names. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	8c3a2c1a95	kata-deploy: register clh-azure shim families Add clh-azure and clh-azure-runtime-rs as first-class shims across installer logic, helm defaults, runtimeclass overhead mapping, and shim component catalogs. This aligns deploy payload selection with the new native Azure-specific CLH configs. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	f36c383b4f	runtime: generate dedicated CLH Azure config variants Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH configs during build. This keeps Mariner-specific defaults in explicit config artifacts instead of ad-hoc runtime mutation. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-28 23:32:37 +02:00
Fabiano Fidêncio	a423cf9526	Merge pull request #13087 from bpradipt/landlock kernel: Enable landlock LSM	2026-05-27 17:34:47 +02:00
Pradipta Banerjee	1487eaaaa2	kernel: Enable landlock LSM Allows using landlock LSM for the container process Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>	2026-05-27 13:33:46 +02:00
Fabiano Fidêncio	5adfb27297	Merge pull request #13118 from PiotrProkop/fix-missing-cwd agent: restore process CWD auto-creation	2026-05-27 13:32:05 +02:00
Fabiano Fidêncio	614dff4bfc	Merge pull request #13119 from manuelh-dev/mahuber/erofs-multi-layer-fix agent: compact EROFS overlay lowerdirs	2026-05-27 11:27:46 +02:00
Fabiano Fidêncio	238dd51039	Merge pull request #13108 from thebigbone/containerd-config containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0	2026-05-27 10:14:51 +02:00
PiotrProkop	60a2e27f02	agent: Restore process CWD auto-creation Commit `b56313472` ("agent: Align agent OCI spec with oci-spec-rs", PR #9944) inverted the condition guarding the create_dir_all call for process.cwd: the leading `!` was dropped during the refactor. As a result, the CWD is created only when process.cwd is the empty string. When the guest then runs chdir(process.cwd) and CWD doesn't exist it returns ENOENT. The agent propagates that to the shim, which surfaces it to containerd as "failed to create shim task: ENOENT: No such file or directory" — indistinguishable from a missing argv[0]. This regressed the original fix in PR #2375 (Fixes #2374), which deliberately mirrored runc's behavior. Put the `!` back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: PiotrProkop <pprokop@nvidia.com>	2026-05-27 09:59:15 +02:00
Fabiano Fidêncio	f1c100797b	Merge pull request #12955 from zvonkok/nvgpu-target build: add nvgpu-tarball target	2026-05-27 09:44:37 +02:00
Fabiano Fidêncio	64056add0d	build: add passthrough mode to kata-deploy-merge-builds kata-deploy now unpacks individual component tarballs itself, so the final `kata-static.tar.zst` no longer needs to be a merged filesystem payload. Merging everything has two downsides for that flow: - It pulls in everything kept on disk under build/, which previously forced us to also drop agent/busybox/coco-guest-components/nydus from the build set to keep them out of the final tarball. - The merged tarball duplicates content kata-deploy will repack on its own anyway. Add a `passthrough` mode to kata-deploy-merge-builds.sh that, instead of untarring each `kata-static-*.tar.zst` into a single filesystem tree, copies the selected component tarballs into the final tarball as-is. The existing `merge` mode remains the default to preserve the non-kata-deploy install paths (e.g. `make install-tarball`). Wire `nvgpu-tarball` to the new mode via `FINAL_TARBALL_MERGE_MODE= passthrough`, paired with the existing `FINAL_TARBALL_INPUTS` allowlist. This lets us keep agent/busybox/coco as build prereqs of the GPU rootfs while shipping a final tarball that only contains the NVIDIA-relevant components. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	9b85bff2b4	build: don't double-prefix absolute versions.yaml path in merge-builds The Makefile passes $(MK_DIR)/../../../../versions.yaml — already an absolute path — to kata-deploy-merge-builds.sh. The script then unconditionally prepended ${PWD}/, producing a malformed path like: /repo//repo/tools/.../local-build//../../../../versions.yaml which made cp fail with "No such file or directory" at the merge-builds step (the very last step of `make nvgpu-tarball`). Only prepend ${PWD}/ when the input is relative — that preserves the original fix for the pushd-changes-cwd issue (commit `ae6e8d2b3`) without mangling absolute paths from Makefile callers. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Assisted-By: Claude <noreply@anthropic.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	5aa6229eba	build: group parallel build output by target With `make all -j N` running multiple tarballs concurrently and silent mode redirecting each build's stdio to its per-target log, a failing target's "Failed to build: <name>, logs:" banner gets interleaved with other in-flight jobs' output, making it hard to tell which target failed. Pass `--output-sync=target` to the recursive make so each sub-make's output is buffered and emitted as one block when the target finishes, keeping the failure banner contiguous with its log dump. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Assisted-By: Claude <noreply@anthropic.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	3be370d2d6	qemu: clean stale clone before fetching sources build-qemu.sh runs in the per-target builddir (e.g. build/qemu-tarball/builddir/), which persists across runs. If a previous build left the cloned `qemu` tree behind (e.g. after an interrupted build), the next run errors out with: fatal: destination path 'qemu' already exists and is not an empty directory. Wipe `qemu` before cloning so the build is repeatable from a dirty builddir. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Assisted-By: Claude <noreply@anthropic.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	18cee00df9	build: guard parallel races on build symlink and ~/.docker Parallel make jobs invoke kata-deploy-binaries-in-docker.sh concurrently and collide on two shared paths: ln: Already exists mkdir: /home/$USER/.docker: File exists Skip the symlink creation when the link is already in place. If a parallel job wins the create race in the cold-start window, fall back to re-checking that the link exists so a real ln failure (permission, disk full, etc.) still propagates rather than being silently swallowed. The `~/.docker` mkdir is guarded by a `[[ ! -d ]]` check that two processes can pass simultaneously, after which one bare `mkdir` fails. Switch to `mkdir -p` so the second invocation is a no-op. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	815ebc340d	build: add nvgpu-tarball target serial-targets now waits for the other BASE_TARBALLS items so the inner rootfs assembly runs with DEPS= against already-built artifacts. This also fixes a pre-existing race in the main flows where the outer parallel and inner -j 1 makes could both build kernel-tarball at the same time. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-26 21:55:08 +02:00
Zvonko Kaiser	6a367ab777	build: declare install-prebuilt-artifacts as .PHONY Leftover from #12954's rebase: the substantive sed-hack -> DEPS= change landed on main, but the .PHONY declaration didn't make it. Add it so the recipe always runs even if a stale `kata-artifacts` file exists in CWD. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Assisted-By: Claude <noreply@anthropic.com>	2026-05-26 21:55:08 +02:00
thebigbone	d9f2aa895e	containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0 containerd 2.2.0+ always imports /etc/containerd/conf.d/*.toml, so write kata-deploy runtime config there directly, avoiding modification of the main containerd config's imports array. Signed-off-by: thebigbone <pacman@duck.com>	2026-05-26 21:29:46 +02:00
Manuel Huber	e838cd7d8d	agent: compact EROFS overlay lowerdirs Use kata_types::mount::Mount for the final multi-layer EROFS overlay mount instead of calling baremount() directly. The mount helper detects overlay option strings close to the kernel mount data limit. When lowerdir entries share a common parent, it changes into that directory and rewrites lowerdir to relative paths. That avoids repeating the same long prefix for every layer. Multi-layer EROFS images can have many lower layers under /run/kata-containers/<cid>/multi-layer. Passing the raw absolute lowerdir list can exceed the mount option buffer and fail the final overlay mount, even after all layer devices mounted successfully. Reuse the helper so this path follows Kata's normal overlay mount handling, including lowerdir compaction before mount(2). Assisted-by: OpenAI Codex <codex@openai.com> Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-26 18:42:11 +00:00
Fabiano Fidêncio	d75a91ee09	Merge pull request #13114 from manuelh-dev/mahuber/nv-fix-policy-check tests: nvidia: No policy for runtime-rs path	2026-05-26 20:00:02 +02:00
Dan Mihai	c81dadaba1	Merge pull request #13064 from burgerdev/add-arp-neighbour agent: use rtnetlink to add ARP neighbour	2026-05-26 09:59:44 -07:00
Manuel Huber	6a715cf4f7	tests: nvidia: No policy for runtime-rs path The current if condition causes agent security policies to be attached to the non-TEE NVIDIA runtime-rs runtime class. While it is good to see that this works, it is not intended. Thus, replace the condition with is_confidential_gpu_hypervisor. Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2026-05-25 16:00:49 -07:00
Fabiano Fidêncio	25491fc20c	Merge pull request #13104 from kata-containers/topic/kata-deploy-build-as-an-artefact kata-deploy: prebuild payload-specific component artifacts	2026-05-25 22:56:55 +02:00
Fabiano Fidêncio	c65d64873b	kata-deploy: prebuild payload-specific component artifacts Build and publish the kata-deploy binary and CoCo guest-pull nydus snapshotter as dedicated per-arch artifacts, then consume those tarballs when assembling the kata-deploy image. This avoids rebuilding those components in the payload image path (where they would be built serially) and reduces overall CI build time. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-25 22:13:41 +02:00
Fabiano Fidêncio	3dc02a8604	Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only runtime-rs: Support erofs snapshotter with gpt vmdk mode	2026-05-25 16:29:59 +02:00
Zvonko Kaiser	6c6c5809f1	Merge pull request #13109 from fidencio/topic/build-validate-measured-rootfs-root-hashes-for-all-shims build: Validate measured-rootfs root hashes all shims	2026-05-25 15:58:35 +02:00
Zvonko Kaiser	aeadb1af35	Merge pull request #12948 from fidencio/topic/numa runtime (go): agent: Add NUMA support for QEMU	2026-05-25 15:33:14 +02:00
Alex Lyn	53699b0170	docs: Reset max_unmerged_layers = 0 for gpt+vmdk mode max_unmerged_layers = 1 applies only to fsmerge mode. Because containerd temporarily does not support fsmerge, reset it to the default of 0. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:13:28 +08:00
Alex Lyn	a359d13476	build: Validate measured-rootfs root hashes all shims The cached shim-v2 tarballs ship per-variant `root_hash_.txt` files embedded in the matching measured-rootfs image. Until now only shim-v2-rust validated those hashes against the freshly built rootfs images on a cache hit; shim-v2-go reused whatever was cached without checking, even though its bundled configuration files contain the `KERNELVERITYPARAMS_` values baked in at build time. When a PR changes the agent (and therefore the rootfs image and its dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache key stays the same and the stale tarball is reused. The resulting guest cmdline carries a verity hash that no longer matches the new rootfs image, so the VM panics very early in boot: device-mapper: verity: 254:1: metadata block 0 is corrupted erofs (device dm-0): cannot read erofs superblock Kernel panic - not syncing: VFS: Unable to mount root fs ... Generalize the shim-v2-rust cache validation so it also runs for shim-v2-go, push the per-variant root-hash sidecar files for both shims, and fall back to a full rebuild whenever the cached hash is missing or differs from the image one. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:12:52 +08:00
Alex Lyn	fd139a1143	kata-deploy: Reset max_unmerged_layers to "0" within erofs snapshotter We should set max_unmerged_layers = 0 for the erofs snapshotter gpt-vmdk mode. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	2036e66bc3	kata-agent: Integrate GPT partition support into multi-layer handler In GPT mode, all partitions share the same base block device, so resolving it once per uevent source and caching the result avoids redundant hotplug waits that would otherwise scale linearly with layer count. Layers are sorted by partition number before mounting to guarantee correct overlay lowerdir precedence regardless of the order the host emits Storage entries. This commit also removes the dead_code attributes, since the code is now in use. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	17fadde6d8	kata-agent: Add GPT partition utility functions The guest agent needs to resolve individual partition devices from a single GPT-partitioned block device, but the kernel does not always create partition nodes immediately after the base device appears, especially when another fd holds the device open during hot-plug. Add utility functions that handle two problems: (1) Mapping a base device path to its partition path following the kernel naming convention (bare suffix vs 'p' separator). (2) Ensuring the partition node exists before mount. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	8119a561ae	kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo This commit makes no functional change: all callers pass None, so every call still resolves the device via uevent exactly as before. It just prepares the multi-layer EROFS handler for GPT partition and dm-verity support by widening the wait_and_mount_layer() interface without changing behavior. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	0bd150e5f1	runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs When multiple EROFS layers are present, wrap them into a single GPT-partitioned virtual disk delivered via one VMDK descriptor and a single block device hotplug, which significantly reduces PCI bus slot usage compared with the previous one-device-per-layer approach that exhausts virtio-blk slots for large layer counts. The host detects multi-layer mounts, computes the GPT layout, generates head metadata plus a VMDK descriptor referencing all EROFS images, and hot-plugs the composite disk. Per-partition Storage entries are created with X-kata.gpt-partitioned and X-kata.partition-number options so the guest agent can resolve each layer to its partition device. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	c3b06af4c7	kata-types: Add gpt_disk module for GPT metadata generation Introduce gpt_disk.rs to compute GPT partition layouts and generate metadata files for multi-layer EROFS rootfs. The module creates GPT head metadata that are combined with EROFS layer images via VMDK descriptors, presenting a single GPT-partitioned virtual disk to the guest VM — each EROFS layer mapped to its own partition. The layout engine calculates LBA positions for an arbitrary number of EROFS layers, then writes a full protective-MBR + GPT image and extracts the head (MBR + primary GPT table) segments as standalone files for VMDK extent assembly. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	148810312d	runtime-rs: Refactor VMDK writer and erofs rootfs handling logic Restructure the erofs rootfs handler to support multi-layer GPT+VMDK mode where multiple EROFS layers are wrapped into a single virtual disk with a GPT partition table. Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK descriptor generation. Change erofs_storage from Option<Storage> to Vec<Storage> to hold per-layer metadata, and add GPT metadata path tracking for proper cleanup with path-traversal guards. Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks carrying many partitions. Pre-extract mkdir directives from overlay mounts before the main loop to avoid redundant option parsing. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	7086caaddf	kata-agent: Remove unused mode field from MkdirDirective The previously unused code carried the dead_code attribute and was never actually used, so remove it entirely. This removes the mode field from the MkdirDirective structure along with its relevant test cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	39c512bc36	kata-agent: Enhance virtio block matcher to reject partition uevents Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., /dev/vdaX) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. This commit aims to eliminate such bad cases. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Alex Lyn	56f05aa534	kata-agent: Enhance SCSI block device matcher to reject partition uevents Refactor ScsiBlockMatcher to only match whole-disk uevents. This prevents the matcher from incorrectly matching partition uevents (e.g., block/sdd/sdd9) which is critical for partitioned disks where partition uevents appear alongside whole-disk uevents. Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2026-05-25 19:08:31 +08:00
Fabiano Fidêncio	72be31c384	build: Validate measured-rootfs root hashes all shims The cached shim-v2 tarballs ship per-variant `root_hash_.txt` files embedded in the matching measured-rootfs image. Until now only shim-v2-rust validated those hashes against the freshly built rootfs images on a cache hit; shim-v2-go reused whatever was cached without checking, even though its bundled configuration files contain the `KERNELVERITYPARAMS_` values baked in at build time. When a PR changes the agent (and therefore the rootfs image and its dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache key stays the same and the stale tarball is reused. The resulting guest cmdline carries a verity hash that no longer matches the new rootfs image, so the VM panics very early in boot: device-mapper: verity: 254:1: metadata block 0 is corrupted erofs (device dm-0): cannot read erofs superblock Kernel panic - not syncing: VFS: Unable to mount root fs ... Generalize the shim-v2-rust cache validation so it also runs for shim-v2-go, push the per-variant root-hash sidecar files for both shims, and fall back to a full rebuild whenever the cached hash is missing or differs from the image one. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-25 11:04:08 +02:00
Fabiano Fidêncio	7ddea26137	Merge pull request #13086 from fvichot/flo-kata-monitor-fix kata-monitor: use full URI for connecting to containerd	2026-05-25 10:16:11 +02:00
Fabiano Fidêncio	513d87db7e	Merge pull request #13106 from fidencio/topic/runtime-rs-ensure-bios-is-passed-to-qemu-on-non-CC-cases runtime-rs: qemu: pass -bios for non-confidential guests	2026-05-25 09:56:11 +02:00
Fabiano Fidêncio	407a6946f2	Merge pull request #13077 from hdp617/fix-kata-deploy-build packaging: fix parallel kernel build race and kata-deploy script bugs	2026-05-25 09:53:38 +02:00
Fabiano Fidêncio	f763e9cca9	tests: Add NUMA topology / GPU placement tests to the NV CIs Add k8s-nvidia-numa.bats with five tests that validate NUMA behaviour on hosts where NUMA is configured by default (qemu-nvidia-gpu, qemu-nvidia-gpu-snp, qemu-nvidia-gpu-tdx): 1. Multi-node sandbox (large workload spanning all host NUMA nodes): - Guest NUMA node count matches host - Guest vCPU distribution is balanced across nodes (max-min <= 1) - Guest memory is distributed across NUMA nodes - Host-side vCPU pinning is balanced across NUMA nodes 2. Right-sized single-node sandbox (small workload fitting one node): - Guest collapses to a single NUMA node - All host vCPU threads pinned to that one NUMA node 3. GPU passthrough with VFIO, multi-node: - Guest NUMA topology is balanced (same as test 1) - Guest GPU's NUMA node matches the host GPU's NUMA node (resolved via the vfio-pci,host=<BDF> from the QEMU command line and /sys/bus/pci/devices/<BDF>/numa_node) - QEMU command line contains pxb-pcie and policy=bind - Host vCPU pinning is balanced 4. GPU passthrough with VFIO, right-sized single-node: small workload plus GPU that fits in a single host NUMA node: - Guest collapses to a single NUMA node - The chosen node is the GPU's host NUMA node, not just any node that fits, verified by matching host-nodes= in the memory backend and pxb-pcie numa_node= against the GPU's host node - Guest GPU reports the same NUMA node as the host GPU 5. Explicit numa_mapping in the runtime TOML (QEMU-only): - Drops a config.d/ fragment that sets numa_mapping = ["1"], so the auto-derive + right-sizing path is bypassed entirely - Guest sees exactly 1 NUMA node - QEMU memory backend is bound to host node 1 (host-nodes=1, policy=bind), not host node 0 - Host-side vCPU threads land on host node 1 - Drop-in is removed on teardown so subsequent tests are unaffected Guest-side checks use a dedicated container image (quay.io/kata-containers/numa) that reads sysfs and prints results to stdout (no kubectl exec or CoCo policy overrides needed). Host-side checks (crictl, pgrep, taskset) run directly on the host via sudo; a standalone numa-pinning-check.sh script handles the vCPU thread affinity inspection. The config.d/ helpers used by test 5 are runtime-agnostic (probe Go vs runtime-rs layout on disk) but the test is gated to qemu-* shims since runtime-rs does not yet implement NUMA. Skips cleanly on single-NUMA hosts, unsupported hypervisors, or when no nvidia.com/pgpu resources are available (GPU tests only). Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Assisted-by: Cursor <cursoragent@cursor.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	20705470e9	docs: Add NUMA support guide for Kata Containers with QEMU Add a step-by-step how-to guide covering host inspection, Kata NUMA drop-in setup (via kata-deploy Helm and manual config.d/), pod deployment examples, and guest/host verification procedures. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	8787da13a9	agent: Add NUMA-aware PCI path parsing Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path format "root_complex/bus/device" (e.g. "10/00/02") in addition to the legacy "bus/device" format, defaulting to root complex "00" for backward compatibility. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	1cbe930fc9	runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices When NUMA placement is active and VFIO devices are cold-plugged, create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has devices. Each pxb-pcie carries a numa_node property that gives the guest kernel correct NUMA affinity for all PCI devices beneath it. Root ports are created on each pxb-pcie bus instead of pcie.0, and VFIODevice.Attach() assigns each device to the root port on its host NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0. NUMA placement is "active" when there is more than one guest NUMA node OR a single guest node mapped to a specific host node (the latter happens when maybeRightSizeAutoNUMA() collapses a multi-node sandbox to the GPU's host NUMA node). In both cases buildNUMATopology() also emits the matching memory-backend-ram,host-nodes=,policy=bind entries so guest memory is sourced from the right host node. So pxb-pcie can never capture a leaf virtio-pci device as the default bus, every virtio-pci device emitter (NetDevice, VSOCK, vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when the machine actually exposes a pcie.0 root. Detection is done via a new hasPCIeRoot() helper that returns true only for q35/virt machine types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW transport) and microvm (no PCI) intentionally skip the pin to avoid "Bus 'pcie.0' not found" at startup. This is the only QEMU mechanism that works for both regular and confidential (TDX/SNP) guests, as it operates through the PCI bus hierarchy rather than ACPI table injection. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	15292da217	config: Enable NUMA by default for nvidia-gpu configurations Enable enable_numa=true in the three nvidia-gpu QEMU configuration templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since buildNUMATopology() returns nil when there is only one node. On multi-NUMA hosts it ensures GPU memory accesses are NUMA-local. Add documentation to all QEMU config templates explaining the VFIO device NUMA placement validation that occurs when NUMA is enabled. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-05-24 22:00:46 +02:00
Fabiano Fidêncio	feeb5d8ecc	runtime-rs: Fix vCPU pinning race with backoff retry QEMU can report fewer vCPU threads during early startup, causing partial affinity setup. Let's retry with exponential backoff until the expected thread count is visible, then continue with best-effort pinning if the window is exhausted. Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>	2026-05-24 22:00:46 +02:00
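
The retry strategy described in commit feeb5d8ecc (poll until the expected vCPU thread count is visible, backing off exponentially, then continue best-effort) might look roughly like the sketch below. This is a Python illustration only, not the runtime-rs code: the function name, the `count_vcpu_threads` probe, and the delay and attempt values are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_vcpu_threads(
    count_vcpu_threads: Callable[[], int],  # hypothetical probe for visible vCPU threads
    expected: int,
    initial_delay: float = 0.01,  # assumed value, not taken from the commit
    max_attempts: int = 6,        # assumed value, not taken from the commit
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until `expected` threads are visible, doubling the delay each retry.

    Returns the last observed count. If it is still below `expected`, the
    caller would pin whichever threads are visible (best effort).
    """
    delay = initial_delay
    seen = count_vcpu_threads()
    for _ in range(max_attempts - 1):
        if seen >= expected:
            break
        sleep(delay)
        delay *= 2  # exponential backoff between probes
        seen = count_vcpu_threads()
    return seen
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would compare the returned count with `expected` to decide whether pinning is complete or partial.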

19166 Commits
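
Commit 17fadde6d8 mentions mapping a base device path to its partition path using the kernel naming convention (bare suffix vs 'p' separator). A minimal sketch of that convention, assuming the usual Linux rule that a 'p' is inserted when the disk name ends in a digit; the function name `partition_path` is hypothetical and this is not the agent's Rust implementation:

```python
def partition_path(base_dev: str, part_num: int) -> str:
    """Return the partition device path for a whole-disk device path.

    Linux inserts a 'p' separator when the disk name ends in a digit
    (for example nvme0n1 -> nvme0n1p1); otherwise the partition number
    is appended directly (for example vda -> vda1).
    """
    if part_num < 1:
        raise ValueError("partition numbers start at 1")
    sep = "p" if base_dev[-1:].isdigit() else ""
    return f"{base_dev}{sep}{part_num}"


if __name__ == "__main__":
    print(partition_path("/dev/vda", 1))
    print(partition_path("/dev/nvme0n1", 2))
```

The sketch covers only the name mapping; it does not wait for the partition node to appear, which the commit describes as a separate problem.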