19166 Commits

Author SHA1 Message Date
Fabiano Fidêncio
bddf1ecab4 build: stop producing cloud-hypervisor-glibc artifacts
Drop cloud-hypervisor-glibc from local and CI kata-deploy build targets
now that Azure CLH uses the standard cloud-hypervisor artifact set.

This removes obsolete build matrix entries and installer target
handling.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
81ce51a9aa ci: target Azure CLH runtimes directly in AKS tests
Switch AKS Mariner matrix entries to clh-azure handlers and remove the
temporary host-OS based helm value overrides.

Update integration test wiring and required test labels so CI tracks the
new runtime names.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
8c3a2c1a95 kata-deploy: register clh-azure shim families
Add clh-azure and clh-azure-runtime-rs as first-class shims across
installer logic, helm defaults, runtimeclass overhead mapping, and shim
component catalogs.

This aligns deploy payload selection with the new native Azure-specific
CLH configs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
f36c383b4f runtime: generate dedicated CLH Azure config variants
Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.

This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
a423cf9526 Merge pull request #13087 from bpradipt/landlock
kernel: Enable landlock LSM
2026-05-27 17:34:47 +02:00
Pradipta Banerjee
1487eaaaa2 kernel: Enable landlock LSM
Allows using landlock LSM for the container process

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
2026-05-27 13:33:46 +02:00
Fabiano Fidêncio
5adfb27297 Merge pull request #13118 from PiotrProkop/fix-missing-cwd
agent: restore process CWD auto-creation
2026-05-27 13:32:05 +02:00
Fabiano Fidêncio
614dff4bfc Merge pull request #13119 from manuelh-dev/mahuber/erofs-multi-layer-fix
agent: compact EROFS overlay lowerdirs
2026-05-27 11:27:46 +02:00
Fabiano Fidêncio
238dd51039 Merge pull request #13108 from thebigbone/containerd-config
containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0
2026-05-27 10:14:51 +02:00
PiotrProkop
60a2e27f02 agent: Restore process CWD auto-creation
Commit b56313472 ("agent: Align agent OCI spec with oci-spec-rs",
PR #9944) inverted the condition guarding the create_dir_all call
for process.cwd: the leading `!` was dropped during the refactor.
As a result, the CWD is created only when process.cwd is the empty
string.

When the guest then runs chdir(process.cwd) and the CWD doesn't exist,
it returns ENOENT.  The agent propagates that to the shim, which
surfaces it to containerd as "failed to create shim task: ENOENT:
No such file or directory" — indistinguishable from a missing
argv[0].
This regressed the original fix in PR #2375 (Fixes #2374), which
deliberately mirrored runc's behavior.  Put the `!` back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-05-27 09:59:15 +02:00
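The guard described in the commit above can be sketched as follows. This is only an illustration in Python, not the agent's Rust code; the function name `ensure_cwd` is hypothetical. It shows the intended condition: create the directory when `process.cwd` is non-empty, rather than only when it is empty.

```python
import os
import tempfile


def ensure_cwd(cwd: str) -> bool:
    """Create cwd (and its parents) when it is non-empty; report whether we did."""
    # The regression effectively tested `cwd == ""` here, so real paths
    # were never created and a later chdir() failed with ENOENT.
    if cwd != "":
        os.makedirs(cwd, exist_ok=True)
        return True
    return False


root = tempfile.mkdtemp()
target = os.path.join(root, "app", "work")
created = ensure_cwd(target)   # non-empty path: directory is created
skipped = ensure_cwd("")       # empty path: nothing to create
```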
Fabiano Fidêncio
f1c100797b Merge pull request #12955 from zvonkok/nvgpu-target
build: add nvgpu-tarball target
2026-05-27 09:44:37 +02:00
Fabiano Fidêncio
64056add0d build: add passthrough mode to kata-deploy-merge-builds
kata-deploy now unpacks individual component tarballs itself, so the
final `kata-static.tar.zst` no longer needs to be a merged filesystem
payload. Merging everything has two downsides for that flow:

  - It pulls in everything kept on disk under build/, which previously
    forced us to also drop agent/busybox/coco-guest-components/nydus
    from the build set to keep them out of the final tarball.
  - The merged tarball duplicates content kata-deploy will repack on
    its own anyway.

Add a `passthrough` mode to kata-deploy-merge-builds.sh that, instead
of untarring each `kata-static-*.tar.zst` into a single filesystem
tree, copies the selected component tarballs into the final tarball
as-is. The existing `merge` mode remains the default to preserve the
non-kata-deploy install paths (e.g. `make install-tarball`).

Wire `nvgpu-tarball` to the new mode via `FINAL_TARBALL_MERGE_MODE=
passthrough`, paired with the existing `FINAL_TARBALL_INPUTS`
allowlist. This lets us keep agent/busybox/coco as build prereqs of
the GPU rootfs while shipping a final tarball that only contains the
NVIDIA-relevant components.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
9b85bff2b4 build: don't double-prefix absolute versions.yaml path in merge-builds
The Makefile passes $(MK_DIR)/../../../../versions.yaml — already an
absolute path — to kata-deploy-merge-builds.sh. The script then
unconditionally prepended ${PWD}/, producing a malformed path like:

  /repo//repo/tools/.../local-build//../../../../versions.yaml

which made cp fail with "No such file or directory" at the merge-builds
step (the very last step of `make nvgpu-tarball`).

Only prepend ${PWD}/ when the input is relative — that preserves the
original fix for the pushd-changes-cwd issue (commit ae6e8d2b3) without
mangling absolute paths from Makefile callers.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
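A minimal sketch of the "prefix only when relative" rule from the commit above, written in Python rather than the shell used by the real script. The helper name `resolve_versions_path` and the example paths are assumptions made for illustration.

```python
import os


def resolve_versions_path(path: str, cwd: str) -> str:
    """Prefix cwd only when path is relative; absolute paths pass through."""
    if os.path.isabs(path):
        return path
    return os.path.join(cwd, path)


absolute = resolve_versions_path("/repo/versions.yaml", "/repo/build")
relative = resolve_versions_path("versions.yaml", "/repo/build")
```

Unconditionally joining `cwd` and `path` is what produced the doubled prefix; checking `os.path.isabs` first keeps both caller styles working.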
Zvonko Kaiser
5aa6229eba build: group parallel build output by target
With `make all -j N` running multiple tarballs concurrently and silent
mode redirecting each build's stdio to its per-target log, a failing
target's "Failed to build: <name>, logs:" banner gets interleaved with
other in-flight jobs' output, making it hard to tell which target
failed.

Pass `--output-sync=target` to the recursive make so each sub-make's
output is buffered and emitted as one block when the target finishes,
keeping the failure banner contiguous with its log dump.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
3be370d2d6 qemu: clean stale clone before fetching sources
build-qemu.sh runs in the per-target builddir (e.g.
build/qemu-tarball/builddir/), which persists across runs. If a previous
build left the cloned `qemu` tree behind (e.g. after an interrupted
build), the next run errors out with:

  fatal: destination path 'qemu' already exists and is not an empty
  directory.

Wipe `qemu` before cloning so the build is repeatable from a dirty
builddir.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
18cee00df9 build: guard parallel races on build symlink and ~/.docker
Parallel make jobs invoke kata-deploy-binaries-in-docker.sh concurrently
and collide on two shared paths:

  ln: Already exists
  mkdir: /home/$USER/.docker: File exists

Skip the symlink creation when the link is already in place. If a
parallel job wins the create race in the cold-start window, fall back to
re-checking that the link exists so a real ln failure (permission, disk
full, etc.) still propagates rather than being silently swallowed.

The `~/.docker` mkdir is guarded by a `[[ ! -d ]]` check that two
processes can pass simultaneously, after which one bare `mkdir` fails.
Switch to `mkdir -p` so the second invocation is a no-op.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-26 21:55:08 +02:00
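The two race guards from the commit above can be illustrated with a short Python sketch. The real change is in a shell script; the helper names `ensure_symlink` and `ensure_dir` are hypothetical and only mirror the described behaviour.

```python
import os
import tempfile


def ensure_symlink(target: str, link: str) -> None:
    """Create link -> target, tolerating a concurrent creator."""
    if os.path.islink(link):
        return
    try:
        os.symlink(target, link)
    except OSError:
        # Another job may have won the race. Re-check before giving up so
        # genuine failures (permissions, disk full) still propagate.
        if not os.path.islink(link):
            raise


def ensure_dir(path: str) -> None:
    """Equivalent of `mkdir -p`: a second call is a no-op."""
    os.makedirs(path, exist_ok=True)


root = tempfile.mkdtemp()
link = os.path.join(root, "build")
docker_dir = os.path.join(root, ".docker")
for _ in range(2):  # calling twice stands in for two parallel jobs
    ensure_symlink(root, link)
    ensure_dir(docker_dir)
```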
Zvonko Kaiser
815ebc340d build: add nvgpu-tarball target
serial-targets now waits for the other BASE_TARBALLS items so the
inner rootfs assembly runs with DEPS= against already-built
artifacts. This also fixes a pre-existing race in the main flows
where the outer parallel and inner -j 1 makes could both build
kernel-tarball at the same time.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-26 21:55:08 +02:00
Zvonko Kaiser
6a367ab777 build: declare install-prebuilt-artifacts as .PHONY
Leftover from #12954's rebase: the substantive sed-hack -> DEPS= change
landed on main, but the .PHONY declaration didn't make it. Add it so
the recipe always runs even if a stale `kata-artifacts` file exists in
CWD.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Assisted-By: Claude <noreply@anthropic.com>
2026-05-26 21:55:08 +02:00
thebigbone
d9f2aa895e containerd: use /etc/containerd/conf.d/ drop-in for containerd >= 2.2.0
containerd 2.2.0+ always imports /etc/containerd/conf.d/*.toml,
so write kata-deploy runtime config there directly, avoiding
modification of the main containerd config's imports array.

Signed-off-by: thebigbone <pacman@duck.com>
2026-05-26 21:29:46 +02:00
Manuel Huber
e838cd7d8d agent: compact EROFS overlay lowerdirs
Use kata_types::mount::Mount for the final multi-layer EROFS
overlay mount instead of calling baremount() directly.

The mount helper detects overlay option strings close to the kernel
mount data limit. When lowerdir entries share a common parent, it
changes into that directory and rewrites lowerdir to relative paths.
That avoids repeating the same long prefix for every layer.

Multi-layer EROFS images can have many lower layers under
/run/kata-containers/<cid>/multi-layer. Passing the raw absolute
lowerdir list can exceed the mount option buffer and fail the final
overlay mount, even after all layer devices mounted successfully.

Reuse the helper so this path follows Kata's normal overlay mount
handling, including lowerdir compaction before mount(2).

Assisted-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-26 18:42:11 +00:00
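The lowerdir compaction described in the commit above can be sketched as below. This is an illustrative Python version, not the kata_types helper itself: the function name `compact_lowerdirs` is hypothetical, the literal `cid` stands in for a container ID, and the sketch compacts unconditionally, whereas the real helper only does so when the option string approaches the kernel limit.

```python
import os


def compact_lowerdirs(lowerdirs: list) -> tuple:
    """Return (common_parent, lowerdir option using paths relative to it)."""
    parent = os.path.commonpath(lowerdirs)
    relative = [os.path.relpath(d, parent) for d in lowerdirs]
    return parent, "lowerdir=" + ":".join(relative)


layers = [f"/run/kata-containers/cid/multi-layer/layer{i}" for i in range(3)]
parent, option = compact_lowerdirs(layers)
```

The caller would then change into `parent` before calling mount, so the relative entries resolve correctly while the long shared prefix appears only once.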
Fabiano Fidêncio
d75a91ee09 Merge pull request #13114 from manuelh-dev/mahuber/nv-fix-policy-check
tests: nvidia: No policy for runtime-rs path
2026-05-26 20:00:02 +02:00
Dan Mihai
c81dadaba1 Merge pull request #13064 from burgerdev/add-arp-neighbour
agent: use rtnetlink to add ARP neighbour
2026-05-26 09:59:44 -07:00
Manuel Huber
6a715cf4f7 tests: nvidia: No policy for runtime-rs path
The current if condition causes agent security policies to be
attached to the non-TEE NVIDIA runtime-rs runtime class. While
it is good to see that this works, it is not intended. Thus,
replace the condition with is_confidential_gpu_hypervisor.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2026-05-25 16:00:49 -07:00
Fabiano Fidêncio
25491fc20c Merge pull request #13104 from kata-containers/topic/kata-deploy-build-as-an-artefact
kata-deploy: prebuild payload-specific component artifacts
2026-05-25 22:56:55 +02:00
Fabiano Fidêncio
c65d64873b kata-deploy: prebuild payload-specific component artifacts
Build and publish the kata-deploy binary and CoCo guest-pull nydus
snapshotter as dedicated per-arch artifacts, then consume those tarballs
when assembling the kata-deploy image.

This avoids rebuilding those components in the payload image path
(where they would be built serially) and reduces overall CI build time.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-25 22:13:41 +02:00
Fabiano Fidêncio
3dc02a8604 Merge pull request #13085 from Apokleos/erofs-gpt-vmdk-only
runtime-rs: Support erofs snapshotter with gpt vmdk mode
2026-05-25 16:29:59 +02:00
Zvonko Kaiser
6c6c5809f1 Merge pull request #13109 from fidencio/topic/build-validate-measured-rootfs-root-hashes-for-all-shims
build: Validate measured-rootfs root hashes all shims
2026-05-25 15:58:35 +02:00
Zvonko Kaiser
aeadb1af35 Merge pull request #12948 from fidencio/topic/numa
runtime (go): agent: Add NUMA support for QEMU
2026-05-25 15:33:14 +02:00
Alex Lyn
53699b0170 docs: Reset max_unmerged_layers = 0 for gpt+vmdk mode
max_unmerged_layers = 1 is only needed for fsmerge mode. Since containerd
temporarily does not support fsmerge, reset it to the default of 0.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:13:28 +08:00
Alex Lyn
a359d13476 build: Validate measured-rootfs root hashes all shims
The cached shim-v2 tarballs ship per-variant `root_hash_*.txt` files
embedded in the matching measured-rootfs image. Until now only
shim-v2-rust validated those hashes against the freshly built rootfs
images on a cache hit; shim-v2-go reused whatever was cached without
checking, even though its bundled configuration files contain the
`KERNELVERITYPARAMS_*` values baked in at build time.

When a PR changes the agent (and therefore the rootfs image and its
dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache
key stays the same and the stale tarball is reused. The resulting
guest cmdline carries a verity hash that no longer matches the new
rootfs image, so the VM panics very early in boot:

    device-mapper: verity: 254:1: metadata block 0 is corrupted
    erofs (device dm-0): cannot read erofs superblock
    Kernel panic - not syncing: VFS: Unable to mount root fs ...

Generalize the shim-v2-rust cache validation so it also runs for
shim-v2-go, push the per-variant root-hash sidecar files for both
shims, and fall back to a full rebuild whenever the cached hash is
missing or differs from the image one.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:12:52 +08:00
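The cache decision described in the commit above reduces to a small comparison, sketched here in Python. The function name `cached_tarball_usable` and the hash strings are hypothetical; the real check lives in the build scripts and reads the `root_hash_*.txt` sidecar files.

```python
from typing import Optional


def cached_tarball_usable(cached_hash: Optional[str], image_hash: str) -> bool:
    """Reuse the cached shim tarball only if its root hash matches the image."""
    if not cached_hash:
        # Missing sidecar file: fall back to a full rebuild.
        return False
    return cached_hash.strip() == image_hash.strip()


match = cached_tarball_usable("abc123\n", "abc123")
missing = cached_tarball_usable(None, "abc123")
stale = cached_tarball_usable("abc123", "def456")
```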
Alex Lyn
fd139a1143 kata-deploy: Reset max_unmerged_layers to "0" within erofs snapshotter
Set max_unmerged_layers = 0 for the erofs snapshotter in gpt-vmdk
mode.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
2036e66bc3 kata-agent: Integrate GPT partition support into multi-layer handler
In GPT mode, all partitions share the same base block device, so
resolving it once per uevent source and caching the result avoids
redundant hotplug waits that would otherwise scale linearly with
layer count.

Layers are sorted by partition number before mounting to guarantee
correct overlay lowerdir precedence regardless of the order the host
emits Storage entries.

Also remove the dead_code attributes now that this code is in use.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
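The two ideas in the commit above, resolving the shared base device once and ordering layers by partition number, can be sketched as follows. This is an assumption-laden Python illustration: the `Layer` type, its field names, and the `resolve` callback are hypothetical stand-ins for the agent's Rust types and uevent wait.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    source: str            # identifies the base block device (hypothetical field)
    partition_number: int  # GPT partition holding this layer (hypothetical field)


def resolve_base_devices(layers, resolve):
    """Resolve each distinct source once and reuse the cached result."""
    cache = {}
    for layer in layers:
        if layer.source not in cache:
            cache[layer.source] = resolve(layer.source)
    return cache


def order_layers(layers):
    """Sort by partition number so overlay lowerdir precedence is stable."""
    return sorted(layers, key=lambda layer: layer.partition_number)


calls = []
layers = [Layer("disk0", 3), Layer("disk0", 1), Layer("disk0", 2)]
devices = resolve_base_devices(layers, lambda s: calls.append(s) or "/dev/vda")
ordered = [layer.partition_number for layer in order_layers(layers)]
```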
Alex Lyn
17fadde6d8 kata-agent: Add GPT partition utility functions
The guest agent needs to resolve individual partition devices from a
single GPT-partitioned block device, but the kernel does not always
create partition nodes immediately after the base device appears,
especially when another fd holds the device open during hot-plug.

Add utility functions that handle two problems:
(1) Mapping a base device path to its partition path following the
kernel naming convention (bare suffix vs 'p' separator).
(2) Ensuring the partition node exists before mount.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
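The naming convention mentioned in item (1) of the commit above can be illustrated with a short Python sketch. The function name `partition_path` is hypothetical; the rule itself is the usual kernel one, where a `p` separator is inserted when the disk name ends in a digit.

```python
def partition_path(base: str, number: int) -> str:
    """Map a base block device path to the path of one of its partitions."""
    # Disk names ending in a digit (e.g. nvme0n1) need a 'p' separator so
    # the partition number is not confused with the disk name.
    separator = "p" if base[-1].isdigit() else ""
    return f"{base}{separator}{number}"


bare = partition_path("/dev/vda", 2)
with_p = partition_path("/dev/nvme0n1", 2)
```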
Alex Lyn
8119a561ae kata-agent: Refactor wait_and_mount_layer to return LayerMountInfo
This commit makes no functional change: all callers pass None, so
every call still resolves the device via uevent exactly as before.

It only prepares the multi-layer EROFS handler for GPT partition and
dm-verity support by widening the wait_and_mount_layer() interface
without changing behavior.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
0bd150e5f1 runtime-rs: Integrate GPT+VMDK mode for multi-layer EROFS rootfs
When multiple EROFS layers are present, wrap them into a single
GPT-partitioned virtual disk delivered via one VMDK descriptor and a
single block device hotplug. This significantly reduces PCI bus slot
usage compared with the previous one-device-per-layer approach, which
exhausts virtio-blk slots for large layer counts.

The host detects multi-layer mounts, computes the GPT layout, generates
head metadata plus a VMDK descriptor referencing all EROFS images, and
hot-plugs the composite disk. Per-partition Storage entries are created
with X-kata.gpt-partitioned and X-kata.partition-number options so the
guest agent can resolve each layer to its partition device.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
c3b06af4c7 kata-types: Add gpt_disk module for GPT metadata generation
Introduce gpt_disk.rs to compute GPT partition layouts and generate
metadata files for multi-layer EROFS rootfs. The module creates GPT
head metadata that is combined with EROFS layer images via VMDK
descriptors, presenting a single GPT-partitioned virtual disk to the
guest VM — each EROFS layer mapped to its own partition.

The layout engine calculates LBA positions for an arbitrary number of
EROFS layers, then writes a full protective-MBR + GPT image and extracts
the head (MBR + primary GPT table) segments as standalone files for
VMDK extent assembly.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
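A rough sketch of the kind of LBA arithmetic the commit above describes, in Python. It assumes 512-byte sectors and the standard 128-entry GPT table, which puts the first usable LBA at 34. The 2048-sector alignment and the function name `layout_partitions` are assumptions for illustration and may not match what gpt_disk.rs actually does.

```python
SECTOR_SIZE = 512
FIRST_USABLE_LBA = 34  # LBA 0: protective MBR, 1: GPT header, 2-33: entries


def layout_partitions(layer_sizes, align=2048):
    """Return inclusive (start_lba, end_lba) pairs, one per layer image."""
    partitions = []
    next_lba = FIRST_USABLE_LBA
    for size in layer_sizes:
        start = -(-next_lba // align) * align  # round up to the alignment
        sectors = -(-size // SECTOR_SIZE)      # round up to whole sectors
        end = start + sectors - 1
        partitions.append((start, end))
        next_lba = end + 1
    return partitions


partitions = layout_partitions([1024 * 1024, 1000])
```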
Alex Lyn
148810312d runtime-rs: Refactor VMDK writer and erofs rootfs handling logic
Restructure the erofs rootfs handler to support multi-layer GPT+VMDK
mode where multiple EROFS layers are wrapped into a single virtual
disk with a GPT partition table.

Extract VmdkDescriptorWriter as a reusable struct for atomic VMDK
descriptor generation. Change erofs_storage from Option<Storage> to
Vec<Storage> to hold per-layer metadata, and add GPT metadata path
tracking for proper cleanup with path-traversal guards.

Bump MAX_VIRTIO_BLK_DEVICES from 10 to 127 to accommodate GPT disks
carrying many partitions. Pre-extract mkdir directives from overlay
mounts before the main loop to avoid redundant option parsing.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
7086caaddf kata-agent: Remove unused mode field from MkdirDirective
The previously unused code was marked with the dead_code attribute
and is never actually used, so remove it entirely.

Remove the mode field from the MkdirDirective structure along with
its relevant test cases.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
39c512bc36 kata-agent: Enhance virtio block matcher to reject partition uevents
Enhance VirtioBlkPciMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., /dev/vdaX) which is critical for partitioned disks where
partition uevents appear alongside whole-disk uevents.

This commit eliminates such false matches.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
Alex Lyn
56f05aa534 kata-agent: Enhance SCSI block device matcher to reject partition uevents
Refactor ScsiBlockMatcher to only match whole-disk uevents. This
prevents the matcher from incorrectly matching partition uevents
(e.g., block/sdd/sdd9) which is critical for partitioned disks
where partition uevents appear alongside whole-disk uevents.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2026-05-25 19:08:31 +08:00
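The whole-disk check introduced for the virtio-blk and SCSI matchers above can be approximated with the sketch below. It assumes the usual sysfs layout, where a disk sits directly under a `block` directory and its partitions are nested one level deeper. The function name `is_whole_disk_devpath` and the example device paths are hypothetical, and the real matchers may test different fields.

```python
def is_whole_disk_devpath(devpath: str) -> bool:
    """True when the uevent devpath names a disk, not one of its partitions."""
    parts = devpath.rstrip("/").split("/")
    # Whole disk:  .../block/sdd
    # Partition:   .../block/sdd/sdd9
    return len(parts) >= 2 and parts[-2] == "block"


disk = is_whole_disk_devpath("/devices/pci0000:00/0000:00:03.0/virtio1/block/vda")
virtio_part = is_whole_disk_devpath(
    "/devices/pci0000:00/0000:00:03.0/virtio1/block/vda/vda1"
)
scsi_part = is_whole_disk_devpath("block/sdd/sdd9")
```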
Fabiano Fidêncio
72be31c384 build: Validate measured-rootfs root hashes all shims
The cached shim-v2 tarballs ship per-variant `root_hash_*.txt` files
embedded in the matching measured-rootfs image. Until now only
shim-v2-rust validated those hashes against the freshly built rootfs
images on a cache hit; shim-v2-go reused whatever was cached without
checking, even though its bundled configuration files contain the
`KERNELVERITYPARAMS_*` values baked in at build time.

When a PR changes the agent (and therefore the rootfs image and its
dm-verity hash) but does not touch `src/runtime`, the shim-v2-go cache
key stays the same and the stale tarball is reused. The resulting
guest cmdline carries a verity hash that no longer matches the new
rootfs image, so the VM panics very early in boot:

    device-mapper: verity: 254:1: metadata block 0 is corrupted
    erofs (device dm-0): cannot read erofs superblock
    Kernel panic - not syncing: VFS: Unable to mount root fs ...

Generalize the shim-v2-rust cache validation so it also runs for
shim-v2-go, push the per-variant root-hash sidecar files for both
shims, and fall back to a full rebuild whenever the cached hash is
missing or differs from the image one.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-25 11:04:08 +02:00
Fabiano Fidêncio
7ddea26137 Merge pull request #13086 from fvichot/flo-kata-monitor-fix
kata-monitor: use full URI for connecting to containerd
2026-05-25 10:16:11 +02:00
Fabiano Fidêncio
513d87db7e Merge pull request #13106 from fidencio/topic/runtime-rs-ensure-bios-is-passed-to-qemu-on-non-CC-cases
runtime-rs: qemu: pass -bios for non-confidential guests
2026-05-25 09:56:11 +02:00
Fabiano Fidêncio
407a6946f2 Merge pull request #13077 from hdp617/fix-kata-deploy-build
packaging: fix parallel kernel build race and kata-deploy script bugs
2026-05-25 09:53:38 +02:00
Fabiano Fidêncio
f763e9cca9 tests: Add NUMA topology / GPU placement tests to the NV CIs
Add k8s-nvidia-numa.bats with five tests that validate NUMA behaviour
on hosts where NUMA is configured by default (qemu-nvidia-gpu,
qemu-nvidia-gpu-snp, qemu-nvidia-gpu-tdx):

1. Multi-node sandbox (large workload spanning all host NUMA nodes):
   - Guest NUMA node count matches host
   - Guest vCPU distribution is balanced across nodes (max-min <= 1)
   - Guest memory is distributed across NUMA nodes
   - Host-side vCPU pinning is balanced across NUMA nodes

2. Right-sized single-node sandbox (small workload fitting one node):
   - Guest collapses to a single NUMA node
   - All host vCPU threads pinned to that one NUMA node

3. GPU passthrough with VFIO, multi-node:
   - Guest NUMA topology is balanced (same as test 1)
   - Guest GPU's NUMA node matches the host GPU's NUMA node
     (resolved via the vfio-pci,host=<BDF> from the QEMU command
     line and /sys/bus/pci/devices/<BDF>/numa_node)
   - QEMU command line contains pxb-pcie and policy=bind
   - Host vCPU pinning is balanced

4. GPU passthrough with VFIO, right-sized single-node: small workload
   plus GPU that fits in a single host NUMA node:
   - Guest collapses to a single NUMA node
   - The chosen node is the GPU's host NUMA node, not just any node
     that fits — verified by matching host-nodes= in the memory
     backend and pxb-pcie numa_node= against the GPU's host node
   - Guest GPU reports the same NUMA node as the host GPU

5. Explicit numa_mapping in the runtime TOML (QEMU-only):
   - Drops a config.d/ fragment that sets numa_mapping = ["1"], so the
     auto-derive + right-sizing path is bypassed entirely
   - Guest sees exactly 1 NUMA node
   - QEMU memory backend is bound to host node 1 (host-nodes=1,
     policy=bind), not host node 0
   - Host-side vCPU threads land on host node 1
   - Drop-in is removed on teardown so subsequent tests are unaffected

Guest-side checks use a dedicated container image
(quay.io/kata-containers/numa) that reads sysfs and prints results to
stdout — no kubectl exec or CoCo policy overrides needed.

Host-side checks (crictl, pgrep, taskset) run directly on the host
via sudo; a standalone numa-pinning-check.sh script handles the vCPU
thread affinity inspection.  The config.d/ helpers used by test 5 are
runtime-agnostic (probe Go vs runtime-rs layout on disk) but the test
is gated to qemu-* shims since runtime-rs does not yet implement
NUMA.

Skips cleanly on single-NUMA hosts, unsupported hypervisors, or when
no nvidia.com/pgpu resources are available (GPU tests only).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
20705470e9 docs: Add NUMA support guide for Kata Containers with QEMU
Add a step-by-step how-to guide covering host inspection, Kata NUMA
drop-in setup (via kata-deploy Helm and manual config.d/), pod
deployment examples, and guest/host verification procedures.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
8787da13a9 agent: Add NUMA-aware PCI path parsing
Extend pcipath_from_dev_tree_path() to support the full NUMA-aware path
format "root_complex/bus/device" (e.g. "10/00/02") in addition to the
legacy "bus/device" format, defaulting to root complex "00" for backward
compatibility.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
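The two path formats accepted after the commit above can be illustrated as follows. This Python sketch is not the agent's pcipath_from_dev_tree_path(); the function name `parse_dev_tree_path` and its tuple return shape are assumptions chosen to show the default root complex.

```python
def parse_dev_tree_path(path: str) -> tuple:
    """Return (root_complex, bus, device) from a device-tree path."""
    parts = path.split("/")
    if len(parts) == 3:
        # NUMA-aware format: root_complex/bus/device
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        # Legacy format: bus/device, root complex defaults to "00"
        return "00", parts[0], parts[1]
    raise ValueError(f"unsupported device tree path: {path!r}")


numa_aware = parse_dev_tree_path("10/00/02")
legacy = parse_dev_tree_path("00/02")
```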
Fabiano Fidêncio
1cbe930fc9 runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices.  Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.

Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge.  Non-VFIO devices remain on pcie.0.

NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node).  In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.

So pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root.  Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.

This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
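The explicit bus pin described in the commit above can be sketched like this. The real helper, hasPCIeRoot(), is Go; this Python version, the machine-type strings in the set, and the example device string are assumptions based on the commit text.

```python
PCIE_ROOT_MACHINES = {"q35", "virt"}  # machine types that expose pcie.0


def has_pcie_root(machine_type: str) -> bool:
    """True only for machine types with a pcie.0 root bus."""
    return machine_type in PCIE_ROOT_MACHINES


def virtio_pci_device_args(device: str, machine_type: str) -> str:
    """Append bus=pcie.0 so a pxb-pcie bus never becomes the default bus."""
    if has_pcie_root(machine_type):
        return device + ",bus=pcie.0"
    return device  # pseries, s390-ccw-virtio, microvm: skip the pin


pinned = virtio_pci_device_args("virtio-net-pci,netdev=net0", "q35")
unpinned = virtio_pci_device_args("virtio-net-pci,netdev=net0", "pseries")
```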
Fabiano Fidêncio
15292da217 config: Enable NUMA by default for nvidia-gpu configurations
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.

Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
feeb5d8ecc runtime-rs: Fix vCPU pinning race with backoff retry
QEMU can report fewer vCPU threads during early startup, causing partial
affinity setup. Let's retry with exponential backoff until the expected
thread count is visible, then continue with best-effort pinning if the
window is exhausted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
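The retry strategy from the commit above can be sketched as a small polling loop. This is a Python illustration under stated assumptions, not the runtime-rs code: the function name `wait_for_vcpu_threads`, the attempt count, and the delays are all hypothetical.

```python
import time


def wait_for_vcpu_threads(count_threads, expected, attempts=5,
                          base_delay=0.01, sleep=time.sleep):
    """Poll until `expected` threads are visible, doubling the delay each try."""
    delay = base_delay
    seen = count_threads()
    for _ in range(attempts):
        if seen >= expected:
            break
        sleep(delay)
        delay *= 2
        seen = count_threads()
    # If the window is exhausted the caller pins whatever is visible.
    return seen


readings = iter([1, 2, 4])  # thread counts reported on successive polls
delays = []
seen = wait_for_vcpu_threads(lambda: next(readings), expected=4,
                             base_delay=1.0, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the example fast to run; here it records the delays instead of actually waiting.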