Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.
This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices. Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.
Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0.
NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node). In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.
So pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root. Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.
This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.
Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
QEMU may not have spawned all vCPU threads when pinning starts, so
query_cpus_fast can return an incomplete list and leave some vCPUs
unpinned. To fix it, let's add exponential backoff retries before
pinning and fall back to available threads if retries are exhausted.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When cpuManagerPolicy=static is configured, kubelet sets the sandbox
CPU quota to -1 (unconstrained) because it uses cpuset pinning instead
of CFS quota. This causes CalculateSandboxSizing to compute 0 workload
CPUs, resulting in the VM starting with only default_vcpus.
Fall back to deriving the CPU count from sandbox CPU shares (1024
shares per CPU) when the quota-based calculation yields 0.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured,
vCPU threads are pinned to host CPUs belonging to the same NUMA node as
the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(),
preserving memory locality. vCPUs are distributed proportionally across
NUMA nodes, matching the distribution in buildNUMATopology().
Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and
container update(). When multi-NUMA is configured, translate host NUMA
node IDs to guest NUMA node IDs using translateHostMemsToGuest() before
forwarding to the agent. This allows the agent to enforce NUMA-aware
memory placement for containers.
Filter guest NUMA nodes at VM creation time: before calling CreateVM(),
prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox
cpuset. This avoids exposing fake NUMA topology to the guest when
Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all
CPUs from node 0 on a 2-node host), improving memory locality and
avoiding unnecessary cross-node memory traffic.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.
Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.
Guard multi-NUMA topology generation to amd64 and arm64 only, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.
Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).
Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.
Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter.
When multiple NUMA nodes are configured, restructure the SMP topology so
that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by
socket per NUMA node. Use ceiling division so that uneven vCPU counts
(e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP
topology where MaxCPUs == Sockets * Cores * Threads.
When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus,
Cores=1) is preserved.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to
Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node
memory-backend objects with host-nodes/policy=bind, -numa node entries
with cpus= ranges, and -numa dist entries for the distance matrix.
Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported()
to ensure architectures without DIMM support (s390x, riscv64) fall back
to the single-node path. Drop 386 from isDimmSupported since 32-bit x86
is not a supported Kata target.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Without the protocol in the URI, grpc-go defaults to the DNS resolver,
which results in an error for unix sockets (`name resolver error: produced
zero addresses`).
We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as
they are no longer necessary, grpc-go supports connecting to unix sockets
directly. This also removes the matching tests.
This also adds a `Makefile` and tweaks the Dockerfile to simplify building
the Docker image.
Fixes#12398
Signed-off-by: Florian Vichot <florian.vichot@gmail.com>
Docker 29.5+ adds a private time namespace to container bundles by
default, but kata agent only supports the classic namespace set and
then fails with "invalid namespace type".
Let's strip time namespaces in both the Go and rust runtimes before the
spec reaches the agent, matching how network and cgroup namespaces are
handled.
Fixes: #13080
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.
At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding open resources from the host. Over many
container lifecycles this becomes a serious resource leak.
Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.
Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.
Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
We sometimes get this error when creating the pod sandbox:
failed to create shim task: Failed to add qdisc for network index 2 : device or resource busy.
Adding a linear backoff retry when adding the qdisc to help mitigate the issue at the source and avoid the cascading error.
Signed-off-by: Mayeul Blanzat <mayeul.blanzat@datadoghq.com>
Because intptr() returns a fresh pointer on every call, those comparisons compared addresses,
never values, so every check evaluated to false.
As a result /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and /dev/loop*
were appended to devices allowlist for sandbox_cgroup
even when the runtime spec already listed them, producing duplicate entries.
Switch to nil-safe value comparisons via a type switch on the cgroup device type
and dereferenced *d.Major / *d.Minor,
keeping the same detection semantics but actually matching existing entries.
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
When sandbox_cgroup_only is enabled, the kata shim threads inherit
the sandbox device cgroup. For container rootfs whose mount source
is a regular file backed by a loop device (notably the blockfile
snapshotter), containerd's mount package opens /dev/loop-control to
allocate a free /dev/loopN and then opens that block node to attach
the backing file. Neither device is on the sandbox cgroup allowlist,
so both opens fail with EPERM.
This change adds /dev/loop-control (char 10:237) and the /dev/loopN
block nodes (block major 7, any minor) to the sandbox device cgroup
allowlist when sandbox_cgroup_only is true, mirroring the existing
treatment of /dev/null, /dev/urandom and /dev/ptmx. The additions
are gated on SandboxCgroupOnly because that is the only mode in
which the shim itself is constrained by this cgroup.
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Bump the go version to resolve CVEs:
- GO-2026-4918
- GO-2026-4971
- GO-2026-4976
- GO-2026-4977
- GO-2026-4980
- GO-2026-4981
- GO-2026-4982
- GO-2026-4986
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Assisted-by: IBM Bob
Switch the NVIDIA GPU rootfs images (both standard and confidential)
from ext4 to erofs (Enhanced Read-Only File System).
Unlike ext4, which is a read-write filesystem mounted read-only by
convention, erofs is structurally read-only -- no journal, no write
metadata, no superblock write path. This eliminates accidental
mutation and reduces the attack surface inside the guest VM, which
is particularly important for confidential workloads using dm-verity.
Introduce a DEFROOTFSTYPE_NV Makefile variable (set to erofs) for
both Go and Rust runtimes, keeping the global DEFROOTFSTYPE as ext4
so non-NVIDIA configurations are unaffected.
Update all six NVIDIA GPU configuration templates (base, SNP, TDX
for both runtimes) to use @DEFROOTFSTYPE_NV@ instead of the global
@DEFROOTFSTYPE@.
Export FS_TYPE=erofs in install_image_nvidia_gpu() and
install_image_nvidia_gpu_confidential() so the build pipeline
produces erofs images via the image builder.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Keep virtio_fs_extra_args support in code, but remove it from default
enable_annotations and add explicit security warnings in Makefiles and
docs.
Release-note note: mirror this hardening in release notes so operators
know this remains opt-in and carries host-side risk when enabled.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
`make vendor` isn't required anymore. People who need vendored code should
use the `tools/packaging/release/generate_vendor.sh` script instead.
Assisted-by: Claude AI
Signed-off-by: Greg Kurz <groug@kaod.org>
Now shipped in the vendored code tarball.
Drop the git tree status check since it isn't needed anymore.
Also stop building with `-mod=vendor`. This requires to
expose GOMODCACHE as suggested by Fabiano Fidêncio.
Signed-off-by: Greg Kurz <groug@kaod.org>
Interfaces whose drivers do not register a specific netlink kind
(e.g. mlx5 Scalable Functions) are reported with the generic type
"device". The endpoint creation code did not handle this type,
causing sandbox creation to fail with:
"Unsupported network interface: device"
This is particularly visible on arm64 with Mellanox ConnectX NICs
using Scalable Functions, where the ethtool BusInfo returns a
non-PCI identifier (e.g. "mlx5_core.sf.4") so isPhysicalIface()
cannot classify the interface as physical either.
Handle "device" type interfaces the same way as veth endpoints,
connecting them through a TAP + TC-filter bridge.
Additionally, relax getLinkForEndpoint() for VethEndpoint so it
accepts the concrete link type returned by the kernel instead of
asserting *netlink.Veth. A "device" type interface wrapped in a
VethEndpoint returns *netlink.Device from LinkByName(), which
would fail the strict type assertion. All callers only need
link.Attrs(), so accepting any link type is safe.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The run-tracing job in basic-ci-amd64.yaml has been disabled
(if: false) due to issue #9763, with no path to re-enablement.
Remove the job definition and the backing
tests/functional/tracing/ directory.
Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This change mirrors host networking into the guest as before, but now also
includes the default gateway neighbor entry for each interface.
Pods using overlay/synthetic gateways (e.g., 169.254.1.1) can hit a
first-connect race while the guest performs the initial ARP. Preseeding the
gateway neighbor removes that latency and makes early connections (e.g.,
to the API Service) deterministic.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
Wait() was releasing s.mu immediately after getContainer(), then
calling getExec() — which reads c.execs — without holding any lock.
Concurrent Exec() or Delete() calls that write to c.execs under s.mu
triggered a "concurrent map read and map write" fatal panic.
Add a dedicated sync.RWMutex to the container struct that protects the
execs map. getExec() now acquires a read lock internally, and all
writes go through new setExec()/deleteExec() helpers that acquire the
write lock. This keeps the locking concern local to the map and avoids
complicating the s.mu usage in Wait().
Add a regression test (TestConcurrentExecAccess) that exercises
concurrent getExec reads against setExec/deleteExec writes; this
reliably reproduces the panic under the race detector without the fix.
Fixes: #12825
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This is not related to this PR, but rather to #12734, which ended up not
running the `make src/agent generate-protocols`.
While here, let's also fix it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Follow-on to kata-containers/kata-containers#12396
Switch SNP config from initrd-based to image-based rootfs with
dm-verity. The runtime assembles the dm-mod.create kernel cmdline
from kernel_verity_params, and with kernel-hashes=on the root hash
is included in the SNP launch measurement.
Also add qemu-snp to the measured rootfs integration test.
Signed-off-by: Amanda Liem <aliem@amd.com>
runtime-rs lacks several features needed for CPU hotplug on ARM:
pflash/UEFI firmware passthrough, SMP topology in -smp, nr_cpus
kernel parameter, and QMP vCPU add handling for the virt machine
type (which requires core-id only placement with socket/thread/die
set to -1).
Without static sandbox resource management, these gaps cause
failures in tests like k8s-memory.bats where the VM is not correctly
sized for the workload.
Enable static_sandbox_resource_mgmt for aarch64 in the QEMU
runtime-rs configuration so the VM is pre-sized at creation time,
sidestepping the need for hotplug entirely.
Together with this we're aligning the go runtime to the very same
behaviour.
Fixes: #10928
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor