Commit Graph

2326 Commits

Author SHA1 Message Date
Aurélien Bombo
3acb618f6b genpolicy: Assume disable_guest_empty_dir = true
This option should be removed for 4.0, so we don't handle `false`.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-06-24 15:22:13 -05:00
Aurélien Bombo
e191c5b716 runtime-go/rs: Reconcile hugepage emptyDirs and disable_guest_empty_dir
This addresses an issue where the disable_guest_empty_dir=true code paths did
not take into account that hugepage-backed emptyDirs should always be recreated
in the guest (using guest hugepages).

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-06-24 15:22:13 -05:00
Aurélien Bombo
a3e91d9ed2 runtime-go/rs: Set disable_guest_empty_dir = true by default
This makes the runtime share the host Kubelet emptyDir folder with the guest
instead of the agent creating an empty folder in the container rootfs. Doing so
enables the Kubelet to track emptyDir usage and evict greedy pods.

In other words, with virtio-fs the container rootfs uses host storage whether
this is true or false, however with true, Kata uses the k8s emptyDir folder so
the sizeLimit is properly enforced by k8s.

Addresses the ephemeral storage part of #12203.

History:

 * Initially, emptyDirs are slow because they are shared from the host with 9p.
   https://github.com/kata-containers/runtime/issues/1472

 * To address above, emptyDirs are hardcoded to be created by the agent in the
   pause container's rootfs, potentially leveraging devicemapper and improving
   perf.
   https://github.com/kata-containers/runtime/pull/1485

 * The previous PR regressed an (interesting?) use case where emptyDirs were
   used to share data from the host to the guest, so the behavior was made
   configurable and `disable_guest_empty_dir = false` is introduced, defaulting
   to the behavior of the previous PR.
   https://github.com/kata-containers/kata-containers/pull/2056

 * Another resource accounting regression remains which is addressed in this PR.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2026-06-24 15:21:53 -05:00
Fabiano Fidêncio
0ddb2ee1f1 Merge pull request #13160 from LandonTClipp/kata_visible_devices
feat(agent): translate KATA_VISIBLE_DEVICES into CDI GPU requests
2026-06-16 19:10:35 +02:00
LandonTClipp
a1dd28cb52 feat(runtime): plumb VISIBLE_CDI_DEVICES through the Go runtime
Add a `visible_cdi_devices` TOML option to the Go runtime so the
agent.visible_cdi_devices=true kernel parameter is emitted to the guest
when enabled. Wire the option through the NVIDIA GPU configuration
templates and add tests verifying the kernel-params flow.

Signed-off-by: LandonTClipp <lclipp@coreweave.com>
2026-06-16 11:44:09 +02:00
Fabiano Fidêncio
6203e28bef runtime: Propagate block-mode device read-only flag to the VMM
Block-mode volumes (e.g. Kubernetes volumeDevices) are passed to the
container as device nodes in spec.Linux.Devices and carry no mount "ro"
option. Their read-only intent is expressed only via the cgroup device
access in spec.Linux.Resources.Devices ("rm" = read+mknod, no write, for
read-only; "rwm" for read-write).

The device path ignored that signal: newLinuxDeviceInfo() built the
DeviceInfo without ever setting ReadOnly (it only consumed FileMode, the
node permission bits, not the read/write access), so the device was
always attached read-write.

This is problematic for filesystems such as XFS, which inspect the block
device read-only state to decide whether to attempt journal/log recovery.
When the guest device is writable, XFS tries to replay the log even for a
read-only mount, which fails badly. Mounting "-o ro" in the guest is not
enough; the device itself must advertise read-only, which only happens
when the VMM opens the backing device read-only (DeviceInfo.ReadOnly ->
BlockDrive.ReadOnly -> qemu read-only=on / clh Readonly).

Derive the read-only flag from two independent signals, combined with OR
so either one marks the device read-only:

  - the cgroup device access rule that exactly matches the device, so a
    block-mode volume marked read-only by the orchestrator (e.g. a pod
    volume with persistentVolumeClaim.readOnly: true) is honored, and
  - the host block device's own read-only flag (queried via the BLKROGET
    ioctl). Block-mode volumes frequently carry no read-only signal in
    the OCI spec at all, so the device flag is often the only reliable
    source.

The BLKROGET probe is shared (pkg/device/config.BlockDeviceIsReadOnly,
Linux-only with a stub on other platforms) between the device-node path
(newLinuxDeviceInfo, probing /dev/block/<major>:<minor>) and the
bind-mounted/filesystem block path (createDeviceInfo). None of this
relies on external host tooling such as "blockdev --setro".

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor
2026-06-15 23:18:36 +02:00
Manuel Huber
70d8f1bf3d runtime: remove file_mem_backend config option
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.

The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
2026-06-12 00:07:16 +00:00
Hyounggyu Choi
7cc6767fa2 runtime*: use static_sandbox_resource_mgmt defaults for qemu-se
Switch qemu-se config templates to use the TEE/CoCo-specific
static_sandbox_resource_mgmt defaults instead of the generic
QEMU defaults.

qemu-se-runtime-rs config now uses DEFSTATICRESOURCEMGMT_COCO
while runtime qemu-se config now uses DEFSTATICRESOURCEMGMT_TEE.
This aligns static sandbox resource management behavior with confidential
container expectations for qemu-se variants.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2026-06-09 14:45:50 +02:00
Fabiano Fidêncio
743b0a4839 Merge pull request #13165 from stevenhorsman/bump-go-to-1.25.11
versions: bump golang to 1.25.11
2026-06-04 20:24:57 +02:00
Fabiano Fidêncio
2a1ce7b8c4 Merge pull request #12539 from mythi/no-vcpu-hotplug
Disable CPU hotplug when confidential guest setting enabled
2026-06-04 10:56:52 +02:00
stevenhorsman
879912be25 versions: bump golang to 1.25.11
Bump the go version to resolve CVEs:
- GO-2026-5037
- GO-2026-5038
- GO-2026-5039

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-04 08:49:17 +01:00
Mikko Ylinen
e475d870fb runtime: qemu: don't set maxcpus when confidential guest is enabled
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.

Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.

Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-06-03 15:27:35 +03:00
stevenhorsman
c0f549860e runtime: bump golang.org/x dependencies
Bump golang.org/x/net from v0.53.0 to v0.55.0 and golang.org/x/sys
from v0.43.0 to v0.44.0 to resolve CVEs:
- GO-2026-5024
- GO-2026-5025
- GO-2026-5026
- GO-2026-5027
- GO-2026-5028
- GO-2026-5029
- GO-2026-5030

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Generated-By: IBM Bob
2026-06-03 09:56:54 +01:00
Fabiano Fidêncio
a2bb3f64b0 Merge pull request #12436 from mythi/tdx-updates-2026-3
runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default
2026-06-03 08:50:26 +02:00
Fabiano Fidêncio
230e01b04e Merge pull request #13126 from kata-containers/topic/runtimes-introduce-azure-specific-configs
runtime/runtime-rs: introduce Azure specific configs
2026-06-02 09:17:09 +02:00
Fabiano Fidêncio
9b5b829265 runtime: oci: derive sandbox CPUs from shares only if unconstrained
The shares-based fallback added for cpuManagerPolicy=static fired whenever
the quota-based CPU count was 0, including for BestEffort sandboxes that
have no CPU request. Those sandboxes still carry the cgroup-floor shares
value (2), so the fallback derived ceil(2/1024)=1 and inflated every such
sandbox by one vCPU. For peer-pods (static resource management) this
changed the VM sizing to default_vcpus+1, regressing the libvirt
instance-type CI checks.

Gate the fallback on the quota being explicitly unconstrained (< 0), which
is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0.
BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs
while the static-policy case still recovers the CPU count from shares.

Add unit tests covering the static-policy, rounding, BestEffort, and
explicit-quota cases.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-06-01 09:50:49 +02:00
Aurélien Bombo
9acef4bc55 Merge pull request #13133 from microsoft/cameronbaird/upstream/revert-macvtap-simple
Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
2026-05-29 14:57:07 -05:00
Cameron Baird
7a9d207ab2 Revert "runtime: Enforce >= 1 queue pairs for tapNetworkPair"
This reverts commit 2799f7d36b.
2026-05-29 17:05:40 +00:00
Fabiano Fidêncio
025202a52a runtime: expose InfiniBand devices to VFIO containers
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec.  As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).

Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer().  It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry.  Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-29 13:07:45 +02:00
Fabiano Fidêncio
f36c383b4f runtime: generate dedicated CLH Azure config variants
Create configuration-clh-azure{,-runtime-rs}.toml from the base CLH
configs during build.

This keeps Mariner-specific defaults in explicit config artifacts
instead of ad-hoc runtime mutation.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-28 23:32:37 +02:00
Fabiano Fidêncio
fa9a9f3aeb runtime: set VF admin MAC before vfio-pci rebind for IB/RoCE support
Without an admin MAC, the guest's mlx5_core inherits the VF's
firmware-default MAC. This MAC differs from the IB port's HCA MAC, so
mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE then appears active
(port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID
— RoCEv2 packets, address handles, librdmacm bind — fails silently.

Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF
using RTM_SETLINK before the bind-to-vfio-pci step. The firmware
applies the admin MAC during the VF reset that accompanies the
unbind/rebind cycle, so the guest sees a single consistent MAC across
netdev, IB port, and HCA.

Best-effort: failures are logged at warn and the existing agent-side
MAC reconciliation (rpc.rs::update_interface) remains as a fallback for
L2/L3 connectivity.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
992a723392 runtime: resolve cold-plug VFIO guest PCI path via QMP
For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged
VFIO device is auto-allocated at boot (each pcie-root-port is added with
chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot
on pcie.0). The hot-plug path already queries QMP via qomGetPciPath;
reuse that same mechanism for cold-plugged devices.

Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface.
Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all
other hypervisors.

Call it at the start of setupNetworks so that the PCI paths are resolved
before generateVCNetworkStructures emits the agent Interface proto. Also
stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs
exposed as physical network devices) so that update_interface carries a
non-empty devicePath. Without devicePath the agent falls back to a
by-MAC link lookup which fails when the VF firmware MAC differs from the
CNI-assigned MAC after the vfio-pci unbind/rebind cycle.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
23c5250933 runtime/qemu: emit id= for VFIODevice on -device cmdline
Without an explicit id= on the vfio-pci device, QEMU auto-generates
an internal name that does not match vfioDev.ID, so any subsequent
qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not
found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the
device ID to look up the guest PCI path, leaving GuestPciPath nil and
causing update_interface to fail repeatedly as the agent can't find
the interface to configure.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
e6777f0866 runtime: keep cold-plug VFIO devices in guest-kernel mode
Container.createDevices was dropping cold-plug VFIO entries from the
container's deviceInfos whenever vfio_mode = "guest-kernel", which
in turn meant the agent's CreateContainer request carried no
vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The
SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the
workload container, so update_env_pci then aborted with
"No PCI mapping found for container <id>" and the container failed
with CrashLoopBackOff.

Include cold-plug VFIO devices in deviceInfos for both VFIO modes.
The existing vfio-pci-gk agent handler returns dev: None (so
/dev/vfio/<group> is not materialised in the container spec, and
constrainGRPCSpec(stripVfio=true) already strips it from the grpc
spec for guest-kernel mode), while still recording the host->guest
PCI mapping into sandbox.pcimap[cid] so env-var translation works.

devManager.NewDevice calls FindDevice first, which matches the
already cold-plugged sandbox-level device by HostPath / major / minor,
so this does not double-attach.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Fabiano Fidêncio
9893b6dc03 runtime: correctly resolve cold-plug VFIO guest PCI paths
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
2026-05-28 21:54:52 +02:00
Cameron Baird
2799f7d36b runtime: Enforce >= 1 queue pairs for tapNetworkPair
In the xConnectVMNetwork path, we have queues = 0 as a baseline,
set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise
MultiQueueSupport. This is certainly incorrect as we always want, as
a baseline, at least one queue pair. Make queues := 1 by default
to ensure the NetworkPair has at least one queue pair for all
virtio-net paths.

Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
2026-05-27 18:55:11 +00:00
Mikko Ylinen
2b38d9f45e runtime(-rs): tdx: use TDX QGS via unix-domain-socket by default
TDX QGS takes raw TD report from QEMU/guest VM and signs it in an SGX
enclave. Historically, QGS has supported two transports: vsock and
unix-domain-socket. The former was necessary before the guest kernel
supported the GetQuote "TDVMCALL" hypercall: DCAP library inside the
guest used vsock to talk to QGS directly.

However, with GetQuote, QEMU gets the TDREPORT and sends it to QGS.
In process-to-process communication, unix-domain-socket is a better
approach. This is also the only transport supported by libvirt by default.

With that, align Kata default configuration to use unix-domain-socket
as well. The change in impacts QEMU commandline:

old:
"quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}
new:
"quote-generation-socket":{"type":"unix","path":"/var/run/tdx-qgs/qgs.socket"}

Host QGS configuration must be changed to listen unix-domain-sockets.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-05-26 17:08:56 +03:00
Mikko Ylinen
733c6791d3 runtime(-rs): make TDX QGS port=0 change backwards compatible
Changing Kata runtime configurations to use TDX QGS port=0 (unix domain
socket transport) means cluster admins must also reconfigure qgsd to
the same and have /var/run/tdx-qgs/qgs.sock available.

Since the early days of TDX attestation in Kata, the configuration has used
vsock with cid=2, port=4050. To avoid unncessary breakages when Kata default
moves to unix domain socket, fall back to the old configuration if
/var/run/tdx-qgs/qgs.sock is not available on the worker node.

Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
2026-05-26 17:01:52 +03:00
Zvonko Kaiser
aeadb1af35 Merge pull request #12948 from fidencio/topic/numa
runtime (go): agent: Add NUMA support for QEMU
2026-05-25 15:33:14 +02:00
Fabiano Fidêncio
1cbe930fc9 runtime: Add pxb-pcie NUMA-aware PCIe topology for VFIO devices
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices.  Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.

Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge.  Non-VFIO devices remain on pcie.0.

NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node).  In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.

So pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root.  Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.

This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
15292da217 config: Enable NUMA by default for nvidia-gpu configurations
Enable enable_numa=true in the three nvidia-gpu QEMU configuration
templates (base, SNP, TDX). On single-NUMA hosts this is a no-op since
buildNUMATopology() returns nil when there is only one node. On
multi-NUMA hosts it ensures GPU memory accesses are NUMA-local.

Add documentation to all QEMU config templates explaining the VFIO
device NUMA placement validation that occurs when NUMA is enabled.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
f53f427859 runtime: Fix vCPU pinning race for Go runtime
QEMU may not have spawned all vCPU threads when pinning starts, so
query_cpus_fast can return an incomplete list and leave some vCPUs
unpinned. To fix it, let's add exponential backoff retries before
pinning and fall back to available threads if retries are exhausted.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
b688619314 runtime: oci: Fix sandbox CPU sizing with cpuManagerPolicy=static
When cpuManagerPolicy=static is configured, kubelet sets the sandbox
CPU quota to -1 (unconstrained) because it uses cpuset pinning instead
of CFS quota. This causes CalculateSandboxSizing to compute 0 workload
CPUs, resulting in the VM starting with only default_vcpus.

Fall back to deriving the CPU count from sandbox CPU shares (1024
shares per CPU) when the quota-based calculation yields 0.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
12e5985dbd runtime: Add NUMA-aware vCPU pinning and cpuset.mems forwarding
Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured,
vCPU threads are pinned to host CPUs belonging to the same NUMA node as
the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(),
preserving memory locality. vCPUs are distributed proportionally across
NUMA nodes, matching the distribution in buildNUMATopology().

Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and
container update(). When multi-NUMA is configured, translate host NUMA
node IDs to guest NUMA node IDs using translateHostMemsToGuest() before
forwarding to the agent. This allows the agent to enforce NUMA-aware
memory placement for containers.

Filter guest NUMA nodes at VM creation time: before calling CreateVM(),
prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox
cpuset. This avoids exposing fake NUMA topology to the guest when
Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all
CPUs from node 0 on a 2-node host), improving memory locality and
avoiding unnecessary cross-node memory traffic.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
d0d7deb262 runtime: Add host NUMA distance discovery and build guest NUMA topology
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.

Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.

Guard multi-NUMA topology generation to amd64 and arm64 only, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.

Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
447e2a3faf runtime: Add VFIO device NUMA node detection and placement validation
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).

Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.

Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
1ee8bb5740 runtime: Add NUMA-aware SMP topology
Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter.
When multiple NUMA nodes are configured, restructure the SMP topology so
that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by
socket per NUMA node. Use ceiling division so that uneven vCPU counts
(e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP
topology where MaxCPUs == Sockets * Cores * Threads.

When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus,
Cores=1) is preserved.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Fabiano Fidêncio
1e9da61d48 govmm: Add multi-NUMA memory backend and distance matrix support
Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to
Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node
memory-backend objects with host-nodes/policy=bind, -numa node entries
with cpus= ranges, and -numa dist entries for the distance matrix.

Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported()
to ensure architectures without DIMM support (s390x, riscv64) fall back
to the single-node path. Drop 386 from isDimmSupported since 32-bit x86
is not a supported Kata target.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2026-05-24 22:00:46 +02:00
Florian Vichot
554e8f91b1 kata-monitor: use full URI for connecting to containerd
Without the protocol in the URI, grpc-go defaults to the DNS resolver,
which results in an error for unix sockets (`name resolver error: produced
zero addresses`).

We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as
they are no longer necessary, grpc-go supports connecting to unix sockets
directly. This also removes the matching tests.

This also adds a `Makefile` and tweaks the Dockerfile to simplify building
the Docker image.

Fixes #12398

Signed-off-by: Florian Vichot <florian.vichot@gmail.com>
2026-05-23 16:47:46 +02:00
dependabot[bot]
ac77c5fdff build(deps): bump github.com/containerd/containerd in /src/runtime
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.7.29 to 1.7.32.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.7.29...v1.7.32)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-version: 1.7.32
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-21 21:56:06 +00:00
Fabiano Fidêncio
f9eafb3341 runtime: drop host time namespace from OCI spec
Docker 29.5+ adds a private time namespace to container bundles by
default, but kata agent only supports the classic namespace set and
then fails with "invalid namespace type".

Let's strip time namespaces in both the Go and rust runtimes before the
spec reaches the agent, matching how network and cgroup namespaces are
handled.

Fixes: #13080

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2026-05-21 13:56:45 +02:00
Fabiano Fidêncio
ffa59ce3aa Merge commit from fork
runtime: disable virtiofsd extra-args annotation by default
2026-05-19 08:22:12 +02:00
Alex Lyn
8dca734008 Merge pull request #12959 from DataDog/mayeul/fix-race-condition-when-adding-qdisc
shim: Add backoff retry to ingress qdisc creation to avoid potential race condition
2026-05-19 14:06:37 +08:00
Sebastian Wolf
26746c9ce8 runtime/fc: track real firecracker PID instead of jailer PID
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.

At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding open resources from the host. Over many
container lifecycles this becomes a serious resource leak.

Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.

Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.

Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
2026-05-18 21:09:51 +02:00
Mayeul Blanzat
26f60ddd9b shim: Add backoff retry to ingress qdisc creation to avoid race condition
We sometimes get this error when creating the pod sandbox:
failed to create shim task: Failed to add qdisc for network index 2 : device or resource busy.

Adding a linear backoff retry when adding the qdisc to help mitigate the issue at the source and avoid the cascading error.

Signed-off-by: Mayeul Blanzat <mayeul.blanzat@datadoghq.com>
2026-05-18 17:46:50 +02:00
Steve Horsman
557fb5187b Merge pull request #12853 from kata-containers/dependabot/go_modules/src/runtime/github.com/sirupsen/logrus-1.9.4
build(deps): bump github.com/sirupsen/logrus from 1.9.3 to 1.9.4 in /src/runtime
2026-05-14 13:56:10 +01:00
Fabiano Fidêncio
c8f6f17269 Merge pull request #13027 from PiotrProkop/fix-loop-blockfile-sandbox-cgroup
runtime: allow loopback devices when sandbox_cgroup_only is enabled
2026-05-14 11:18:45 +02:00
dependabot[bot]
408e15641c build(deps): bump github.com/sirupsen/logrus in /src/runtime
Bumps [github.com/sirupsen/logrus](https://github.com/sirupsen/logrus) from 1.9.3 to 1.9.4.
- [Release notes](https://github.com/sirupsen/logrus/releases)
- [Changelog](https://github.com/sirupsen/logrus/blob/master/CHANGELOG.md)
- [Commits](https://github.com/sirupsen/logrus/compare/v1.9.3...v1.9.4)

---
updated-dependencies:
- dependency-name: github.com/sirupsen/logrus
  dependency-version: 1.9.4
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2026-05-13 06:11:21 +00:00
Greg Kurz
d2dc0a923c Merge pull request #13030 from stevenhorsman/go-1.25.10-bump
Go 1.25.10 bump
2026-05-13 08:09:51 +02:00
PiotrProkop
5065058d4a runtime: fix device allowlist detection comparing pointers
Because intptr() returns a fresh pointer on every call, those comparisons compared addresses,
never values, so every check evaluated to false.
As a result /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and /dev/loop*
were appended to devices allowlist for sandbox_cgroup
even when the runtime spec already listed them, producing duplicate entries.

Switch to nil-safe value comparisons via a type switch on the cgroup device type
and dereferenced *d.Major / *d.Minor,
keeping the same detection semantics but actually matching existing entries.

Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
2026-05-12 18:52:53 +02:00