The shares-based fallback added for cpuManagerPolicy=static fired whenever
the quota-based CPU count was 0, including for BestEffort sandboxes that
have no CPU request. Those sandboxes still carry the cgroup-floor shares
value (2), so the fallback derived ceil(2/1024)=1 and inflated every such
sandbox by one vCPU. For peer-pods (static resource management) this
changed the VM sizing to default_vcpus+1, regressing the libvirt
instance-type CI checks.
Gate the fallback on the quota being explicitly unconstrained (< 0), which
is the actual cpuManagerPolicy=static signal, instead of on numCPU == 0.
BestEffort sandboxes (quota 0/absent) now correctly contribute 0 vCPUs
while the static-policy case still recovers the CPU count from shares.
Add unit tests covering the static-policy, rounding, BestEffort, and
explicit-quota cases.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Without an explicit id= on the vfio-pci device, QEMU auto-generates
an internal name that does not match vfioDev.ID, so any subsequent
qomGetPciPath(vfioDev.ID) call via QMP fails with "Device 'X' not
found". This breaks resolveColdPlugVFIOGuestPciPaths which needs the
device ID to look up the guest PCI path, leaving GuestPciPath nil and
causing update_interface to fail repeatedly as the agent can't find
the interface to configure.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Changing Kata runtime configurations to use TDX QGS port=0 (unix domain
socket transport) means cluster admins must also reconfigure qgsd to
the same and have /var/run/tdx-qgs/qgs.sock available.
Since the early days of TDX attestation in Kata, the configuration has used
vsock with cid=2, port=4050. To avoid unncessary breakages when Kata default
moves to unix domain socket, fall back to the old configuration if
/var/run/tdx-qgs/qgs.sock is not available on the worker node.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices. Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.
Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0.
NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node). In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.
So pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root. Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.
This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When cpuManagerPolicy=static is configured, kubelet sets the sandbox
CPU quota to -1 (unconstrained) because it uses cpuset pinning instead
of CFS quota. This causes CalculateSandboxSizing to compute 0 workload
CPUs, resulting in the VM starting with only default_vcpus.
Fall back to deriving the CPU count from sandbox CPU shares (1024
shares per CPU) when the quota-based calculation yields 0.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.
Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.
Guard multi-NUMA topology generation to amd64 and arm64 only, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.
Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).
Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.
Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Introduce NUMANode and NUMADist types, add NUMANodes/NUMADists fields to
Config, and implement appendMultiNUMAMemoryKnobs() to generate per-node
memory-backend objects with host-nodes/policy=bind, -numa node entries
with cpus= ranges, and -numa dist entries for the distance matrix.
Gate the multi-NUMA path in appendMemoryKnobs() behind isDimmSupported()
to ensure architectures without DIMM support (s390x, riscv64) fall back
to the single-node path. Drop 386 from isDimmSupported since 32-bit x86
is not a supported Kata target.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Without the protocol in the URI, grpc-go defaults to the DNS resolver,
which results in an error for unix sockets (`name resolver error: produced
zero addresses`).
We also remove the `getAddressAndDialer(...)` and `dial(...)` functions, as
they are no longer necessary, grpc-go supports connecting to unix sockets
directly. This also removes the matching tests.
This also adds a `Makefile` and tweaks the Dockerfile to simplify building
the Docker image.
Fixes#12398
Signed-off-by: Florian Vichot <florian.vichot@gmail.com>
The run-tracing job in basic-ci-amd64.yaml has been disabled
(if: false) due to issue #9763, with no path to re-enablement.
Remove the job definition and the backing
tests/functional/tracing/ directory.
Made-with: Cursor
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Wait() was releasing s.mu immediately after getContainer(), then
calling getExec() — which reads c.execs — without holding any lock.
Concurrent Exec() or Delete() calls that write to c.execs under s.mu
triggered a "concurrent map read and map write" fatal panic.
Add a dedicated sync.RWMutex to the container struct that protects the
execs map. getExec() now acquires a read lock internally, and all
writes go through new setExec()/deleteExec() helpers that acquire the
write lock. This keeps the locking concern local to the map and avoids
complicating the s.mu usage in Wait().
Add a regression test (TestConcurrentExecAccess) that exercises
concurrent getExec reads against setExec/deleteExec writes; this
reliably reproduces the panic under the race detector without the fix.
Fixes: #12825
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).
Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.
Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
field index, fixing parsing with optional mount tags (shared:,
master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start
Fixes: #9340
Signed-off-by: llink5 <llink5@users.noreply.github.com>
The govmm workflow isn't run by us and it and the other CI files
are just legacy from when it was a separate repo, so let's clean up
this debt rather than having to update it frequently.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update the action to resolve the following warning in GHA:
> Node.js 20 actions are deprecated. The following actions are running
> on Node.js 20 and may not work as expected:
> actions/checkout@11bd71901b.
> Actions will be forced to run with Node.js 24 by default starting June 2nd, 2026.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:
block_device_logical_sector_size (config file)
block_device_physical_sector_size (config file)
io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation)
io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)
The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.
Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.
Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.
Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Enable VFIO device pass-through at VM creation time on Cloud Hypervisor,
in addition to the existing hot-plug path.
Signed-off-by: Roaa Sakr <romoh@microsoft.com>
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.
Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.
Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.
Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
* Introduces the `emptydir_mode` config flag to allow instructing the runtime
to create a block device for emptyDir volumes.
* The block device is created in the original emptyDir folder on the host
so that Kubelet can monitors its disk usage and evict the pod if it exceeds
its sizeLimit. This matches runc and virtio-fs.
* The block device's disk image file is sparse to minimize host disk
footprint.
Fixes: #10560
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The remote hypervisor delegates VM creation to a remote service.
The VM runs on cloud infrastructure, not the local host kernel.
So requiring a KVM/MSHV device is semantically wrong and would
cause a hard failure on any host where these devices are absent
(e.g., a VM that doesn't expose nested virtualization).
Skip sandboxDevices() entirely when the configured hypervisor type
is remoteHypervisor{}.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Different kubernetes distributions, such as k0s, use a different kubelet
root dir location instead of the default /var/lib/kubelet, so ConfigMap
and Secret volume propagation were failing.
This adds a kubelet_root_dir config option that the go runtime uses when
matching volume paths and kata-deploy now sets it automatically for k0s
via a drop-in file.
runtime-rs does not need this option: it identifies ConfigMap/Secret,
projected, and downward-api volumes by volume-type path segment
(kubernetes.io~configmap, etc.), not by kubelet root prefix.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This commit adds logic to properly handle memory hotplug
for QemuCCWVirtio in the ExecMemdevAdd() path.
The new logic is triggered only when virtio-mem is enabled.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
strings.ReplaceAll was introduced in Go 1.12 as a more readable and self-documenting way to say "replace everything".
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Update from `this` to fix:
```
ST1006: receiver name should be a reflection of its identity; don't use generic names such as "this" or "self" (staticcheck)
```
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
strings.SplitN(s, sep, -1) is functionally identical to strings.Split(s, sep)
as -1 says to return all substrings, so choose the more concise version
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
GenericDevice is an embedded (anonymous) field in the device struct, so its fields
and methods are "promoted" to the outer struct, so we go straight to it.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Separatly added hypervisor devices to cgroup to
omit not relevant warnings and fail if none of them
are available.
Also fix a testcase reload removed kernel modules to later testcases
and skip some tests on ARM because lack of virtualization support
Fixes#6656
Signed-off-by: Balint Tobik <btobik@redhat.com>
Previously zizmor only mandated pinning of third-party actions,
but has recommended rolling this out to all actions now.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
With enable_numa=true hypervisor will expose host NUMA topology as is:
map vm NUMA nodes to host 1:1 and bind vpus to relates CPUS.
Option "numa_mapping" allows to redefine NUMA nodes mapping:
- map each vm node to particular host node or several numa nodes
- emulate numa on host without numa (useful for tests)
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Co-authored-by: Zvonko Kaiser <zkaiser@nvidia.com>
Similar to the kernel_params annotation, add a
kernel_verity_params annotation and add logic to make these
parameters overwritable. For instance, this can be used in test
logic to provide bogus dm-verity hashes for negative tests.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
This change introduces the kernel_verity_parameters knob to the
Go based shim, picking up dm-verity information in a new config
field (the corresponding build variable is already produced by
the shim build). The change extends the shim to parse dm-verity
information from this parameter and to construct the kernel command
line appropriately, based on the indicated initramfs or kernelinit
build variant.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Add support for CRI-O annotations when fetching pod identifiers for
device cold plug. The code now checks containerd CRI annotations first,
then falls back to CRI-O annotations if they are empty.
This enables device cold plug to work with both containerd and CRI-O
container runtimes.
Annotations supported:
- containerd: io.kubernetes.cri.sandbox-name, io.kubernetes.cri.sandbox-namespace
- CRI-O: io.kubernetes.cri-o.KubeName, io.kubernetes.cri-o.Namespace
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Adding additional cases for the IOMMUFDID method to check for
non-IOMMUFD paths are passed. The method should do the right
thing.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
Logging the QMP commands gives us a lot of flexibility to
troubleshoot issues with what is being sent to QEMU.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
An import cycle was introduced because of a mutual need
for the constant that describes the prefix of IOMMUFD files.
We need to extract this out into a higher-level package.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
The QMP commands sent to QEMU did not properly set up
IOMMUFD objects in the codepath that handles VFIO device
hot-plugging. This is mainly relevant in the Kubernetes
use-case where the VFIO devices are not available when
QEMU is first launched.
Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>
If the sandbox has cold-plugged a IOMMUFD device but the
device-plugins sends us a /dev/vfio/<NUM> device we need to
check if the IOMMUFD device and the VFIO device are the same
We have the sibling.BDF we now need to extract the BDF of the
devPath that is either /dev/vfio/<NUM> or /dev/vfio/devices/vfio<NUM>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>