This addresses an issue where the disable_guest_empty_dir=true code paths did
not take into account that hugepage-backed emptyDirs should always be recreated
in the guest (using guest hugepages).
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
This makes the runtime share the host Kubelet emptyDir folder with the guest
instead of the agent creating an empty folder in the container rootfs. Doing so
enables the Kubelet to track emptyDir usage and evict greedy pods.
In other words, with virtio-fs the container rootfs uses host storage whether
this is true or false, however with true, Kata uses the k8s emptyDir folder so
the sizeLimit is properly enforced by k8s.
Addresses the ephemeral storage part of #12203.
History:
* Initially, emptyDirs are slow because they are shared from the host with 9p.
https://github.com/kata-containers/runtime/issues/1472
* To address above, emptyDirs are hardcoded to be created by the agent in the
pause container's rootfs, potentially leveraging devicemapper and improving
perf.
https://github.com/kata-containers/runtime/pull/1485
* The previous PR regressed an (interesting?) use case where emptyDirs were
used to share data from the host to the guest, so the behavior was made
configurable and `disable_guest_empty_dir = false` is introduced, defaulting
to the behavior of the previous PR.
https://github.com/kata-containers/kata-containers/pull/2056
* Another resource accounting regression remains which is addressed in this PR.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Add k8s-vm-templating-test.bats which exercises pod create
with the factory initialized on the target node.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Add support for VM Template factory on the clh path.
In order to support snapshot/restore-based VM templating,
the following changes were needed:
1. For clh.go, implement SaveVM, PauseVM, restoreVM, ResumeVM
2. Remove initrd config check for VM Templating path. The
root disk image (when using image mode) is created in memory
and therefore captured in the VM snapshot.
3. Truncate the memory file to the size of the VM at factory VM
create time. This allows CLH to use the memory file
as the backing for the template VM memory, allowing O(1)
snapshot times.
4. CLH uses memory zones as backing for its memory on the template paths
5. Update StartVM in CLH to use the restore path when template is
configured and available
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
Add a `visible_cdi_devices` TOML option to the Go runtime so the
agent.visible_cdi_devices=true kernel parameter is emitted to the guest
when enabled. Wire the option through the NVIDIA GPU configuration
templates and add tests verifying the kernel-params flow.
Signed-off-by: LandonTClipp <lclipp@coreweave.com>
Block-mode volumes (e.g. Kubernetes volumeDevices) are passed to the
container as device nodes in spec.Linux.Devices and carry no mount "ro"
option. Their read-only intent is expressed only via the cgroup device
access in spec.Linux.Resources.Devices ("rm" = read+mknod, no write, for
read-only; "rwm" for read-write).
The device path ignored that signal: newLinuxDeviceInfo() built the
DeviceInfo without ever setting ReadOnly (it only consumed FileMode, the
node permission bits, not the read/write access), so the device was
always attached read-write.
This is problematic for filesystems such as XFS, which inspect the block
device read-only state to decide whether to attempt journal/log recovery.
When the guest device is writable, XFS tries to replay the log even for a
read-only mount, which fails badly. Mounting "-o ro" in the guest is not
enough; the device itself must advertise read-only, which only happens
when the VMM opens the backing device read-only (DeviceInfo.ReadOnly ->
BlockDrive.ReadOnly -> qemu read-only=on / clh Readonly).
Derive the read-only flag from two independent signals, combined with OR
so either one marks the device read-only:
- the cgroup device access rule that exactly matches the device, so a
block-mode volume marked read-only by the orchestrator (e.g. a pod
volume with persistentVolumeClaim.readOnly: true) is honored, and
- the host block device's own read-only flag (queried via the BLKROGET
ioctl). Block-mode volumes frequently carry no read-only signal in
the OCI spec at all, so the device flag is often the only reliable
source.
The BLKROGET probe is shared (pkg/device/config.BlockDeviceIsReadOnly,
Linux-only with a stub on other platforms) between the device-node path
(newLinuxDeviceInfo, probing /dev/block/<major>:<minor>) and the
bind-mounted/filesystem block path (createDeviceInfo). None of this
relies on external host tooling such as "blockdev --setro".
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor
Remove the Go runtime file_mem_backend and valid_file_mem_backends
config knobs, along with the corresponding sandbox annotation handling.
The runtime still enables file-backed shared memory automatically for
virtio-fs by using /dev/shm as the backing directory. This only removes
the user-selectable backend path.
Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Assisted-by: OpenAI Codex <codex@openai.com>
QEMU maxcpus enables CPU hotplug capabilities but it's unused when
confidential guest is enabled.
Change Go runtime code to skip setting maxcpus QEMU cmdline if CPU hotplug
is not needed.
Commit 07db945b09 built a relationship between kernel's cmdline nr_cpus and
the maxcpus config. Now that maxcpus is dropped for confidential guests, drop
nr_cpus from kernel commandline too. This hopefully helps with the reference
values computation too.
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
The NVIDIA BF3 SR-IOV device plugin injects the VF BDF only as a
PCIDEVICE_* environment variable; it does not add the VFIO char device
to linux.devices in the OCI spec. As a result the agent's
container_has_vfio_device() gate stays closed and
expose_guest_infiniband_devices() is never triggered — leaving
/dev/infiniband absent from the container even though the guest kernel
created the IB devices (mlx5_core.rdma.0 probes successfully).
Add appendPhysicalEndpointDevices() which runs after appendDevices()
in createContainer(). It walks the sandbox network endpoints; for
each PhysicalEndpoint with a resolved guest PCI path it derives the
VFIO group char path from sysfs (iommu_group symlink) and synthesises
a vfio-pci-gk Device entry. Both legacy group paths (/dev/vfio/N)
and iommufd cdev paths (/dev/vfio/devices/vfioN) are supported by
reading the iommu_group sysfs symlink.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Without an admin MAC, the guest's mlx5_core inherits the VF's
firmware-default MAC. This MAC differs from the IB port's HCA MAC, so
mlx5_ib's GID cache refuses to populate
/sys/class/infiniband/mlx5_*/ports/N/gids/*. RoCE then appears active
(port = ACTIVE, link_layer = Ethernet) but every verb that needs a GID
— RoCEv2 packets, address handles, librdmacm bind — fails silently.
Push the CNI-assigned MAC down to the VF as an "admin MAC" via the PF
using RTM_SETLINK before the bind-to-vfio-pci step. The firmware
applies the admin MAC during the VF reset that accompanies the
unbind/rebind cycle, so the guest sees a single consistent MAC across
netdev, IB port, and HCA.
Best-effort: failures are logged at warn and the existing agent-side
MAC reconciliation (rpc.rs::update_interface) remains as a fallback for
L2/L3 connectivity.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
For QEMU cold-plug + guest-kernel mode the guest BDF of a cold-plugged
VFIO device is auto-allocated at boot (each pcie-root-port is added with
chassis=N,slot=N but no pinned addr=, so QEMU picks the next free slot
on pcie.0). The hot-plug path already queries QMP via qomGetPciPath;
reuse that same mechanism for cold-plugged devices.
Add ResolveColdPlugVFIOGuestPciPaths to the Hypervisor interface.
Implement it in qemu.go using qomGetPciPath. Add no-op stubs for all
other hypervisors.
Call it at the start of setupNetworks so that the PCI paths are resolved
before generateVCNetworkStructures emits the agent Interface proto. Also
stamp the resolved path onto PhysicalEndpoints (used by SR-IOV VFs
exposed as physical network devices) so that update_interface carries a
non-empty devicePath. Without devicePath the agent falls back to a
by-MAC link lookup which fails when the VF firmware MAC differs from the
CNI-assigned MAC after the vfio-pci unbind/rebind cycle.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Container.createDevices was dropping cold-plug VFIO entries from the
container's deviceInfos whenever vfio_mode = "guest-kernel", which
in turn meant the agent's CreateContainer request carried no
vfio-pci-gk device entry and sandbox.pcimap[cid] stayed empty. The
SR-IOV device plugin still set PCIDEVICE_<RES>=<host-BDF> on the
workload container, so update_env_pci then aborted with
"No PCI mapping found for container <id>" and the container failed
with CrashLoopBackOff.
Include cold-plug VFIO devices in deviceInfos for both VFIO modes.
The existing vfio-pci-gk agent handler returns dev: None (so
/dev/vfio/<group> is not materialised in the container spec, and
constrainGRPCSpec(stripVfio=true) already strips it from the grpc
spec for guest-kernel mode), while still recording the host->guest
PCI mapping into sandbox.pcimap[cid] so env-var translation works.
devManager.NewDevice calls FindDevice first, which matches the
already cold-plugged sandbox-level device by HostPath / major / minor,
so this does not double-attach.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
Populate missing VFIO guest PCI paths via QMP before serializing
container devices so guest-kernel PCI env translation has the mappings
it needs.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Assisted-by: Cursor <cursoragent@cursor.com>
In the xConnectVMNetwork path, we have queues = 0 as a baseline,
set to h.HypervisorConfig().NumVCPUs() iff h.Capabilities() advertise
MultiQueueSupport. This is certainly incorrect as we always want, as
a baseline, at least one queue pair. Make queues := 1 by default
to ensure the NetworkPair has at least one queue pair for all
virtio-net paths.
Signed-off-by: Cameron Baird <cameronbaird@microsoft.com>
When NUMA placement is active and VFIO devices are cold-plugged,
create a pxb-pcie (PCIe Expander Bridge) per NUMA node that has
devices. Each pxb-pcie carries a numa_node property that gives the
guest kernel correct NUMA affinity for all PCI devices beneath it.
Root ports are created on each pxb-pcie bus instead of pcie.0, and
VFIODevice.Attach() assigns each device to the root port on its host
NUMA node's pxb bridge. Non-VFIO devices remain on pcie.0.
NUMA placement is "active" when there is more than one guest NUMA
node OR a single guest node mapped to a specific host node (the
latter happens when maybeRightSizeAutoNUMA() collapses a multi-node
sandbox to the GPU's host NUMA node). In both cases
buildNUMATopology() also emits the matching
memory-backend-ram,host-nodes=,policy=bind entries so guest memory
is sourced from the right host node.
So pxb-pcie can never capture a leaf virtio-pci device as the
default bus, every virtio-pci device emitter (NetDevice, VSOCK,
vhost-user-{net,scsi,blk,fs}) now appends bus=pcie.0 explicitly when
the machine actually exposes a pcie.0 root. Detection is done via a
new hasPCIeRoot() helper that returns true only for q35/virt machine
types — ppc64le's pseries (pci.0), s390x's s390-ccw-virtio (CCW
transport) and microvm (no PCI) intentionally skip the pin to avoid
"Bus 'pcie.0' not found" at startup.
This is the only QEMU mechanism that works for both regular and
confidential (TDX/SNP) guests, as it operates through the PCI bus
hierarchy rather than ACPI table injection.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
QEMU may not have spawned all vCPU threads when pinning starts, so
query_cpus_fast can return an incomplete list and leave some vCPUs
unpinned. To fix it, let's add exponential backoff retries before
pinning and fall back to available threads if retries are exhausted.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Make checkVCPUsPinning() NUMA-aware: when GuestNUMANodes are configured,
vCPU threads are pinned to host CPUs belonging to the same NUMA node as
the vCPU's guest NUMA node assignment via checkVCPUsPinningNUMA(),
preserving memory locality. vCPUs are distributed proportionally across
NUMA nodes, matching the distribution in buildNUMATopology().
Stop unconditionally stripping cpuset.mems in constrainGRPCSpec() and
container update(). When multi-NUMA is configured, translate host NUMA
node IDs to guest NUMA node IDs using translateHostMemsToGuest() before
forwarding to the agent. This allows the agent to enforce NUMA-aware
memory placement for containers.
Filter guest NUMA nodes at VM creation time: before calling CreateVM(),
prune GuestNUMANodes to only those whose HostCPUs intersect the sandbox
cpuset. This avoids exposing fake NUMA topology to the guest when
Kubernetes allocates CPUs from fewer nodes than the host has (e.g. all
CPUs from node 0 on a 2-node host), improving memory locality and
avoiding unnecessary cross-node memory traffic.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add sysfs-based host NUMA distance reading (GetHostNUMADistances) that
parses /sys/devices/system/node/nodeN/distance to mirror the host NUMA
distance matrix into the guest via -numa dist entries.
Implement buildNUMATopology() which translates the GuestNUMANodes
configuration into govmm NUMANode and NUMADist slices. Each guest NUMA
node gets a floor-divided share of vCPUs and memory, with the last node
absorbing any remainder. This handles the common Kata case of +1 VMM
overhead vCPU gracefully. Memory backends are selected based on
hugepages/virtio-fs/file-backed-mem configuration.
Guard multi-NUMA topology generation to amd64 and arm64 only, since
other architectures (s390x, riscv64) do not support QEMU NUMA/DIMM.
Wire buildNUMATopology() into CreateVM so the QEMU config includes NUMA
nodes and distances.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Add PCISysFsDevicesNUMANode property and GetPCIDeviceNUMANode() helper
to read /sys/bus/pci/devices/<BDF>/numa_node when discovering VFIO
devices. Store the result in the new NUMANode field on VFIODev (-1 for
unknown/no affinity).
Wire NUMA node detection into both GetAllVFIODevicesFromIOMMUGroup()
(legacy VFIO path) and GetDeviceFromVFIODev() (IOMMUFD path) so every
discovered VFIO device carries its host NUMA node.
Add validateVFIODeviceNUMAPlacement() which runs at the end of
buildNUMATopology(). It checks every cold-plugged VFIO device's host
NUMA node against the guest NUMA topology and logs a warning if a device
is on a host NUMA node not covered by any guest NUMA node (indicating
potential cross-NUMA memory access overhead), or an info message
confirming correct placement.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Make cpuTopology() NUMA-aware by accepting a numNUMANodes parameter.
When multiple NUMA nodes are configured, restructure the SMP topology so
that Sockets=numNUMA and Cores=ceil(maxvcpus/numNUMA), grouping vCPUs by
socket per NUMA node. Use ceiling division so that uneven vCPU counts
(e.g. the +1 VMM overhead vCPU that Kata adds) produce a QEMU-valid SMP
topology where MaxCPUs == Sockets * Cores * Threads.
When numNUMANodes <= 1, the existing flat topology (Sockets=maxvcpus,
Cores=1) is preserved.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Docker 29.5+ adds a private time namespace to container bundles by
default, but kata agent only supports the classic namespace set and
then fails with "invalid namespace type".
Let's strip time namespaces in both the Go and rust runtimes before the
spec reaches the agent, matching how network and cgroup namespaces are
handled.
Fixes: #13080
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
When the jailer is in use (the default for kata-fc), cmd.Process.Pid in
fcInit() is the jailer's PID, not firecracker's. The jailer forks +
execs firecracker as a separate child and exits. fc.info.PID was
therefore stored as the (soon-to-be-dead) jailer PID.
At sandbox shutdown, fcEnd() calls WaitLocalProcess(fc.info.PID, SIGTERM, ...).
syscall.Kill on the dead jailer PID returns ESRCH, WaitLocalProcess
returns nil immediately, and the real firecracker microVM never
receives a signal. It gets reparented to init and stays alive
indefinitely, holding open resources from the host. Over many
container lifecycles this becomes a serious resource leak.
Read the real PID from <jailerRoot>/firecracker.pid, which firecracker
itself writes after the exec. Update fc.info.PID with that value so all
downstream code (fcEnd, Save/Load, kill-0 alive checks, NewProc) operates
on the actual firecracker process.
Also fix a small adjacent bug in Sandbox.Stop where the per-container
teardown loop ignored the force flag, causing any container.stop error
to short-circuit Stop before stopVM ran.
Signed-off-by: Sebastian Wolf <swolf@nvidia.com>
We sometimes get this error when creating the pod sandbox:
failed to create shim task: Failed to add qdisc for network index 2 : device or resource busy.
Adding a linear backoff retry when adding the qdisc to help mitigate the issue at the source and avoid the cascading error.
Signed-off-by: Mayeul Blanzat <mayeul.blanzat@datadoghq.com>
Because intptr() returns a fresh pointer on every call, those comparisons compared addresses,
never values, so every check evaluated to false.
As a result /dev/null, /dev/urandom, /dev/ptmx, /dev/loop-control and /dev/loop*
were appended to devices allowlist for sandbox_cgroup
even when the runtime spec already listed them, producing duplicate entries.
Switch to nil-safe value comparisons via a type switch on the cgroup device type
and dereferenced *d.Major / *d.Minor,
keeping the same detection semantics but actually matching existing entries.
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
When sandbox_cgroup_only is enabled, the kata shim threads inherit
the sandbox device cgroup. For container rootfs whose mount source
is a regular file backed by a loop device (notably the blockfile
snapshotter), containerd's mount package opens /dev/loop-control to
allocate a free /dev/loopN and then opens that block node to attach
the backing file. Neither device is on the sandbox cgroup allowlist,
so both opens fail with EPERM.
This change adds /dev/loop-control (char 10:237) and the /dev/loopN
block nodes (block major 7, any minor) to the sandbox device cgroup
allowlist when sandbox_cgroup_only is true, mirroring the existing
treatment of /dev/null, /dev/urandom and /dev/ptmx. The additions
are gated on SandboxCgroupOnly because that is the only mode in
which the shim itself is constrained by this cgroup.
Assisted-By: Claude 4.7
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Now shipped in the vendored code tarball.
Drop the git tree status check since it isn't needed anymore.
Also stop building with `-mod=vendor`. This requires to
expose GOMODCACHE as suggested by Fabiano Fidêncio.
Signed-off-by: Greg Kurz <groug@kaod.org>
Interfaces whose drivers do not register a specific netlink kind
(e.g. mlx5 Scalable Functions) are reported with the generic type
"device". The endpoint creation code did not handle this type,
causing sandbox creation to fail with:
"Unsupported network interface: device"
This is particularly visible on arm64 with Mellanox ConnectX NICs
using Scalable Functions, where the ethtool BusInfo returns a
non-PCI identifier (e.g. "mlx5_core.sf.4") so isPhysicalIface()
cannot classify the interface as physical either.
Handle "device" type interfaces the same way as veth endpoints,
connecting them through a TAP + TC-filter bridge.
Additionally, relax getLinkForEndpoint() for VethEndpoint so it
accepts the concrete link type returned by the kernel instead of
asserting *netlink.Veth. A "device" type interface wrapped in a
VethEndpoint returns *netlink.Device from LinkByName(), which
would fail the strict type assertion. All callers only need
link.Attrs(), so accepting any link type is safe.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
This change mirrors host networking into the guest as before, but now also
includes the default gateway neighbor entry for each interface.
Pods using overlay/synthetic gateways (e.g., 169.254.1.1) can hit a
first-connect race while the guest performs the initial ARP. Preseeding the
gateway neighbor removes that latency and makes early connections (e.g.,
to the API Service) deterministic.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add a new extensible GetDiagnosticData RPC that retrieves diagnostic
information from the guest VM. The request carries a log_type string
field to specify what kind of data is requested, and a container_id
field to identify the target container.
The first supported log_type is "termination_log", which reads the
Kubernetes termination message file from inside the guest. This is
needed for shared_fs=none configurations where the host cannot
directly access the guest filesystem.
On the Go runtime side, the container stop() path now calls
GetDiagnosticData to copy the termination message to the host
when running with NoSharedFS and the terminationMessagePolicy
annotation is set to "File". The call is best-effort: failures
are logged as warnings rather than blocking container teardown.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Signed-off-by: Silenio Quarti <silenio_quarti@ca.ibm.com>
This is not related to this PR, but rather to #12734, which ended up not
running the `make src/agent generate-protocols`.
While here, let's also fix it.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The hardcoded DEFAULT_LAUNCH_PROCESS_TIMEOUT of 6 seconds in the kata
agent is insufficient for environments with NVIDIA GPUs and NVSwitches,
where the attestation-agent needs significantly more time to collect
evidence during initialization (e.g. ~2 seconds per NVSwitch).
When the timeout expires, the agent (PID 1) exits with an error, causing
the guest kernel to perform an orderly shutdown before the
attestation-agent has finished starting.
Make this timeout configurable via the kernel parameter
agent.launch_process_timeout (in seconds), preserving the 6-second
default for backward compatibility. The Go runtime is wired up to pass
this value from the TOML config's [agent.kata] section through to the
kernel command line.
The NVIDIA GPU configs set the new default to 15 seconds.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
Made-with: Cursor
Docker 26+ configures container networking (veth pair, IP addresses,
routes) after task creation rather than before. Kata's endpoint scan
runs during CreateSandbox, before the interfaces exist, resulting in
VMs starting without network connectivity (no -netdev passed to QEMU).
Add RescanNetwork() which runs asynchronously after the Start RPC.
It polls the network namespace until Docker's interfaces appear, then
hotplugs them to QEMU and informs the guest agent to configure them
inside the VM.
Additional fixes:
- mountinfo parser: find fs type dynamically instead of hardcoded
field index, fixing parsing with optional mount tags (shared:,
master:)
- IsDockerContainer: check CreateRuntime hooks for Docker 26+
- DockerNetnsPath: extract netns path from libnetwork-setkey hook
args with path traversal protection
- detectHypervisorNetns: verify PID ownership via /proc/pid/cmdline
to guard against PID recycling
- startVM guard: rescan when len(endpoints)==0 after VM start
Fixes: #9340
Signed-off-by: llink5 <llink5@users.noreply.github.com>
Add two new configuration knobs that control the logical and physical
sector sizes advertised by virtio-blk devices to the guest:
block_device_logical_sector_size (config file)
block_device_physical_sector_size (config file)
io.katacontainers.config.hypervisor.blk_logical_sector_size (annotation)
io.katacontainers.config.hypervisor.blk_physical_sector_size (annotation)
The annotation names are abbreviated relative to the config file keys
because Kubernetes enforces a 63-character limit on annotation name
segments, and the full names would exceed it.
Both settings default to 0 (let QEMU decide). When set, they are passed
as logical_block_size and physical_block_size in the QMP device_add
command during block device hotplug.
Setting logical_sector_size smaller then container filesystem
block size will cause EINVAL on mount. The physical_sector_size can
always be set independently.
Values must be 0 or a power of 2 in the range [512, 65536]; other
values are rejected with an error at sandbox creation time.
Signed-off-by: PiotrProkop <pprokop@nvidia.com>
Enable VFIO device pass-through at VM creation time on Cloud Hypervisor,
in addition to the existing hot-plug path.
Signed-off-by: Roaa Sakr <romoh@microsoft.com>
Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks
with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI
vendor/class pairs to CDI device kinds. This makes it straightforward
to add support for new device types by adding entries to the table.
Refactor siblingAnnotation to resolve device BDFs once upfront and
reuse them for both CDI type detection and sibling matching, eliminating
redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches)
are skipped with errNoSiblingFound, while known device types that fail
to match a sibling produce a hard error.
Consolidate the hot-plug and cold-plug device loops into a single loop
over extracted container paths, removing duplicated filtering logic.
Export GetPCIDeviceProperty from the device drivers package to allow
vendor/class lookup from sysfs in the container annotation path.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
* Introduces the `emptydir_mode` config flag to allow instructing the runtime
to create a block device for emptyDir volumes.
* The block device is created in the original emptyDir folder on the host
so that Kubelet can monitors its disk usage and evict the pod if it exceeds
its sizeLimit. This matches runc and virtio-fs.
* The block device's disk image file is sparse to minimize host disk
footprint.
Fixes: #10560
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Some VMMs support plugging a disk as an image file instead of a block device,
so we adapt the runtime to support that.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
Co-authored-by: Aurélien Bombo <abombo@microsoft.com>
Virtio-mmio transport is not hardened for confidential computing (unlike
virtio-pci). Reject config that would use virtio-blk-mmio for rootfs/block
when confidential_guest is set, so CoCo guests only use virtio-blk-pci.
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
The remote hypervisor delegates VM creation to a remote service.
The VM runs on cloud infrastructure, not the local host kernel.
So requiring a KVM/MSHV device is semantically wrong and would
cause a hard failure on any host where these devices are absent
(e.g., a VM that doesn't expose nested virtualization).
Skip sandboxDevices() entirely when the configured hypervisor type
is remoteHypervisor{}.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Stop computing and setting mem-reserve and pref64-reserve on PCIe root
ports and switch ports. Remove getBARsMaxAddressableMemory() which
scanned host GPU BARs to pre-calculate these values.
The previous approach only considered GPU devices (IsGPU(), class
0x0302) when scanning for BAR sizes, so devices like NVSwitches (class
0x0680) with their 32MB non-prefetchable BAR0 were not accounted for
and received the 4MB default. Additionally, GetTotalAddressableMemory()
classifies BARs by 32/64-bit address width rather than by the
prefetchable flag that QEMU's mem-reserve vs pref64-reserve maps to.
Modern QEMU introspects VFIO device BARs when they are attached to
root ports and sizes the MMIO windows accordingly. Modern OVMF
(edk2-stable202502+) automatically calculates the 64-bit PCI MMIO
aperture based on the BARs of actually present devices during PCI
enumeration. Omitting the reserve parameters lets QEMU and OVMF
handle MMIO window sizing correctly for all device types including
GPUs, NVSwitches, and NICs without requiring host-side BAR scanning.
This also removes the nvpci dependency from qemu_arch_base.go.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Modern OVMF (edk2-stable202502 and later) automatically sizes the
64-bit PCI MMIO aperture based on the BARs of actually attached
devices during PCI enumeration. The opt/ovmf/X-PciMmio64Mb fw_cfg
hint is no longer needed to ensure large-BAR devices like NVIDIA
GPUs receive adequate MMIO space.
The previous approach was fragile: the runtime scanned host PCI
devices to estimate the required aperture size, but only considered
GPU devices (class 0x0302), missing NVSwitches and other devices
with large BARs. Removing this code avoids confusion about MMIO
sizing responsibility.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>