kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2026-05-05 12:02:33 +00:00

Author	SHA1	Message	Date
Zvonko Kaiser	8ff5d164c6	runtime: make CDI annotation vendor-agnostic with lookup table Replace hardcoded NVIDIA vendor ID (0x10de) and class (0x030) checks with a vendor-agnostic lookup table (cdiDeviceKind) that maps PCI vendor/class pairs to CDI device kinds. This makes it straightforward to add support for new device types by adding entries to the table. Refactor siblingAnnotation to resolve device BDFs once upfront and reuse them for both CDI type detection and sibling matching, eliminating redundant sysfs reads. Devices not in the lookup table (e.g. NVSwitches) are skipped with errNoSiblingFound, while known device types that fail to match a sibling produce a hard error. Consolidate the hot-plug and cold-plug device loops into a single loop over extracted container paths, removing duplicated filtering logic. Export GetPCIDeviceProperty from the device drivers package to allow vendor/class lookup from sysfs in the container annotation path. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2026-03-15 09:53:32 -07:00
stevenhorsman	f1960103d1	runtime: Improve split statement strings.SplitN(s, sep, -1) is functionally identical to strings.Split(s, sep) as -1 says to return all substrings, so choose the more concise version Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	8cd3aa8c84	runtime: Remove embedded field from selector GenericDevice is an embedded (anonymous) field in the device struct, so its fields and methods are "promoted" to the outer struct, so we go straight to it. Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:33:04 +00:00
stevenhorsman	cff8994336	runtime: Switch to switch statements Resolve: `QF1003: could use tagged switch on major (staticcheck)` Signed-off-by: stevenhorsman <steven@uk.ibm.com>	2026-02-24 14:22:10 +00:00
LandonTClipp	b50a73912d	runtime: Config test extension for IOMMUFDID Adding additional cases for the IOMMUFDID method to check for non-IOMMUFD paths are passed. The method should do the right thing. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
LandonTClipp	137866f793	runtime: Allow QMP commands to be logged in debug level Logging the QMP commands gives us a lot of flexibility to troubleshoot issues with what is being sent to QEMU. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
LandonTClipp	a3b5764f67	runtime: Fix import cycle and add unit test for IOMMUFDID() An import cycle was introduced because of a mutual need for the constant that describes the prefix of IOMMUFD files. We need to extract this out into a higher-level package. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
LandonTClipp	09438fd54f	runtime: Add IOMMUFD Object Creation for QEMU QMP Commands The QMP commands sent to QEMU did not properly set up IOMMUFD objects in the codepath that handles VFIO device hot-plugging. This is mainly relevant in the Kubernetes use-case where the VFIO devices are not available when QEMU is first launched. Signed-off-by: LandonTClipp <11232769+LandonTClipp@users.noreply.github.com>	2025-12-10 15:46:28 +01:00
Zvonko Kaiser	f8ad17499d	gpu: VFIO handling container vs sandbox If the sandbox has cold-plugged a IOMMUFD device but the device-plugins sends us a /dev/vfio/<NUM> device we need to check if the IOMMUFD device and the VFIO device are the same We have the sibling.BDF we now need to extract the BDF of the devPath that is either /dev/vfio/<NUM> or /dev/vfio/devices/vfio<NUM> Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-12-05 16:53:31 +01:00
Joji Mekkattuparamban	5aa184925a	shim: Support device cold plug with Kubernetes Utilize Kubelet's Pod Resource API to determine device allocations for the Pod during sandbox creation. Use CDI files to translate the device IDs to corresponding device paths and perform device injection. Fixes #12009 Signed-off-by: Joji Mekkattuparamban <jojim@nvidia.com>	2025-11-20 10:58:55 +01:00
Manuel Huber	1561d7fbba	runtime: Clear outer CDI annotations Pod annotations from the outer runtime are being used for cold-plugging CDI devices. We need to ensure that these annotations don't leak into the inner runtime for which specific container (sibling) annotations are being created. Without this change, the inner runtime receives both annotations, leading to failing CDI injection as an outer runtime annotation observed in the guest translates to an unresolvable CDI device, for example, cdi.k8s.io/gpu: "nvidia.com/pgpu=0". Signed-off-by: Manuel Huber <manuelh@nvidia.com>	2025-11-04 23:18:00 +01:00
Fabiano Fidêncio	fd1b8ceed1	runtime: qemu: Add reclaim_guest_freed_memory [BACKPORT] Similar to what we've done for Cloud Hypervisor in the commit `9f76467cb7`, we're backporting a runtime-rs feature that would be benificial to have as part of the go runtime. This allows users to use virito-balloon for the hypervisor to reclaim memory freed by the guest. Signed-off-by: Fabiano Fidêncio <fidencio@northflank.com>	2025-08-22 23:56:47 +02:00
Zvonko Kaiser	97f4bcb456	gpu: Remove CDI annotations for outer runtime After the outer runtime has processed the CDI annotation from the spec we can delete them since they were converted into Linux devices in the OCI spec. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-04-23 21:02:06 +00:00
Zvonko Kaiser	804e5cd332	gpu: IOMMUFD provide proper ID We need a proper ID otherwise QEMU sometimes fails with invalid ID. Use the same pattern as with the old VFIO implementation. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-02-25 16:24:17 +00:00
Zvonko Kaiser	7cca2c4925	gpu: Use a dedicated VFIO group vs iommufd entry We do not want to abuse the sysfsentry lets use a dedicated devfsentry. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-02-20 18:27:52 +00:00
Zvonko Kaiser	9add633258	qemu: Add command line for IOMMUFD For each IOMMUFD device create an object and assign it to the device, we need additional information that is populated now correctly to decide if we run the old VFIO or new VFIO backend. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-02-20 18:27:50 +00:00
Zvonko Kaiser	e82fdee20f	runtime: Add proper IOMMUFD parsing With newer kernels we have a new backend for VFIO called IOMMUFD this is a departure from VFIO IOMMU Groups since it has only one device associated with an IOMMUFD entry. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-01-15 23:39:33 +00:00
Fabiano Fidêncio	fefcf7cfa4	acrn: Drop support As we don't have any CI, nor maintainer to keep ACRN code around, we better have it removed than give users the expectation that it should or would work at some point. Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>	2024-09-19 16:05:43 +02:00
Archana Shinde	1636c201f4	network: Implement network hotunplug for physical endpoints Similar to HotAttach, the HotDetach method signature for network endoints needs to be changed as well to allow for the method to make use of device manager to manage the hot unplug of physical network devices. Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>	2024-07-17 16:42:41 -07:00
Archana Shinde	c6390f2a2a	vfio: Introduce function to get vfio dev path This function will be later used to get the vfio dev path. Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>	2024-07-17 16:42:41 -07:00
Archana Shinde	2fef4bc844	vfio: use driver_override field for device binding. The current implementation for device binding using driver bind/unbind and new_id fails in the scenario when the physical device is not bound to a driver before assigning it to vfio. There exists and updated mechanism to accomplish the same that does not have the same issue as above. The driver_override field for a device allows us to specify the driver for a device rather than relying on the bound driver to provide a positive match of the device. It also has other advantages referenced here: https://patchwork.kernel.org/project/linux-pci/patch/1396372540.476.160.camel@ul30vt.home/ So use the updated driver_override mechanism for binding/unbinding a physical device/virtual function to vfio-pci. Signed-off-by: liangxianlong <liang.xianlong@zte.com.cn> Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>	2024-07-17 16:42:40 -07:00
Zvonko Kaiser	d4832b3b74	vfio: Fix hotpunplug We need to remove the device from the tracking map, a container restart will increment the bus index and we will get out of root-ports and crash the machine. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2024-05-28 07:37:30 +00:00
Zvonko Kaiser	4c93bb2d61	qemu: Add CDI device handling for any container type We need special handling for pod_sandbox, pod_container and single_container how and when to inject CDI devices Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2024-05-27 10:13:01 +00:00
Zvonko Kaiser	c7b41361b2	gpu: reintroduce pcie_root_port and add pcie_switch_port In Kubernetes we still do not have proper VM sizing at sandbox creation level. This KEP tries to mitigates that: kubernetes/enhancements#4113 but this can take some time until Kube and containerd or other runtimes have those changes rolled out. Before we used a static config of VFIO ports, and we introduced CDI support which needs a patched contianerd. We want to eliminate the patched continerd in the GPU case as well. Fixes: #8860 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2024-05-27 10:13:01 +00:00
Wedson Almeida Filho	7e1b1949d4	runtime: add support for kata overlays When at least one `io.katacontainers.fs-opt.layer` option is added to the rootfs, it gets inserted into the VM as a layer, and the file system is mounted as an overlay of all layers using the overlayfs driver. Additionally, if the `io.katacontainers.fs-opt.block_device=file` option is present in a layer, it is mounted as a block device backed by a file on the host. Fixes: #7536 Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>	2023-08-03 17:58:39 -03:00
Zvonko Kaiser	1fc715bc65	s390x: Add AP Attach/Detach test Now that we have propper AP device support add a unit test for testing the correct Attach/Detach of AP devices. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-23 13:44:19 +00:00
Zvonko Kaiser	62aa6750ec	vfio: Added better handling of VFIO Control Devices Depending on the vfio_mode we need to mount the VFIO control device additionally into the container. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-20 13:42:42 +00:00
Zvonko Kaiser	114542e2ba	s390x: Fixing device.Bus assignment The device.Bus was reset if a specific combination of configuration parameters were not met. With the new PCIe topology this should not happen anymore Fixes: #7381 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-20 07:24:26 +00:00
Peng Tao	581be92b25	Merge pull request #4492 from zvonkok/pcie-topology runtime: fix PCIe topology for GPUDirect use-case	2023-07-03 09:17:12 +08:00
Fabiano Fidêncio	6a21e20c63	runtime: Add "none" as a shared_fs option Currently, even when using devmapper, if the VMM supports virtio-fs / virtio-9p, that's used to share a few files between the host and the guest. This needed, as we need to share with the guest contents like secrets, certificates, and configurations, via Kubernetes objects like configMaps or secrets, and those are rotated and must be updated into the guest whenever the rotation happens. However, there are still use-cases users can live with just copying those files into the guest at the pod creation time, and for those there's absolutely no need to have a shared filesystem process running with no extra obvious benefit, consuming memory and even increasing the attack surface used by Kata Containers. For the case mentioned above, we should allow users, making it very clear which limitations it'll bring, to run Kata Containers with devmapper without actually having to use a shared file system, which is already the approach taken when using Firecracker as the VMM. Fixes: #7207 Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>	2023-06-30 20:45:00 +02:00
Zvonko Kaiser	0f454d0c04	gpu: Fixing typos for PCIe topology changes Some comments and functions had typos and wrong capitalization. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-30 08:42:55 +00:00
Zvonko Kaiser	8330fb8ee7	gpu: Update unit tests Some tests are now failing due to the changes how PCIe is handled. Update the test accordingly. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-23 11:16:25 +00:00
Zvonko Kaiser	72f2cb84e6	gpu: Reset cold or hot plug after overriding If we override the cold, hot plug with an annotation we need to reset the other plugging mechanism to NoPort otherwise both will be enabled. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-15 17:51:01 +00:00
Zvonko Kaiser	fbacc09646	gpu: PCIe topology, consider vhost-user-block in Virt In Virt the vhost-user-block is an PCIe device so we need to make sure to consider it as well. We're keeping track of vhost-user-block devices and deduce the correct amount of PCIe root ports. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-15 17:39:55 +00:00
Zvonko Kaiser	b11246c3aa	gpu: Various fixes for virt machine type The PCI qom path was not deduced correctly added regex for correct path walking. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:33:57 +00:00
Zvonko Kaiser	40101ea7db	vfio: Added annotation for hot(cold) plug Now it is possible to configure the PCIe topology via annotations and addded a simple test, checking for Invalid and RootPort Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:20:24 +00:00
Zvonko Kaiser	b5c4677e0e	vfio: Rearrange the bus assignemnt Refactor the bus assignment so that the call to GetAllVFIODevicesFromIOMMUGroup can be used by any module without affecting the topology. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:20:24 +00:00
Zvonko Kaiser	b1aa8c8a24	gpu: Moved the PCIe configs to drivers The hypervisor_state file was the wrong location for the PCIe Port settings, moved everything under device umbrella, where it can be consumed more easily and we do not get into circular deps. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:20:24 +00:00
Zvonko Kaiser	55a66eb7fb	gpu: Add config to TOML Update cold-plug and hot-plug setting to include bridge, root and switch-port Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:20:24 +00:00
Zvonko Kaiser	de39fb7d38	runtime: Add support for GPUDirect and GPUDirect RDMA PCIe topology Fixes: #4491 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-06-14 08:20:24 +00:00
Zvonko Kaiser	2a830177ca	gpu: Add fwcfg helper function Added driver util function for easier handling of VFIO devices outside of the VFIO module. At the sandbox level we may need to set options depending if we have a VFIO/PCIe device, like the fwCfg for confiential guests. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-04-26 09:47:37 +00:00
Zvonko Kaiser	131f056a12	gpu: Extract VFIO Functions to drivers Some functions may be used in other modules then only in the VFIO module, extract them and make them available to other layers like sandbox. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-04-26 09:47:37 +00:00
Zvonko Kaiser	c8cf7ed3bc	gpu: Add ColdPlug of VFIO devices with devManager If we have a VFIO device and cold-plug is enabled we mark each device as ColdPlug=true and let the VFIO module do the attaching. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-04-26 09:47:37 +00:00
Zvonko Kaiser	f4f958d53c	gpu: Do not pass-through PCI (Host) Bridges On some systems a GPU is in a IOMMU group with a PCI Bridge and PCI Host Bridge. Per default no PCI Bridge needs to be passed-through. When scanning the IOMMU group, ignore devices with a 0x60 class ID prefix. Fixes: #6663 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-04-17 10:08:23 +00:00
Hyounggyu Choi	96baa83895	agent: Bring in VFIO-AP device handling again This PR is a continuing work for (kata-containers#3679). This generalizes the previous VFIO device handling which only focuses on PCI to include AP (IBM Z specific). Fixes: kata-containers#3678 Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2023-03-16 18:14:12 +09:00
Jakob Naucke	f666f8e2df	agent: Add VFIO-AP device handling Initial VFIO-AP support (#578) was simple, but somewhat hacky; a different code path would be chosen for performing the hotplug, and agent-side device handling was bound to knowing the assigned queue numbers (APQNs) through some other means; plus the code for awaiting them was written for the Go agent and never released. This code also artificially increased the hotplug timeout to wait for the (relatively expensive, thus limited to 5 seconds at the quickest) AP rescan, which is impractical for e.g. common k8s timeouts. Since then, the general handling logic was improved (#1190), but it assumed PCI in several places. In the runtime, introduce and parse AP devices. Annotate them as such when passing to the agent, and include information about the associated APQNs. The agent awaits the passed APQNs through uevents and triggers a rescan directly. Fixes: #3678 Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>	2023-03-16 10:07:48 +09:00
Jakob Naucke	b546eca26f	runtime: Generalize VFIO devices Generalize VFIO devices to allow for adding AP in the next patch. The logic for VFIOPciDeviceMediatedType() has been changed and IsAPVFIOMediatedDevice() has been removed. The rationale for the revomal is: - VFIODeviceMediatedType is divided into 2 subtypes for AP and PCI - Logic of checking a subtype of mediated device is included in GetVFIODeviceType() - VFIOPciDeviceMediatedType() can simply fulfill the device addition based on a type categorized by GetVFIODeviceType() Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>	2023-03-16 10:06:37 +09:00
Jakob Naucke	4c527d00c7	agent: Rename VFIO handling to VFIO PCI handling e.g., split_vfio_option is PCI-specific and should instead be named split_vfio_pci_option. This mutually affects the runtime, most notably how the labels are named for the agent. Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>	2023-03-16 07:43:39 +09:00
zhaojizhuang	ca02c9f512	runtime: add reconnect timeout for vhost user block Fixes: #6075 Signed-off-by: zhaojizhuang <571130360@qq.com>	2023-02-13 14:33:46 +08:00
liyuxuan.darfux	3bb145c63a	runtime: Support virtiofs queue size for qemu and make it configurable The default vhost-user-fs queue-size of qemu is 128 now. Set it to 1024 by default which is same as clh. Also make this value configurable. Fixes: #5694 Signed-off-by: liyuxuan.darfux <liyuxuan.darfux@bytedance.com>	2022-11-19 15:38:11 +08:00

1 2

56 Commits