Commit Graph

219 Commits

Author SHA1 Message Date
Xynnn007
91bb6b7c34 runtime: add support for io.katacontainers.config.runtime.cc_init_data
io.katacontainers.config.runtime.cc_init_data specifies initdata used by
the pod in base64(gzip(initdata toml)) format. The initdata will be
encapsulated into an initdata image and mount it as a raw block device
to the guest.

The initdata image will be aligned with 512 bytes, which is chosen as a
usual sector size supported by different hypervisors like qemu, clh and
dragonball.

Note that this patch only adds support for qemu hypervisor.

Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
2025-04-15 16:35:59 +08:00
Jakob Naucke
a084b99324
virtcontainers: Separate PCI/CCW for net devices
On s390x, virtio-net devices should use CCW, alongside a different
device path. Use accordingly.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:43 +01:00
Jakob Naucke
2aa523f08a
virtcontainers: Fix virtio-net-ccw address format
Hex device number was formatted as hex twice, thus encoding the string
as hex.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2025-02-26 11:36:43 +01:00
Zvonko Kaiser
9add633258 qemu: Add command line for IOMMUFD
For each IOMMUFD device create an object and assign
it to the device, we need additional information that
is populated now correctly to decide if we run the old VFIO
or new VFIO backend.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-02-20 18:27:50 +00:00
Zvonko Kaiser
e82fdee20f runtime: Add proper IOMMUFD parsing
With newer kernels we have a new backend for VFIO
called IOMMUFD this is a departure from VFIO IOMMU Groups
since it has only one device associated with an IOMMUFD entry.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-01-15 23:39:33 +00:00
wangyaqi54
cf4b81344d runtime: Failed to clean up resources when QEMU is terminated by signal 15
When QEMU is terminated by signal 15, it deletes the PidFile.
Upon detecting that QEMU has exited, the shim executes the stopVM function.
If the PidFile is not found, the PID is set to 0.
Subsequently, the shim executes `kill -9 0`, which terminates the current process group.
This prevents any further logic from being executed, resulting in resources not being cleaned up.

Signed-off-by: wangyaqi54 <wangyaqi54@jd.com>
2024-10-22 17:04:46 +08:00
Greg Kurz
33eaf69d5f virtcontainers: Drop QEMU's Daemonize knob
QEMU isn't started as daemon anymore and this won't change (see #5736
for details). Drop the related code.

Signed-off-by: Greg Kurz <groug@kaod.org>
2024-06-11 19:55:54 +02:00
Zvonko Kaiser
c7b41361b2 gpu: reintroduce pcie_root_port and add pcie_switch_port
In Kubernetes we still do not have proper VM sizing
at sandbox creation level. This KEP tries to mitigates
that: kubernetes/enhancements#4113 but this can take
some time until Kube and containerd or other runtimes
have those changes rolled out.

Before we used a static config of VFIO ports, and we
introduced CDI support which needs a patched contianerd.
We want to eliminate the patched continerd in the GPU case
as well.

Fixes: #8860

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2024-05-27 10:13:01 +00:00
Fabiano Fidêncio
416d00228c
Revert "qemu: tdx: Adapt command line" (partially)
This reverts commit b7cccfa019.

The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using.  With that in mind, let's
revert its usage in the code.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2024-05-09 07:59:12 +02:00
Greg Kurz
be8f0cb520
Merge pull request #9402 from deagon/feat/debug-threads
qemu: show the thread name when enable the hypervisor.debug option
2024-04-08 11:04:36 +02:00
Fabiano Fidêncio
b7cccfa019
qemu: tdx: Adapt command line
This commit is a mess, but I'm not exactly sure what's the best way to
make it less messy, as we're getting QEMU TDX to work while partially
reverting 1e34220c41.

With that said, let me cover the content of this commit.

Firstly, we're reverting all the changes related to
"memory-backend-memfd-private", as that's what was used with the
previous host stack, but it seems it
didn't fly upstream.

Secondly, in order to get QEMU to properly work with TDX, we need to
enforce the 'private=on' knob and use the "memory-backend-ram", and
we're doing so, and also making sure to test the `private=on` newly
added knob.

I'm sorry for the confusion, I understand this is not optimal, I just
don't see an easy path to do changes without leaving the code broken
during those changes.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2024-04-05 19:51:27 +02:00
Fabiano Fidêncio
6b4cc5ea6a
Revert "qemu: tdx: Workaround SMP issue with TDX 1.5"
This reverts commit d1b54ede29.

 Conflicts:
	src/runtime/virtcontainers/qemu.go

This commit was a hack that was needed in order to get QEMU + TDX to
work atop of the stack our CI was running on.  As we're moving to "the
officially supported by distros" host OS, we need to get rid of this.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2024-04-05 10:23:52 +02:00
Guoqiang Ding
cd0c31e185 qemu: show the thread name when enable the hypervisor.debug option
Add debug-threads=on in the name argument if debug enabled.

Fixes: #9400
Signed-off-by: Guoqiang Ding <dgq8211@gmail.com>
2024-04-03 10:36:52 +08:00
Hyounggyu Choi
540a2a7fb1 runtime: Allow no initrd path for IBM Z Secure Execution
This is to reintroduce a configuration rule for IBM Z Secure Execution,
where no initrd path should be configured. For the TEE of interest,
only a kernel image should be specified with `confidential_guest=true`.

Fixes: #8692

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2023-12-19 11:21:16 +01:00
Fabiano Fidêncio
e477ed0e86 runtime: Improve vCPU allocation for the VMMs
First of all, this is a controversial piece, and I know that.

In this commit we're trying to make a less greedy approach regards the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.

The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs.  However, it leads us to, in several cases, be
wasting one vCPU.

Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.

In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs

With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
describe above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPUs
* Starts / Updates the VMM to use 1 vCPUs

In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).

There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else.  Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.

One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result.  It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.

Fixes: #6909

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2023-11-10 18:25:57 +01:00
Archana Shinde
a6272733e7 network: Fix network hotplug for ipvlan and macvlan endpoints.
Since moving from network coldplug to hotplug, the only case verified
was veth endpoints. Support for network hotplug for ipvlan and macvlan was
broken/not added. Fix it.

Fixes: #8391

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2023-11-07 10:13:51 -08:00
Jianyong Wu
f9c9d8f645 runtime: QemuVirt: hotadd virtio-mem dev to pcie root port
Hotplug virtio-mem device to pcie root port for Qemu Virt.

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Jianyong Wu
ef18c9550c runtime:qemuvirt: hotadd net dev to pcie root port
Hotplug network device to pcie root port as this is the only way on
QemuVirt.

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Jianyong Wu
f1aec98f9d qemu/virt: use pcie_root_port to do device hotplug for virt
ACPI PCI device hotplug on qemu virt is not supported. The only way to
hotplug pci device is pcie native way. Thus we need create pcie root
port as default.

Pcie root port number depends on following:
1. reserved one for network device as default;
2. virtio-mem dev;
3. add enough port for vhost user blk dev;

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Greg Kurz
1f16b6627b runtime/qemu: Rework QMP/HMP support
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.

The HMP monitor allows to temper with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.

The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.

A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".

An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.

While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.

This feature is only made visible in the base and GPU configurations
of QEMU for now.

Fixes #7952

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-18 12:13:01 +02:00
Greg Kurz
81536f21af runtime/qemu: Pass "--xattr" to virtiofsd instead of "-o xattr"
The "-o" syntax belongs to the legacy C virtiofsd. It is deprecated
with the rust implementation.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-06 17:50:35 +02:00
Fabiano Fidêncio
d1b54ede29 qemu: tdx: Workaround SMP issue with TDX 1.5
`...,sockets=1,cores=numvcpus,threads=1,...` must be used.

Fixes: #7770

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-08-28 13:41:36 +02:00
Archana Shinde
1e34220c41 qemu: tdx: Adapt to the TDX 1.5 stack
QEMU for TDX 1.5 makes use of private memory map/unmap.
Make changes to govmm to support this. Support for private backing fd
for memory is added as knob to the qemu config.

Userspace's map/unmap operations are done by fallocate() ioctl on the
backing store fd.
Reference:
https://lore.kernel.org/linux-mm/20220519153713.819591-1-chao.p.peng@linux.intel.com/

Fixes: #7770

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-08-28 13:41:36 +02:00
Peng Tao
1a0092d631 runtime/qemu: fix image/initrd annotation handling
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.

Add a helper function ImageOrInitrdAssetPath to make sure annotations
are preferred over config options in image and initrd path handling.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-23 03:47:27 +00:00
Zvonko Kaiser
1fc715bc65 s390x: Add AP Attach/Detach test
Now that we have propper AP device support add a
unit test for testing the correct Attach/Detach of AP devices.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-23 13:44:19 +00:00
Zvonko Kaiser
114542e2ba s390x: Fixing device.Bus assignment
The device.Bus was reset if a specific combination of
configuration parameters were not met. With the new
PCIe topology this should not happen anymore

Fixes: #7381

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 07:24:26 +00:00
Peng Tao
581be92b25
Merge pull request #4492 from zvonkok/pcie-topology
runtime: fix PCIe topology for GPUDirect use-case
2023-07-03 09:17:12 +08:00
Fabiano Fidêncio
6a21e20c63 runtime: Add "none" as a shared_fs option
Currently, even when using devmapper, if the VMM supports virtio-fs /
virtio-9p, that's used to share a few files between the host and the
guest.

This *needed*, as we need to share with the guest contents like secrets,
certificates, and configurations, via Kubernetes objects like configMaps
or secrets, and those are rotated and must be updated into the guest
whenever the rotation happens.

However, there are still use-cases users can live with just copying
those files into the guest at the pod creation time, and for those
there's absolutely no need to have a shared filesystem process running
with no extra obvious benefit, consuming memory and even increasing the
attack surface used by Kata Containers.

For the case mentioned above, we should allow users, making it very
clear which limitations it'll bring, to run Kata Containers with
devmapper without actually having to use a shared file system, which is
already the approach taken when using Firecracker as the VMM.

Fixes: #7207

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-06-30 20:45:00 +02:00
Zvonko Kaiser
72f2cb84e6 gpu: Reset cold or hot plug after overriding
If we override the cold, hot plug with an annotation
we need to reset the other plugging mechanism to NoPort
otherwise both will be enabled.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-15 17:51:01 +00:00
Zvonko Kaiser
fbacc09646 gpu: PCIe topology, consider vhost-user-block in Virt
In Virt the vhost-user-block is an PCIe device so
we need to make sure to consider it as well. We're keeping
track of vhost-user-block devices and deduce the correct
amount of PCIe root ports.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-15 17:39:55 +00:00
Zvonko Kaiser
b11246c3aa gpu: Various fixes for virt machine type
The PCI qom path was not deduced correctly added regex for correct
path walking.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:33:57 +00:00
Zvonko Kaiser
40101ea7db vfio: Added annotation for hot(cold) plug
Now it is possible to configure the PCIe topology via annotations
and addded a simple test, checking for Invalid and RootPort

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
8f0d4e2612 vfio: Cleanup of Cold and Hot Plug
Removed the configuration of PCIeRootPort and PCIeSwitchPort, those
values can be deduced in createPCIeTopology

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b5c4677e0e vfio: Rearrange the bus assignemnt
Refactor the bus assignment so that the call to GetAllVFIODevicesFromIOMMUGroup
can be used by any module without affecting the topology.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b1aa8c8a24 gpu: Moved the PCIe configs to drivers
The hypervisor_state file was the wrong location for the PCIe Port
settings, moved everything under device umbrella, where it can be
consumed more easily and we do not get into circular deps.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
55a66eb7fb gpu: Add config to TOML
Update cold-plug and hot-plug setting to include bridge, root and
switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
da42801c38 gpu: Add config settings tests for hot-plug
Updated all references and config settings for hot-plug to match
cold-plug

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
de39fb7d38 runtime: Add support for GPUDirect and GPUDirect RDMA PCIe topology
Fixes: #4491

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Beraldo Leal
0e47cfc4c7 runtime: sending SIGKILL to qemu
There is a race condition when virtiofsd is killed without finishing all
the clients. Because of that, when a pod is stopped, QEMU detects
virtiofsd is gone, which is legitimate.

Sending a SIGTERM first before killing could introduce some latency
during the shutdown.

Fixes #6757.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-05-24 11:31:28 -04:00
Zvonko Kaiser
0fec2e6986 gpu: Add cold-plug test
Cold plug setting is now correctly decoded in toml

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-27 09:30:24 +00:00
Zvonko Kaiser
dded731db3 gpu: Add OVMF setting for MMIO aperture
The default size of OVMFs aperture is too low to
initialized PCIe devices with huge BARs

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
2a830177ca gpu: Add fwcfg helper function
Added driver util function for easier handling of VFIO
devices outside of the VFIO module. At the sandbox level
we may need to set options depending if we have a VFIO/PCIe
device, like the fwCfg for confiential guests.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
c8cf7ed3bc gpu: Add ColdPlug of VFIO devices with devManager
If we have a VFIO device and cold-plug is enabled
we mark each device as ColdPlug=true and let the VFIO
module do the attaching.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Fabiano Fidêncio
50ce33b02d
Merge pull request #6205 from fengwang666/non-root-clh
runtime: support non-root for clh
2023-04-11 19:34:00 +02:00
Jakob Naucke
f666f8e2df agent: Add VFIO-AP device handling
Initial VFIO-AP support (#578) was simple, but somewhat hacky; a
different code path would be chosen for performing the hotplug, and
agent-side device handling was bound to knowing the assigned queue
numbers (APQNs) through some other means; plus the code for awaiting
them was written for the Go agent and never released. This code also
artificially increased the hotplug timeout to wait for the (relatively
expensive, thus limited to 5 seconds at the quickest) AP rescan, which
is impractical for e.g. common k8s timeouts.

Since then, the general handling logic was improved (#1190), but it
assumed PCI in several places.

In the runtime, introduce and parse AP devices. Annotate them as such
when passing to the agent, and include information about the associated
APQNs.

The agent awaits the passed APQNs through uevents and triggers a
rescan directly.

Fixes: #3678
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 10:07:48 +09:00
Jakob Naucke
b546eca26f runtime: Generalize VFIO devices
Generalize VFIO devices to allow for adding AP in the next patch.
The logic for VFIOPciDeviceMediatedType() has been changed and IsAPVFIOMediatedDevice() has been removed.

The rationale for the revomal is:

- VFIODeviceMediatedType is divided into 2 subtypes for AP and PCI
- Logic of checking a subtype of mediated device is included in GetVFIODeviceType()
- VFIOPciDeviceMediatedType() can simply fulfill the device addition based
on a type categorized by GetVFIODeviceType()

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 10:06:37 +09:00
Jakob Naucke
4c527d00c7 agent: Rename VFIO handling to VFIO PCI handling
e.g., split_vfio_option is PCI-specific and should instead be named
split_vfio_pci_option. This mutually affects the runtime, most notably
how the labels are named for the agent.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 07:43:39 +09:00
Feng Wang
cbe6ad9034 runtime: support non-root for clh
This change enables to run cloud-hypervisor VMM using a non-root user
when rootless flag is set true in the configuration

Fixes: #2567

Signed-off-by: Feng Wang <fwang@confluent.io>
2023-02-22 13:57:09 -08:00
zhaojizhuang
ca02c9f512 runtime: add reconnect timeout for vhost user block
Fixes: #6075
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-02-13 14:33:46 +08:00
Greg Kurz
334c4b8bdc runtime: Drop QEMU log file support
The QEMU log file is essentially about fine grain tracing of QEMU
internals and mostly useful for developpers, not production. Notably,
the log file isn't limited in size, nor rotated in any way. It means
that a container running in the VM could possibly flood the log file
with a guest triggerable trace. For example, on openshift, the log
file is supposed to reside on a per-VM 14 GiB tmpfs mount. This means
that each pod running with the kata runtime could potentially consume
this amount of host RAM which is not acceptable.

Error messages are best collected from QEMU's stderr as kata is doing
now since PR #5736 was merged. Drop support for the QEMU log file
because it doesn't bring any value but can certainly do harm.

Fixes #6173

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-31 09:20:29 +01:00