kata-containers

mirror of https://github.com/kata-containers/kata-containers.git synced 2025-08-08 03:24:15 +00:00

Author	SHA1	Message	Date
Gao Xiang	9079c8e598	runtime: improve EROFS snapshotter support To better support containerd 2.1 and later versions, remove the hardcoded `layer.erofs` and instead parse `/proc/mounts` to obtain the real mount source (and `/sys/block/loopX/loop/backing_file` if needed). If the mount source doesn't end with `layer.erofs`, it should be marked as unsupported, as it may be a filesystem meta file generated by later containerd versions for the EROFS flattened filesystem feature. Also check whether the filesystem type is `overlay` or not, since the containerd mount manager [1] may change it after being introduced. [1] https://github.com/containerd/containerd/issues/11303 Fixes: `f63ec50ba3` ("runtime: Add EROFS snapshotter with block device support") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2025-06-26 10:12:12 +08:00
Hyounggyu Choi	2fd2cd4a9b	runtime: Preserve hotplug devices for vfio-coldplug mode Fixes: #11288 This commit appends hotplug devices (e.g., persistent volume) to deviceInfos when `vfio_mode` is `vfio` and `cold_plug_vfio` is set to a value other than `no-port`. For details, please visit the issue. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2025-05-19 13:46:49 +02:00
ChengyuZhu6	f63ec50ba3	runtime: Add EROFS snapshotter with block device support - Detection of EROFS options in container rootfs - Creation of necessary EROFS devices - Sharing of rootfs with EROFS via overlayfs Fixes: #11163 Signed-off-by: ChengyuZhu6 <hudson@cyzhu.com>	2025-05-05 23:51:13 +02:00
Zvonko Kaiser	3946435291	gpu: Handle VFIO devices with DevicePlugin and CDI We can provide devices during cold-plug with CDI annotation on a Pod level and add per container device information with the device plugin. Since the sandbox has already attached the VFIO devices, remove them from consideration and just apply the inner runtime CDI annotation. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2025-04-23 21:02:06 +00:00
Hyounggyu Choi	419b5ed715	runtime: Add DeviceInfo to Container for VFIO coldplug configuration Even though ociSpec.Linux.Devices is preserved when vfio_mode is VFIO, it has not been updated correctly for coldplug scenarios. This happens because the device info passed to the agent via CreateContainerRequest is dropped by the Kata runtime. This commit ensures that the device info is added to the sandbox's device manager when vfio_mode is VFIO and coldPlugVFIO is true (e.g., vfio-ap-cold), allowing ociSpec.Linux.Devices to be properly updated with the device information before the container is created on the guest. Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>	2025-01-28 10:53:00 +01:00
Fabiano Fidêncio	eccdffebf7	Merge pull request #10243 from katexochen/nydus-overlayfs-path virtcontainers: allow specifying nydus-overlayfs binary by path	2024-09-19 11:35:45 +02:00
Alex Lyn	1684c1962c	runtime: Fix runtime/cdi panic with assignment to entry in nil map It will panic when users do GPU vfio passthrough with cdi in runtime. The root cause is that CustomSpec.Annotations is nil when new element added. To address this issue, initialization is introduced when it's nil. Fixes #10266 Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>	2024-09-09 20:15:10 +08:00
Paul Meyer	3be719c805	virtcontainers: allow specifying nydus-overlayfs binary by path ...or by using a binary with additional suffix. This allows having multiple versions of nydus-overlayfs installed on the host, telling nydus-snapshotter which one to use while still detecting Nydus is used. Signed-off-by: Paul Meyer <49727155+katexochen@users.noreply.github.com>	2024-09-04 08:29:40 +02:00
Zvonko Kaiser	4c93bb2d61	qemu: Add CDI device handling for any container type We need special handling for pod_sandbox, pod_container and single_container how and when to inject CDI devices Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2024-05-27 10:13:01 +00:00
Zvonko Kaiser	c7b41361b2	gpu: reintroduce pcie_root_port and add pcie_switch_port In Kubernetes we still do not have proper VM sizing at sandbox creation level. This KEP tries to mitigate that: kubernetes/enhancements#4113 but this can take some time until Kube and containerd or other runtimes have those changes rolled out. Before, we used a static config of VFIO ports, and we introduced CDI support which needs a patched containerd. We want to eliminate the patched containerd in the GPU case as well. Fixes: #8860 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2024-05-27 10:13:01 +00:00
yuchen.cc	1cd1558a92	mount: support checking multiple kinds of block device driver Device mapper is the only supported block device driver so far, which seems limiting. Kata Containers can work well with other block devices. It is necessary to enhance supporting of multiple kinds of host block device. Fixes #4714 Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>	2023-12-01 11:59:30 +08:00
ChengyuZhu6	e4f33ac141	runtime: add functions to create devices in KataVirtualVolume The snapshotter will place `KataVirtualVolume` information into 'rootfs.options' and commence with the prefix 'io.katacontainers.volume='. The purpose of this commit is to transform the encapsulated KataVirtualVolume data into device information. Fixes: #8495 Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com> Co-authored-by: Feng Wang <feng.wang@databricks.com> Co-authored-by: Samuel Ortiz <sameo@linux.intel.com> Co-authored-by: Wedson Almeida Filho <walmeida@microsoft.com>	2023-11-23 23:05:13 +08:00
Wedson Almeida Filho	7e1b1949d4	runtime: add support for kata overlays When at least one `io.katacontainers.fs-opt.layer` option is added to the rootfs, it gets inserted into the VM as a layer, and the file system is mounted as an overlay of all layers using the overlayfs driver. Additionally, if the `io.katacontainers.fs-opt.block_device=file` option is present in a layer, it is mounted as a block device backed by a file on the host. Fixes: #7536 Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>	2023-08-03 17:58:39 -03:00
Zvonko Kaiser	cddcde1d40	vfio: Fix vfio device ordering If modeVFIO is enabled we need 1st to attach the VFIO control group device /dev/vfio/vfio and 2nd the actual device(s) afterwards. Sort the devices starting with device #1 being the VFIO control group device and the next the actual device(s) /dev/vfio/<group> Fixes: #7493 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-31 11:26:27 +00:00
Zvonko Kaiser	62aa6750ec	vfio: Added better handling of VFIO Control Devices Depending on the vfio_mode we need to mount the VFIO control device additionally into the container. Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-20 13:42:42 +00:00
Zvonko Kaiser	114542e2ba	s390x: Fixing device.Bus assignment The device.Bus was reset if a specific combination of configuration parameters were not met. With the new PCIe topology this should not happen anymore Fixes: #7381 Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>	2023-07-20 07:24:26 +00:00
Feng Wang	205909fbed	runtime: Fix virtiofs fd leak The kata runtime invokes removeStaleVirtiofsShareMounts after a container is stopped to clean up the stale virtiofs file caches. Fixes: #6455 Signed-off-by: Feng Wang <fwang@confluent.io>	2023-04-26 15:53:39 -07:00
Eric Ernst	f9e96c6506	runtime: device: move to top level package Let's move device package to runtime/pkg instead of being buried under virtcontainers. Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2022-06-26 21:31:29 -07:00
Rafael Fonseca	ce2e521a0f	runtime: remove duplicate 'types' import Fallout of `09f7962ff` Fixes #4285 Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>	2022-05-19 13:49:47 +02:00
Yibo Zhuang	532d53977e	runtime: fsGroup support for direct-assigned volume The fsGroup will be specified by the fsGroup key in the direct-assign mountinfo metadata field. This will be set when invoking the kata-runtime binary and providing the key, value pair in the metadata field. Similarly, the fsGroupChangePolicy will also be provided in the mountinfo metadata field. This adds extra fields FsGroup and FSGroupChangePolicy in the Mount construct for container mount which will be populated when creating block devices by parsing out the mountInfo.json. And in handleDeviceBlockVolume of the kata-agent client, it checks if the mount FSGroup is not nil, which indicates that fsGroup change is required in the guest, and will provide the FSGroup field in the protobuf to pass the value to the agent. Fixes #4018 Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>	2022-04-11 08:41:13 -07:00
Feng Wang	aa5ae6b17c	runtime: Properly handle ESRCH error when signaling container Currently kata shim v2 doesn't translate the ESRCH error, causing the container to fail to stop and the shim to leak. Fixes: #3874 Signed-off-by: Feng Wang <feng.wang@databricks.com>	2022-03-14 11:03:05 -07:00
Feng Wang	e76519af83	runtime: small refactor to improve readability Remove some confusing/duplicate code so it's more readable Fixes: #3454 Signed-off-by: Feng Wang <feng.wang@databricks.com>	2022-03-04 10:00:52 -08:00
Feng Wang	f905161bbb	runtime: mount direct-assigned block device fs only once Mount the direct-assigned block device fs only once and keep a refcount in the guest. Also use the ro flag inside the options field to determine whether the block device and filesystem should be mounted as ro Fixes: #3454 Signed-off-by: Feng Wang <feng.wang@databricks.com>	2022-03-03 18:57:02 -08:00
Feng Wang	c39281ad65	runtime: update container creation to work with direct assigned volumes During the container creation, it will parse the mount info file of the direct assigned volumes and update the in memory mount object. Fixes: #3454 Signed-off-by: Feng Wang <feng.wang@databricks.com>	2022-03-03 18:57:02 -08:00
Eric Ernst	e355a71860	container: file is not linux specific This should not be linux specific -- drop restriction. Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2022-02-28 08:01:53 -08:00
Samuel Ortiz	ad0449195d	virtcontainers: Convert stats dev_t to uint64 We need to convert them to uint64 as their types may differ on various host OSes, but unix.Major|Minor takes a uint64 regardless. Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>	2022-02-28 08:01:53 -08:00
Samuel Ortiz	1103f5a4d4	virtcontainers: Use FilesystemSharer for sharing the containers files Switching to the generic FilesystemSharer brings 2 major improvements: 1. Remove container and sandbox specific code from kata_agent.go 2. Allow for non Linux implementations to provide ways to share container files and root filesystems with the Kata Linux guest. Fixes #3622 Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>	2022-02-25 17:22:27 +01:00
bin	f6fc1621f7	shim: log events for CRI-O CRI-O starts the shim process without setting TTRPC_ADDRESS, so the event-forwarding goroutine gets errors. For the CRI-O runtime, we can log the events to the log file. Fixes: #3733 Signed-off-by: bin <bin@hyper.sh>	2022-02-22 11:02:50 +08:00
Samuel Ortiz	77c29bfd3b	container: Remove VFIO lazy attach handling With the recently added VFIO fixes and support, we should not need that anymore. Fixes #3108 Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>	2022-02-17 08:39:44 +01:00
bin	81a8baa5e5	runtime: add hugepages support Add hugepages support, port from: `b486387cba` Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com> Signed-off-by: bin <bin@hyper.sh>	2022-02-16 15:14:53 +08:00
luodaowen.backend	2d9f89aec7	feature(nydusd): add nydusd support to introduce lazyload ability Pulling the image is the most time-consuming step in the container lifecycle. This PR introduces nydus to kata containers; it can lazily pull the image when the container starts, so it can speed up kata container create and start. Fixes #2724 Signed-off-by: luodaowen.backend <luodaowen.backend@bytedance.com>	2022-02-11 21:41:17 +08:00
bin	85f5ae190e	runtime: close span before return from function in case of error Return before closing span will cause invalid spans, so span should be closed before function return. Fixes: #3424 Signed-off-by: bin <bin@hyper.sh>	2022-01-11 19:45:41 +08:00
Eric Ernst	860f30882a	virtcontainers: move oci, uuid packages top level This will be useful at runtime level; no need for oci or uuid to be subpkg of virtcontainers. While at it, ensure we run gofmt on the changed files. Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2021-11-17 14:12:57 -08:00
bin	09f7962ff1	runtime: merge virtcontainers/pkg/types into virtcontainers/types There are two types packages under virtcontainers, and virtcontainers/pkg/types contains only a little code; merging them into one makes the types package easier to understand and use. Fixes: #3031 Signed-off-by: bin <bin@hyper.sh>	2021-11-12 15:06:39 +08:00
Yujia Qiao	e66d0473be	virtcontainers: simplify read-only mount handling Current handling of read-only mounts is a little tricky. However, a clearer solution can be used here: 1. make a private ro bind mount at privateDest to the mount source 2. make a bind mount at mountDest to the mount created in step 1 3. umount the private bind mount created in step 1 One important aspect is that the mount in step 2 is duplicated from the one we created in step 1. So the MS_RDONLY flag is properly preserved in all mounts created in the propagation. Fixes: #2205 Depends-on: github.com/kata-containers/tests#4106 Signed-off-by: Yujia Qiao <rapiz3142@gmail.com>	2021-10-28 15:48:41 +08:00
Manohar Castelino	4d47aeef2e	hypervisor: Export generic interface methods This is in preparation for creating a separate hypervisor package. Non functional change. Signed-off-by: Manohar Castelino <mcastelino@apple.com>	2021-10-22 16:45:35 -07:00
Chelsea Mafrica	077b77c178	runtime: tracing: Fix logger passed in newContainer Change logger in Trace call in newContainer from sandbox.Logger() to nil. Passing nil will cause an error to be logged by kataTraceLogger instead of the sandbox logger, which will avoid having the log message report it as part of the sandbox subsystem when it is part of the container subsystem. The kataTraceLogger will not log it as related to the container subsystem, but since the container logger has not been created at this point, and we already use the kataTraceLogger in other instances where a subsystem's logger has not been created yet, this PR makes the call consistent with other code. Fixes #2665 Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>	2021-09-17 11:41:04 -07:00
Samuel Ortiz	f17752b0dc	virtcontainers: container: Do not create and manage container host cgroups The only process we are adding there is the container host one, and there is no such thing anymore. Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>	2021-09-14 07:09:33 +02:00
Chelsea Mafrica	8f0f949abf	tracing: Move dynamically added attributes to Trace() Where possible, move attributes added with AddTag() to Trace() call to reduce the amount of code used for tracing. Fixes #2512 Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>	2021-08-27 08:26:40 -07:00
Chelsea Mafrica	8058e97212	tracing: Change runtime tracing tags to vars Tracing tags are stored inconsistently throughout the runtime. Change all instances of tracing tags to variables. Fixes #2512 Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>	2021-08-26 15:55:32 -07:00
Hui Zhu	e6408fe670	Container: Add initConfigResourcesMemory and call it in newContainer The swappiness is not right if only io.katacontainers.container.resource.swappiness is set: $ pod_yaml=pod.yaml $ container_yaml=container.yaml $ image="quay.io/prometheus/busybox:latest" $ cat << EOF > "${pod_yaml}" metadata: name: busybox-sandbox1 EOF $ cat << EOF > "${container_yaml}" metadata: name: busybox-killed-vmm annotations: io.katacontainers.container.resource.swappiness: "100" image: image: "$image" command: - top EOF $ sudo crictl pull $image $ podid=$(sudo crictl runp $pod_yaml) $ cid=$(sudo crictl create $podid $container_yaml $pod_yaml) $ sudo crictl start $cid crictl exec $cid cat /sys/fs/cgroup/memory/memory.swappiness 60 The cause of this issue is that two elements store the resources information. They are c.config.Resources for calculateSandboxMemory and c.GetPatchedOCISpec() for the agent. This adds initConfigResourcesMemory to Container and calls it in newContainer to handle the issue. Fixes: #2372 Signed-off-by: Hui Zhu <teawater@antfin.com>	2021-08-02 16:02:12 +08:00
Hui Zhu	ee90affc18	newContainer: Initialize c.config.Resources.Memory if it is nil Container start fails if io.katacontainers.container.resource.swap_in_bytes and memory_limit_in_bytes are not set. $ pod_yaml=pod.yaml $ container_yaml=container.yaml $ image="quay.io/prometheus/busybox:latest" $ cat << EOF > "${pod_yaml}" metadata: name: busybox-sandbox1 EOF $ cat << EOF > "${container_yaml}" metadata: name: busybox-killed-vmm annotations: io.katacontainers.container.resource.swappiness: "60" image: image: "$image" command: - top EOF $ sudo crictl pull $image $ podid=$(sudo crictl runp $pod_yaml) $ cid=$(sudo crictl create $podid $container_yaml $pod_yaml) $ sudo crictl start $cid DEBU[0000] get runtime connection DEBU[0000] connect using endpoint 'unix:///var/run/containerd/containerd.sock' with '10s' timeout DEBU[0000] connected successfully using endpoint: unix:///var/run/containerd/containerd.sock DEBU[0000] StartContainerRequest: &StartContainerRequest{ContainerId:4fea91d16f661931fe33acd247efe831ef9e571588ba18b5a16f04c278fd61b8,} DEBU[0000] StartContainerResponse: nil FATA[0000] starting the container "4fea91d16f661931fe33acd247efe831ef9e571588ba18b5a16f04c278fd61b8": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim: ttrpc: closed: unknown The cause of the failure is that c.config.Resources.Memory is nil when the values of io.katacontainers.container.resource.swappiness and io.katacontainers.container.resource.swap_in_bytes are stored in newContainer. This commit initializes c.config.Resources.Memory if it is nil in newContainer. Fixes: #2367 Signed-off-by: Hui Zhu <teawater@antfin.com>	2021-08-01 10:03:27 +08:00
Julio Montes	47d95dc1c6	runtime: virtcontainers: fix govet fieldalignment Fix structures alignment fixes #2271 Depends-on: github.com/kata-containers/tests#3727 Signed-off-by: Julio Montes <julio.montes@intel.com>	2021-07-20 11:59:15 -05:00
Hui Zhu	a733f537e5	runtime: newContainer: Handle the annotations of SWAP This commit add code to handle the annotations "io.katacontainers.container.resource.swappiness" and "io.katacontainers.container.resource.swap_in_bytes". It will set the value of "io.katacontainers.resource.swappiness" to c.config.Resources.Memory.Swappiness and set the value of "io.katacontainers.resource.swap_in_bytes" to c.config.Resources.Memory.Swap. Fixes: #2201 Signed-off-by: Hui Zhu <teawater@antfin.com>	2021-07-19 23:20:46 +08:00
Benjamin Porter	b10e3e22b5	tracing: Consolidate tracing into a new katatrace package Removes custom trace functions defined across the repo and creates a single trace function in a new katatrace package. Also moves span tag management into this package and provides a function to dynamically add a tag at runtime, such as a container id, etc. Fixes #1162 Signed-off-by: Benjamin Porter <bporter816@gmail.com>	2021-07-11 14:19:51 -05:00
Eric Ernst	064dfb164b	runtime: Add "watchable-mounts" concept for inotify support To workaround virtiofs' lack of inotify support, we'll special case particular mounts which are typically watched, and pass on information to the agent so it can ensure that the mount presented to the container is indeed watchable (see applicable agent commit). This commit will: - identify watchable mounts based on file count and mount source - create a watchable-bind storage object for these mounts to communicate intent to the agent - update the OCI spec to take the updated watchable mount source into account Unit tests added and updated for the newly introduced functionality/functions. Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2021-06-24 10:07:06 -07:00
Eric Ernst	57c0cee0a5	runtime: Cleanup mountSharedDirMounts, shareFile parameters There's no reason to pass the paths; they can be determined when they are actually used. Let's make the return values more comparable to the other mount handling functions (we'll add storage object in future commit), and pass the mount maps as function parameters. ...No functional changes here... Signed-off-by: Eric Ernst <eric_ernst@apple.com>	2021-06-24 10:07:06 -07:00
Chelsea Mafrica	8ca0207281	tracing: Add sandbox and container ID to trace spans Add sandbox, container, and hypervisor IDs to trace spans. Note that some spans in sandbox.go are created with a trace() call from api.go. These spans have additional attributes set after span creation to overwrite the api attributes. Fixes #1878 Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>	2021-06-02 21:53:54 -07:00
Chelsea Mafrica	05a46fede0	tracing: Make runtime span attributes more consistent Span attributes (tags) are not consistent in runtime tracing, so designate and use core attributes such as source, package, subsystem, and type as span metadata for more understandable output. Use WithAttributes() during span creation to reduce calls to SetAttributes(). Modify Trace() in katautils to accept a slice of attributes so multiple functions using different attributes can use it. Fixes #1852 Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>	2021-05-27 10:07:11 -07:00
Hui Zhu	0787ea8073	cgroupsCreate: not set resources to c.config.Resources cgroupsCreate keeps only the CPU resources information and none of the others. Setting it to c.config.Resources clears most of the container's resources. This commit removes the assignment to handle the issue. Fixes: #1758 Signed-off-by: Hui Zhu <teawater@antfin.com>	2021-04-27 16:44:30 +08:00
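
The ordering rule described in commit cddcde1d40 (attach the VFIO control group device /dev/vfio/vfio first, then the per-group devices) could be sketched roughly as follows. This is an illustration only, not the code from that commit: `sortVFIODevices` and `vfioControlDevice` are hypothetical names, and devices are assumed to be represented by plain path strings.

```go
package main

import (
	"fmt"
	"sort"
)

// vfioControlDevice is the control group device path named in the commit message.
const vfioControlDevice = "/dev/vfio/vfio"

// sortVFIODevices is a hypothetical helper, not the Kata runtime function.
// It returns a copy of paths with the control device moved to the front;
// the relative order of all other devices is preserved by the stable sort.
func sortVFIODevices(paths []string) []string {
	sorted := append([]string(nil), paths...)
	sort.SliceStable(sorted, func(i, j int) bool {
		// An element sorts earlier only if it is the control device
		// and the element it is compared against is not.
		return sorted[i] == vfioControlDevice && sorted[j] != vfioControlDevice
	})
	return sorted
}

func main() {
	devices := []string{"/dev/vfio/12", "/dev/vfio/vfio", "/dev/vfio/7"}
	fmt.Println(sortVFIODevices(devices))
}
```

A stable sort is used here so that the group devices keep whatever order the caller supplied; whether the real runtime relies on that is not stated in the commit message.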


73 Commits
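
Commit 1684c1962c fixes a panic caused by assigning to an entry in a nil map (CustomSpec.Annotations). The general Go pattern for that kind of fix might look like the sketch below. The `customSpec` type and `addAnnotation` helper are hypothetical stand-ins for illustration and are not taken from the Kata sources.

```go
package main

import "fmt"

// customSpec is a hypothetical stand-in for a spec type that carries
// an Annotations map which may be left nil by its creator.
type customSpec struct {
	Annotations map[string]string
}

// addAnnotation initializes the map on first use. Writing to a nil map
// panics in Go, so the nil check must come before the assignment.
func addAnnotation(s *customSpec, key, value string) {
	if s.Annotations == nil {
		s.Annotations = make(map[string]string)
	}
	s.Annotations[key] = value
}

func main() {
	spec := &customSpec{} // Annotations is nil here
	addAnnotation(spec, "example.key", "example-value")
	fmt.Println(spec.Annotations["example.key"])
}
```

Reading from a nil map is safe in Go and returns the zero value; only writes panic, which is why such bugs tend to surface only on code paths that add an entry.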