As we don't have any CI, nor a maintainer, to keep the ACRN code around,
it's better to remove it than to give users the expectation that it should
or would work at some point.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
In Kubernetes we still do not have proper VM sizing
at sandbox creation time. This KEP tries to mitigate
that: kubernetes/enhancements#4113, but it can take
some time until Kubernetes and containerd or other runtimes
have those changes rolled out.
Before, we used a static config of VFIO ports, and we
introduced CDI support, which needs a patched containerd.
We want to eliminate the patched containerd in the GPU case
as well.
Fixes: #8860
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
First of all, this is a controversial piece, and I know that.
In this commit we're taking a less greedy approach regarding the
number of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by confidential guests.
The current approach basically does the following:
* Gets the number of vCPUs set in the config (an integer)
* Gets the number of vCPUs set as limit (an integer)
* Sums those up
* Starts / Updates the VMM to use that total number of vCPUs
The fact that we're dealing with integers is logical, as we cannot request
500m vCPUs from the VMM. However, in several cases it leads us to
waste one vCPU.
Let's take an example where we know the VMM requires 500m vCPUs to
run, and the workload sets 250m vCPUs as its resource limit.
In that case, we'd do:
* Gets the number of vCPUs set in the config: 1
* Gets the number of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs
With the logic changed here, what we're doing is treating everything
as a float until just before we start / update the VMM. So, the flow
described above becomes:
* Gets the number of vCPUs set in the config: 0.5
* Gets the number of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPU
* Starts / Updates the VMM to use 1 vCPU
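A minimal sketch of the new calculation, in Go; the function and parameter names are illustrative assumptions, not the actual Kata runtime code:

    // Sketch only; names are illustrative, not the actual runtime code.
    package main

    import (
        "fmt"
        "math"
    )

    // vcpusForSandbox keeps the config default (the VMM's own needs) and
    // the workload limit as floats, and rounds up only once, at the end.
    func vcpusForSandbox(defaultVCPUs, limitVCPUs float64) uint32 {
        return uint32(math.Ceil(defaultVCPUs + limitVCPUs))
    }

    func main() {
        // Old logic: 1 + ceil(0.25) = 2 vCPUs.
        // New logic: ceil(0.5 + 0.25) = 1 vCPU.
        fmt.Println(vcpusForSandbox(0.5, 0.25)) // prints 1
    }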
The way I've written this patch introduces zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see Firecracker, or any
other user of `static_sandbox_resource_mgmt=true`, taking advantage of
this).
There is, though, an implicit assumption in this patch that we need to
make explicit: default_vcpus / default_memory is the amount of
vCPUs / memory required by the VMM, and absolutely nothing
else. Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.
One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.
Fixes: #6909
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Remove HotplugVFIOonRootBus, which is obsolete with the latest PCI
topology changes. Users can set cold_plug_vfio or hot_plug_vfio either
in the configuration.toml or via annotations.
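For illustration, a configuration.toml setting could look like this (the values shown here are assumptions for illustration, not mandated by this change):

    cold_plug_vfio = "root-port"
    hot_plug_vfio = "no-port"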
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Remove the configuration of PCIeRootPort and PCIeSwitchPort; those
values can be deduced in createPCIeTopology.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Pass the SELinux policy for containers to the agent if `disable_guest_selinux`
is set to `false` in the runtime configuration. The `container_t` type
is applied to the container process inside the guest by default.
Users can also set a custom SELinux label for the container process using
`guest_selinux_label` in the runtime configuration. This is an
alternative to configuring SELinux via Kubernetes' security context,
because users cannot specify the policy for Kata through Kubernetes'
security context. To apply an SELinux policy to the container, the guest
rootfs must be CentOS, created and built with `SELINUX=yes`.
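For example (the label value below only illustrates the expected format; it is not a value this change prescribes):

    disable_guest_selinux = false
    guest_selinux_label = "system_u:system_r:container_t"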
Fixes: #4812
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Rather than have the device package depend on persist, let's define the
(almost duplicate) structures within device itself, and have the Kata
Containers persist pkg import these.
This'll help avoid unnecessary dependencies within our core packages.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.
The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.
Fixes: #2266
Signed-off-by: Feng Wang <feng.wang@databricks.com>
We are converting the Network structure into an interface, so that
different host OSes can have different networking implementations for
Kata.
One step in that direction is to rename all the Network structure
fields and methods to something that is less specific to Linux network
namespaces. This will make the Network interface naming consistent.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
The `enable_swap` option was added a long time ago to add
`-realtime mlock=off` to QEMU's command line.
Kata now supports QEMU 6, the `-realtime` option has been deprecated,
and `mlock=on` causes unexpected behaviors in Kata.
This patch removes support for `enable_swap`, `-realtime` and `mlock=`
since they are causing bugs in Kata.
Signed-off-by: Julio Montes <julio.montes@intel.com>
Today the hypervisor code in vc relies on the persist pkg for two things:
1. To get the VM/run store path on the host filesystem,
2. For the type definitions of the Load/Save functions of the hypervisor
interface.
For (1), we can simply remove the store interface from the hypervisor
config and replace it with just the path, since this is all we really
need. When we create a NewHypervisor structure, outside of the
hypervisor, we can populate this path.
For (2), rather than have the persist pkg define the structure, let the
hypervisor code (soon to be a pkg) define it. The persist
API already needs to call into the hypervisor anyway.
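As a rough sketch of (1), the hypervisor config carries just the resolved path instead of a store interface (the package and field names below are assumptions for illustration, not the actual code):

    // Sketch only; package and field names are illustrative.
    package hypervisors

    // HypervisorConfig carries just the resolved store path instead of
    // embedding a persist store interface; the caller that constructs
    // the hypervisor populates it.
    type HypervisorConfig struct {
        RunStorePath string
    }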
We'll probably want to look at following a similar pattern for other parts
of vc that we want to make independent of the persist API.
In doing this, we started an initial hypervisors pkg to hold these
types (avoiding a circular dependency between virtcontainers and the persist
pkg). The next step will be to remove all other dependencies and move the
hypervisor-specific code into this pkg, and out of virtcontainers.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The fact that we need to "bridge" the endpoint is a bit irrelevant. To
be consistent with the rest of the endpoints, let's just call this
"macvlan".
Fixes: #3050
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The new API is based on containerd's cgroups package.
With that conversion we can simplify the virtcontainers sandbox code and
also unify our external cgroups API dependency. We now only depend
on containerd/cgroups for everything cgroups related.
Depends-on: github.com/kata-containers/tests#3805
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Without this, if the shim dies, we will not have a reliable way to
identify what mounts should be cleaned up if `containerd-shim-kata-v2
cleanup` is called for the sandbox.
Before this, if you `ctr run` with a sandbox bindmount defined and SIGKILL the
containerd-shim-kata-v2, you'll notice the sandbox bindmount left on the
host.
With this change, the shim is able to get the sandbox bindmount
information from disk and do the appropriate cleanup.
Fixes: #1896
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
It would be undesirable to be given an annotation like "/dev/null".
Filter out bad annotation values.
Fixes: #1043
Suggested-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
The property name makes newcomers confused when reading the code.
Since in Kata Containers 2.0 there will only be one type of store,
it's safe to simply replace it with `store`.
Fixes: #1660
Signed-off-by: bin <bin@hyper.sh>
In PR #1079, CleanupContainer's sandboxID parameter was changed to VCSandbox,
but at cleanup time no VCSandbox is constructed, so we should load it from disk
via loadSandboxConfig() in persist.go. This commit reverts parts of #1079.
Fixes: #1119
Signed-off-by: bin liu <bin@hyper.sh>
Add a field "enable_annotations" to the runtime configuration that can
be used to whitelist annotations using a list of regular expressions,
which are used to match any part of the base annotation name, i.e. the
part after "io.katacontainers.config.hypervisor."
For example, the following configuration will match "virtio_fs_daemon",
"initrd" and "jailer_path", but not "path" nor "firmware":
enable_annotations = [ "virtio.*", "initrd", "_path" ]
The default is an empty list of enabled annotations, which disables
annotations entirely.
If an annotation is rejected, the message is something like:
annotation io.katacontainers.config.hypervisor.virtio_fs_daemon is not enabled
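Conceptually, the check boils down to the following sketch in Go (the helper and variable names are illustrative assumptions, not the actual runtime implementation):

    // Sketch only; not the actual Kata runtime code.
    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    const prefix = "io.katacontainers.config.hypervisor."

    // annotationEnabled reports whether the base annotation name matches
    // any of the enable_annotations regular expressions (unanchored, so a
    // pattern may match any part of the base name).
    func annotationEnabled(name string, enabled []string) bool {
        base := strings.TrimPrefix(name, prefix)
        for _, pattern := range enabled {
            if ok, err := regexp.MatchString(pattern, base); err == nil && ok {
                return true
            }
        }
        return false
    }

    func main() {
        enabled := []string{"virtio.*", "initrd", "_path"}
        fmt.Println(annotationEnabled(prefix+"virtio_fs_daemon", enabled)) // true
        fmt.Println(annotationEnabled(prefix+"firmware", enabled))         // false
    }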
Fixes: #901
Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
This one could theoretically be used to overwrite data on the host.
It seems somewhat less risky than the earlier ones for a number
of reasons, but worth protecting a little anyway.
Fixes: #901
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
This also adds an annotation for ctlpath, which was not present
before. It's better to implement the code consistently right now to make
sure that we don't end up with a leaky implementation tacked on later.
Fixes: #901
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
The jailer_path annotation can be used to execute arbitrary code on
the host. Add a jailer_path_list configuration entry providing a list
of regular expressions that can be used to filter annotations that
represent valid file names.
Fixes: #901
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Sending the virtio_fs_daemon annotation can be used to execute
arbitrary code on the host. In order to prevent this, restrict the
values of the annotation to a list provided by the configuration
file.
Fixes: #901
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Support the `sgx.intel.com/epc` annotation that is defined by the Intel
k8s plugin. This annotation enables SGX hardware-based isolation and
memory encryption.
For example, use `sgx.intel.com/epc = "64Mi"` to create a container
with 1 EPC section with pre-allocated memory.
At the time of writing this patch, the SGX patches have not landed in the
upstream Linux kernel.
The following GitHub kernel fork contains all the SGX patches for the
host and guest: https://github.com/intel/kvm-sgx
Fixes: #483
Signed-off-by: Julio Montes <julio.montes@intel.com>
With Kata Containers moving to 2.0, (hybrid-)vsock will be the only
way to communicate directly between host and agent,
and kata-proxy, as an additional component handling the multiplexing on
the serial port, is no longer needed.
Clean up the related unit tests, and also add another mock socket type,
`MockHybridVSock`, to deal with the ttrpc-based hybrid-vsock mock server.
Fixes: #389
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
[ port from runtime commit 0100af18a2afdd6dfcc95129ec6237ba4915b3e5 ]
To control whether the guest can enable/disable some CPU features, e.g. pmu=off,
vmx=off. As discussed in the thread [1], the best approach is to let users
specify them, so add a new option in the configuration file.
Currently this patch only supports this option in QEMU, no other VMM.
[1] https://github.com/kata-containers/runtime/pull/2559#issuecomment-603998256
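As an illustration, such an option could be set in the QEMU section of configuration.toml along these lines (the option name and values here are illustrative, not fixed by this commit):

    [hypervisor.qemu]
    cpu_features = "pmu=off,vmx=off"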
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Add a configuration option / annotation for network I/O throttling at the VM level.
rx_rate_limiter_max_rate controls the network inbound
bandwidth per pod.
tx_rate_limiter_max_rate controls the network outbound
bandwidth per pod.
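For example, in configuration.toml (the values below are placeholders for illustration; check the option documentation for the exact unit):

    rx_rate_limiter_max_rate = 10000000
    tx_rate_limiter_max_rate = 10000000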
Fixes: #250
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
To use the kata-containers repo path.
Most of the change was generated by this script:
find . -type f -name "*.go" |xargs sed -i -e \
's|github.com/kata-containers/runtime|github.com/kata-containers/kata-containers/src/runtime|g'
Fixes: #201
Signed-off-by: Peng Tao <bergwolf@hyper.sh>