Commit Graph

1241 Commits

Author SHA1 Message Date
Fabiano Fidêncio
bb4c51a5e0
Merge pull request #8494 from ChengyuZhu6/kata_virtual_volume
runtime: Pass `KataVirtualVolume` to the guest as devices in go runtime
2023-11-27 16:02:28 +01:00
ChengyuZhu6
5318afe273 runtime: support to create VirtualVolume rootfs storages
1) Creating storage for all `io.katacontainers.volume=` messages in rootFs.Options,
and then aggregates all storages  into `containerStorages`.
2) Creating storage for other data volumes and push them into `volumeStorages`.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2023-11-23 23:22:55 +08:00
ChengyuZhu6
0b4f7c2ee7 runtime: redefine and add functions to handle VirtualVolume to storage
1) Extract function `handleBlockVolume` to create Storage only.
2) Add functions to handle KataVirtualVolume device and construct
   corresponding storages.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2023-11-23 23:07:32 +08:00
ChengyuZhu6
bd099fbda9 runtime: extend SharedFile to support mutiple storage devices
To enhance the construction and administration of `Katavirtualvolume` storages,
this commit expands the 'sharedFile' structure to manage both
rootfs storages(`containerStorages`) including `Katavirtualvolume` and other data volumes storages(`volumeStorages`).

NOTE: `volumeStorages` is intended for future extensions to support Kubernetes data volumes.
Currently, `KataVirtualVolume` is exclusively employed for container rootfs, hence only `containerStorages` is actively utilized.

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2023-11-23 23:05:14 +08:00
ChengyuZhu6
e4f33ac141 runtime: add functions to create devices in KataVirtualVolume
The snapshotter will place `KataVirtualVolume` information
into 'rootfs.options' and commence with the prefix 'io.katacontainers.volume='.
The purpose of this commit is to transform the encapsulated KataVirtualVolume data into device information.

Fixes: #8495

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Feng Wang <feng.wang@databricks.com>
Co-authored-by: Samuel Ortiz <sameo@linux.intel.com>
Co-authored-by: Wedson Almeida Filho <walmeida@microsoft.com>
2023-11-23 23:05:13 +08:00
Dan Mihai
756022787c
Merge pull request #8239 from Sumynwa/sumsharma/fix_configmap_update_propagation
runtime: Fix configmap/secrets updates with FS sharing disabled
2023-11-23 06:50:53 -08:00
Fabiano Fidêncio
9445a967b6
Merge pull request #8471 from ChengyuZhu6/kata-virtual-volume
runtime: Introduce `KataVirtualVolume` structure into go runtime
2023-11-20 21:58:27 +01:00
ChengyuZhu6
1353b14e6c runtime: Add KataVirtualVolume struct in runtime
Add the corresponding data structure in the runtime part according to
kata-containers/kata-containers/pull/7698.

Fixes: #8472

Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
2023-11-19 13:30:32 +08:00
Pradipta Banerjee
39e8c84269 runtime: Add support for key annotations to remote hyp
In order to support different pod VM instance type via
remote hypervisor implementation (cloud-api-adaptor),
we need to pass machine_type, default_vcpus
and default_memory annotations to cloud-api-adaptor.

The cloud-api-adaptor then uses these annotations to spin
up the appropriate cloud instance.

Reference PR for cloud-api-adaptor
https://github.com/confidential-containers/cloud-api-adaptor/pull/1088

Fixes: #7140
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
(based on commit 004f07f076)
2023-11-17 13:33:27 +00:00
Lei Li
50e0d43dad runtime: Support privileged containers in peer pod VM
This patch fixes the issue of running containers
with privileged as true.

See the discussion at this URL for the details.
https://github.com/confidential-containers/cloud-api-adaptor/issues/111

Signed-off-by: Lei Li <cdlleili@cn.ibm.com>
(based on commit c3e6b66051)
2023-11-17 13:32:52 +00:00
Yohei Ueda
57d4dd8e57 runtime: Support the remote hypervisor type
This patch adds the support of the remote hypervisor type.
Shim opens a Unix domain socket specified in the config file,
and sends TTPRC requests to a external process to control
sandbox VMs.

Fixes #4482

Co-authored-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit f9278f22c3)
2023-11-17 13:32:49 +00:00
Sumedh Alok Sharma
4aaf54bdad runtime: Fix configmap/secrets update propagation with FS sharing disabled
This PR fixes k8's configmap/secrets etc update propagation when filesystem sharing is disabled.
The commit introduces below changes with some limitations:
- creates new timestamped directory in guest
- updates the '..data' symlink
- creates user visible symlinks to newly created secrets.
- Limitation: The older timestamped directory and stale user visible symlinks exist in guest
  due to missing DELETE api in agent.

Fixes: #7398

Signed-off-by: Sumedh Alok Sharma <sumsharma@microsoft.com>
2023-11-17 13:01:23 +05:30
Liu Wenyuan
c77e990c3e tests: Enable tests for StratoVirt hypervisor
This commit enables StratoVirt hypervisor to be tested in kata GHA,
incluing k8s, metrics, cri-containerd, nydus and so on.

Meanwhile, adding some unit tests for StratoVirt to make sure it works.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:26 +08:00
Liu Wenyuan
26966c8469 virtcontainers: Add StratoVirt as a supported hypervisor
Initial support of the MicroVM machine type of StratoVirt
hypervisor for the kata go runtime.

Fixes: #7794

Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
2023-11-16 20:47:24 +08:00
Fabiano Fidêncio
5e9cf75937 vc: utils: Rename CalculateMilliCPUs() to CalculateCPUsF()
With the change done in the last commit, instead of calculating milli
cpus, we're actually converting the CPUs to a fraction number, a float.

Let's update the function name (and associated vars) to represent that
change.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-11-10 18:26:01 +01:00
Fabiano Fidêncio
e477ed0e86 runtime: Improve vCPU allocation for the VMMs
First of all, this is a controversial piece, and I know that.

In this commit we're trying to make a less greedy approach regards the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.

The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs

The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs.  However, it leads us to, in several cases, be
wasting one vCPU.

Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.

In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs

With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
describe above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPUs
* Starts / Updates the VMM to use 1 vCPUs

In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).

There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else.  Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.

One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result.  It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.

Fixes: #6909

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2023-11-10 18:25:57 +01:00
Fabiano Fidêncio
b0157ad73a runtime: confidential: Do not set the max_vcpu to cpu
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-11-10 12:58:20 +01:00
Archana Shinde
1611723465
Merge pull request #8379 from likebreath/1103/clh_v36.0
Upgrade to Cloud Hypervisor v36.0
2023-11-08 21:10:41 -08:00
Archana Shinde
92a517156c
Merge pull request #8367 from amshinde/add-nerdctl-ipvlan-test
network: Fix network hotplug for ipvlan and macvlan endpoints for qemu and add tests
2023-11-08 11:45:13 -08:00
Archana Shinde
a6272733e7 network: Fix network hotplug for ipvlan and macvlan endpoints.
Since moving from network coldplug to hotplug, the only case verified
was veth endpoints. Support for network hotplug for ipvlan and macvlan was
broken/not added. Fix it.

Fixes: #8391

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2023-11-07 10:13:51 -08:00
Beraldo Leal
7641c19f74 runtime: bump containerd for gogo deprecation
This update includes necessary changes due to the version bump of
containerd and its dependencies. It's part of a broader initiative to
phase out gogo protobuf, which has been deprecated, and to align with
the current supported libraries.

Fixes #7420.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:59 +00:00
Beraldo Leal
16fa2c39e6 protocols: replace gogo/types.Empty and Any
by Google versions.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:58 +00:00
Beraldo Leal
5d88c78a6e protocols: generating agent.pb.go
a3b003c345 modified agent but agent.pb.go
was not updated.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-11-06 16:49:58 +00:00
Bo Chen
071667f1ca runtime: clh: Re-generate the client code
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.

Fixes: #8378

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-11-03 10:47:06 -07:00
Fabiano Fidêncio
40cc397218
Merge pull request #8255 from cmaf/migrate-checks-fixes-links
docs: Fix broken links
2023-11-01 14:46:30 +01:00
Archana Shinde
f53f86884f network: Fix network attach for ipvlan and macvlan
We used the approach of cold-plugging network interface for pre-shimv2
support for docker.Since the hotplug approach was not required,
we never really got to implementing hotplug support for certain network
endpoints, ipvlan and macvlan being among them.

Since moving to shimv2 interface as the default for
runtime, we switched to hotplugging the network interface for supporting
docker and nerdctl. This was done for veth endpoints only.

Implement the hot-attach apis for ipvlan and macvlan as well to support
ipvlan and macvlan networks with docker and nerdctl.

Fixes: #8333

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
2023-10-27 21:42:37 -07:00
Chelsea Mafrica
0608e20a01 docs: Fix broken links
Update broken links so that static checks pass.

Fixes #8254

Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
2023-10-26 10:17:01 -07:00
Jianyong Wu
f9c9d8f645 runtime: QemuVirt: hotadd virtio-mem dev to pcie root port
Hotplug virtio-mem device to pcie root port for Qemu Virt.

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Jianyong Wu
ef18c9550c runtime:qemuvirt: hotadd net dev to pcie root port
Hotplug network device to pcie root port as this is the only way on
QemuVirt.

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Jianyong Wu
f1aec98f9d qemu/virt: use pcie_root_port to do device hotplug for virt
ACPI PCI device hotplug on qemu virt is not supported. The only way to
hotplug pci device is pcie native way. Thus we need create pcie root
port as default.

Pcie root port number depends on following:
1. reserved one for network device as default;
2. virtio-mem dev;
3. add enough port for vhost user blk dev;

Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Jianyong Wu
28a41e1d16 runtime: add a new API for Network interface
Add GetEndpointsNum API for Network Interface to get the number of
network endpoints. This is used for caculate the number of pcie root
port for QemuVirt.

Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-10-18 06:35:57 +00:00
Greg Kurz
defbb64ac8
Merge pull request #8036 from rye-stripe/bugfix/overhead-metrics
runtime: fix reading cgroup stats of sandboxes
2023-09-27 19:39:55 +02:00
Bo Chen
dfd0c9fa9a runtime: clh: Re-generate the client code
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.

Fixes: #8057

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-09-25 12:22:37 -07:00
Peteris Rudzusiks
94e2ccc2d5 runtime: fix reading cgroup stats of sandboxes
The cgroup stats come from resourcecontrol package in the form of pointers
to structs. The sandbox Stat() method incorrectly was expecting structs.
This caused the cpu and memory stats to always be 0, which in turn caused
incorrect pod overhead metrics.

Fixes #8035

Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
2023-09-21 17:00:53 +02:00
Alexandru Matei
d507d189bb fc: Add support for noflush cache option
Firecracker supports noflush semantic via Unsafe cache type.
There is no support for direct i/o, remove it from config file

Fixes: #7823

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2023-09-21 14:48:24 +03:00
Alexandru Matei
2ca781518a clh: Direct IO support for block devices
Clh suports direct i/o for disks. It doesn't
offer any support for noflush, removed passing
of option to cloud-hypervisor internal config

Fixes: #7798

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2023-09-21 14:48:24 +03:00
Wainer Moschetta
87e64a07ed
Merge pull request #7979 from beraldoleal/gogo-removal
protocol: remove gogoprotobuff tests
2023-09-20 22:38:10 -03:00
Dan Mihai
82ff2db460 runtime: support kernel params including spaces
Support quoted kernel command line parameters that include space
characters. Example:

dm-mod.create="dm-verity,,,ro,0 736328 verity 1
/dev/vda1 /dev/vda2 4096 4096 92041 0 sha256
f211b9f1921ef726d57a72bf82be23a510076639fa8549ade10f85e214e0ddb4
065c13dfb5b4e0af034685aa5442bddda47b17c182ee44ba55a373835d18a038"

Fixes: #8003

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2023-09-19 20:26:38 +00:00
Beraldo Leal
604a9dd673 protocol: remove gogoprotobuff tests
This is part of a bigger effort to drop gogoprotobuff from our code
base. IIUC, those options are basically used by *pb_test.go, and since
we are dropping gogoprotobuff and those are auto generated tests, let's
just remove it.

Fixes #7978.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-09-19 12:55:42 -04:00
Fabiano Fidêncio
c3ee913bf6
Merge pull request #7953 from gkurz/extra-monitor-socket
runtime/qemu: Rework QMP/HMP support
2023-09-18 19:04:14 +02:00
Jeremi Piotrowski
dfa6af54df
Merge pull request #7806 from jongwu/clh_serial
clh:arm64: use arm AMBA UART for hypervisor debug
2023-09-18 12:29:07 +02:00
Greg Kurz
1f16b6627b runtime/qemu: Rework QMP/HMP support
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.

The HMP monitor allows to temper with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.

The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.

A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".

An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.

While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.

This feature is only made visible in the base and GPU configurations
of QEMU for now.

Fixes #7952

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-18 12:13:01 +02:00
Peng Tao
6eedd9b0b9
Merge pull request #7738 from Xuanqing-Shi/7732/handle-non-empty-endpoints-in-RemoveEndpoints
runtime: incorrect handling of non-empty []Endpoint parameter in Remo…
2023-09-18 10:58:28 +08:00
Jianyong Wu
241c355e07 clh:arm64: use arm AMBA uart for hypervisor debug
cloud hypervisor on arm64 only support arm AMBA UART(pl011) as
tty. So, the console should be set to "ttyAMA0" instead of "ttyS0"
when enable hypervisor debug mode.

Fixes: #5080
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-09-15 01:44:23 +00:00
Jeremi Piotrowski
3a1db7a86b runtime: clh: Support enabling iommu
by enabling IOMMU on the default PCI segment. For hotplug to work we need a
virtualized iommu and clh exposes one if there is some device or PCI segment
that requests it. I would have preferred to add a separate PCI segment for
hotplugging vfio devices but unfortunately kata assumes there is only one
segment all over the place. See create_pci_root_bus_path(),
split_vfio_pci_option() and grep for '0000'.

Enabling the IOMMU on the default PCI segment requires passing enabling IOMMU on
every device that is attached to it, which is why it is sprinkled all over the
place.

CLH does not support IOMMU for VirtioFs, so I've added a non IOMMU segment for
that device.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2023-09-14 14:23:28 +02:00
Jeremi Piotrowski
fc51e4b9eb runtime: Check config for supported CLH (cold|hot)_plug_vfio values
The only supported options are hot_plug_vfio=root-port or no-port.
cold_plug_vfio not supported yet.

Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
2023-09-14 14:23:28 +02:00
Peng Tao
55ca7e8aec
Merge pull request #7907 from Xuanqing-Shi/7876/network-devices-naming-conflict
runtime: Naming conflict of network devices
2023-09-13 19:29:41 +08:00
shixuanqing
1636abbe1c runtime: issue with non-empty []Endpoint in RemoveEndpoints
In the RemoveEndpoints(), when the endpoints paramete isn't empty,
using idx may result in wrong endpoint removals. To improve,
directly passing the endpoint parameter helps
locate the correct elements within n.eps.

Fixes: #7732

Signed-off-by: shixuanqing <1356292400@qq.com>

Fixes: #7732

Signed-off-by: shixuanqing <1356292400@qq.com>

Update src/runtime/virtcontainers/network_linux.go

Co-authored-by: Xuewei Niu <justxuewei@apache.org>
2023-09-13 09:47:18 +00:00
Peng Tao
9766f9090c
Merge pull request #7719 from beraldoleal/nullable
Remove gogoproto.nullable extension
2023-09-13 15:11:56 +08:00
shixuanqing
ca4b6b051d runtime: Naming conflict of network devices
When creating a new endpoint, we check existing endpoint names and automatically adjust the naming of the new endpoint to ensure uniqueness.

Fixes: #7876

Signed-off-by: shixuanqing <1356292400@qq.com>
2023-09-12 04:29:51 +00:00
Fabiano Fidêncio
6cd5d83a37
Merge pull request #7865 from gkurz/fix-more-virtiofs-args
runtime: Fix more virtiofs args
2023-09-09 21:30:16 +02:00
Greg Kurz
72c510d057 runtime/virtiofsd: Drop all references to "--cache=none"
This syntax belongs to the legacy C virtiofsd implementation that
we don't support anymore since kata-containers 3.1.3 because
of other API breaking changes.

People have been warned to switch from "none" to "never" since
kata-containers 2.5.2. Let's officially do that.

The compat code that would convert "none" to "never" isn't
needed anymore. Just drop it.

Fixes #7864

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-08 17:57:30 +02:00
Beraldo Leal
ead724bec1 protocol: removing gogo.nullable feature
gogo.nullable is the main gogo.protobuf' feature used here. Since we are
trying to remove gogo.protobuf, the first reasonable step seems to be
remove this feature. This is a core update, and it will change how the
structs are defined. I could spot only a few places using those structs,
based on make check/build.

Fixes #7723.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-09-08 11:49:01 -04:00
Peng Tao
435e890cd9
Merge pull request #7703 from bergwolf/github/nerdctl-fc
runtime: run prestart hooks before starting VM for FC
2023-09-07 10:55:31 +08:00
Greg Kurz
81536f21af runtime/qemu: Pass "--xattr" to virtiofsd instead of "-o xattr"
The "-o" syntax belongs to the legacy C virtiofsd. It is deprecated
with the rust implementation.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-09-06 17:50:35 +02:00
Peng Tao
2e4c874726 runtime/vc: runPrestartHooks should ignore GetHypervisorPid failure
If we are running FC hypervisor, it is not started when prestart hooks
are executed. So we should just ignore such error and just go ahead and
run the hooks.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-30 03:06:11 +00:00
Peng Tao
21204caf20 runtime: fail early when starting docker container with FC
FC does not support network device hotplug. Let's add a check to fail
early when starting containers created by docker.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-30 02:52:01 +00:00
Peng Tao
32fd013716 runtime: run prestart hooks before starting VM for FC
Add a new hypervisor capability to tell if it supports device hotplug.
If not, we should run prestart hooks before starting new VMs as nerdctl
is using the prestart hooks to set up netns. To make nerdctl + FC
to work, we need to run the prestart hooks before starting new VMs.

Fixes: #6384
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-30 02:52:01 +00:00
Fabiano Fidêncio
d1b54ede29 qemu: tdx: Workaround SMP issue with TDX 1.5
`...,sockets=1,cores=numvcpus,threads=1,...` must be used.

Fixes: #7770

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-08-28 13:41:36 +02:00
Archana Shinde
1e34220c41 qemu: tdx: Adapt to the TDX 1.5 stack
QEMU for TDX 1.5 makes use of private memory map/unmap.
Make changes to govmm to support this. Support for private backing fd
for memory is added as knob to the qemu config.

Userspace's map/unmap operations are done by fallocate() ioctl on the
backing store fd.
Reference:
https://lore.kernel.org/linux-mm/20220519153713.819591-1-chao.p.peng@linux.intel.com/

Fixes: #7770

Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-08-28 13:41:36 +02:00
Peng Tao
18d42da21e runtime/fc: fix image/initrd annotation handling
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.

Make sure annotations are preferred over config options in image and initrd
path handling.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-23 03:47:28 +00:00
Peng Tao
9fda7059a5 runtime/clh: fix image/initrd annotation handling
We should make sure annotations are preferred over
config options in image and initrd path handling.

Fixes: #7705
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-23 03:47:28 +00:00
Peng Tao
1a0092d631 runtime/qemu: fix image/initrd annotation handling
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.

Add a helper function ImageOrInitrdAssetPath to make sure annotations
are preferred over config options in image and initrd path handling.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-08-23 03:47:27 +00:00
Fabiano Fidêncio
e107d1d94e
Merge pull request #7574 from microsoft/danmihai1/policy
agent: runtime: add Agent Policy feature
2023-08-15 11:29:13 +02:00
Chelsea Mafrica
22465d22f0
Merge pull request #7638 from ManaSugi/fix/virtcontainers-doc
docs: Remove installation step in virtcontainers doc
2023-08-14 10:21:57 -07:00
Dan Mihai
ab829d1038 agent: runtime: add the Agent Policy feature
Fixes: #7573

To enable this feature, build your rootfs using AGENT_POLICY=yes. The
default is AGENT_POLICY=no.

Building rootfs using AGENT_POLICY=yes has the following effects:

1. The kata-opa service gets included in the Guest image.

2. The agent gets built using AGENT_POLICY=yes.

After this patch, the shim calls SetPolicy if and only if a Policy
annotation is attached to the sandbox/pod. When creating a sandbox/pod
that doesn't have an attached Policy annotation:

1. If the agent was built using AGENT_POLICY=yes, the new sandbox uses
   the default agent settings, that might include a default Policy too.

2. If the agent was built using AGENT_POLICY=no, the new sandbox is
   executed the same way as before this patch.

Any SetPolicy calls from the shim to the agent fail if the agent was
built using AGENT_POLICY=no.

If the agent was built using AGENT_POLICY=yes:

1. The agent reads the contents of a default policy file during sandbox
   start-up.

2. The agent then connects to the OPA service on localhost and sends
   the default policy to OPA.

3. If the shim calls SetPolicy:

   a. The agent checks if SetPolicy is allowed by the current
      policy (the current policy is typically the default policy
      mentioned above).

   b. If SetPolicy is allowed, the agent deletes the current policy
      from OPA and replaces it with the new policy it received from
      the shim.

   A typical new policy from the shim doesn't allow any future SetPolicy
   calls.

4. For every agent rpc API call, the agent asks OPA if that call
   should be allowed. OPA allows or not a call based on the current
   policy, the name of the agent API, and the API call's inputs. The
   agent rejects any calls that are rejected by OPA.

When building using AGENT_POLICY_DEBUG=yes, additional Policy logging
gets enabled in the agent. In particular, information about the inputs
for agent rpc API calls is logged in /tmp/policy.txt, on the Guest VM.
These inputs can be useful for investigating API calls that might have
been rejected by the Policy. Examples:

1. Load a failing policy file test1.rego on a different machine:

opa run --server --addr 127.0.0.1:8181 test1.rego

2. Collect the API inputs from Guest's /tmp/policy.txt and test on the
   machine where the failing policy has been loaded:

curl -X POST http://localhost:8181/v1/data/agent_policy/CreateContainerRequest \
--data-binary @test1-inputs.json

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2023-08-14 17:07:35 +00:00
Manabu Sugimoto
416445e7eb docs: Remove installation step in virtcontainers doc
Remove the installation step in the virtcontainers doc
because the virtcontainers install/uninstall targets have
been removed by 86723b51ae
and they are not used anymore.

Fixes: #7637

Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2023-08-14 15:15:24 +09:00
Pradipta Banerjee
ab13ef87ee runtime: propagate configmap/secrets etc changes for remote-hyp
For remote hypervisor, the configmap, secrets, downward-api or project-volumes are
copied from host to guest. This patch watches for changes to the host files
and copies the changes to the guest.

Note that configmap updates takes significantly longer than updates via downward-api.
This is similar across runc and Kata runtimes.

Fixes: #7210

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: Julien Ropé <jrope@redhat.com>
(cherry picked from commit 3081cd5f8e)
(cherry picked from commit 68ec673bc4d9cd853eee51b21a0e91fcec149aad)
2023-08-11 16:31:08 +01:00
Yohei Ueda
c074ec4df1 runtime: Copy shared files recursively
This patch enables recursive file copying
when filesystem sharing is not used.

Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
(cherry picked from commit 5422a056f2)
(cherry picked from commit 16055ce040bbd724be2916bc518d89b69c9e0ca5)

Fixes: #7210
2023-08-11 16:16:52 +01:00
Manabu Sugimoto
cc922be5ec versions: Update firecracker version to 1.4.0
This patch upgrades Firecracker version from v1.1.0 to v1.4.0.

* Generate swagger models for v1.4.0 (from `firecracker.yaml`)
  - The version of go-swagger used is v0.30.0
* The firecracker v1.4.0 includes the following changes.
  - Added
    * Added support for custom CPU templates allowing users to adjust vCPU features
    exposed to the guest via CPUID, MSRs and ARM registers.
    * Introduced V1N1 static CPU template for ARM to represent Neoverse V1 CPU
    as Neoverse N1.
    * Added support for the virtio-rng entropy device. The device is optional. A
    single device can be enabled per VM using the /entropy endpoint.
    * Added a cpu-template-helper tool for assisting with creating and managing
    custom CPU templates.
  - Changed
    * Set FDP_EXCPTN_ONLY bit (CPUID.7h.0:EBX[6]) and ZERO_FCS_FDS bit
    (CPUID.7h.0:EBX[13]) in Intel's CPUID normalization process.
  - Fixed
    * Fixed feature flags in T2S CPU template on Intel Ice Lake.
    * Fixed CPUID leaf 0xb to be exposed to guests running on AMD host.
    * Fixed a performance regression in the jailer logic for closing open file
    descriptors.
    * A race condition that has been identified between the API thread and the VMM
    thread due to a misconfiguration of the api_event_fd.
    * Fixed CPUID leaf 0x1 to disable perfmon and debug feature on x86 host.
    * Fixed passing through cache information from host in CPUID leaf 0x80000006.
    * Fixed the T2S CPU template to set the RRSBA bit of the IA32_ARCH_CAPABILITIES
    MSR to 1 in accordance with an Intel microcode update.
    * Fixed the T2CL CPU template to pass through the RSBA and RRSBA bits of the
    IA32_ARCH_CAPABILITIES MSR from the host in accordance with an Intel microcode
    update.
    * Fixed passing through cache information from host in CPUID leaf 0x80000005.
    * Fixed the T2A CPU template to disable SVM (nested virtualization).
    * Fixed the T2A CPU template to set EferLmsleUnsupported bit
    (CPUID.80000008h:EBX[20]), which indicates that EFER[LMSLE] is not supported.

Fixes: #7610

Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2023-08-10 16:48:13 +09:00
Wedson Almeida Filho
4fbe0a3a53 runtime: bind-mount mounted block device into container
When the mounted block device isn't a layer, we want to mount it into
containers, but since it's already mounted with the correct fs (e.g.,
tar, ext4, etc.) in the pod, we just bind-mount it into the container.

Fixes: #7536

Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
2023-08-03 17:58:39 -03:00
Wedson Almeida Filho
7e1b1949d4 runtime: add support for kata overlays
When at least one `io.katacontainers.fs-opt.layer` option is added to
the rootfs, it gets inserted into the VM as a layer, and the file system
is mounted as an overlay of all layers using the overlayfs driver.

Additionally, if the `io.katacontainers.fs-opt.block_device=file` option
is present in a layer, it is mounted as a block device backed by a file
on the host.

Fixes: #7536

Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
2023-08-03 17:58:39 -03:00
Zvonko Kaiser
cddcde1d40 vfio: Fix vfio device ordering
If modeVFIO is enabled we need 1st to attach the VFIO control group
device /dev/vfio/vfio an 2nd the actuall device(s) afterwards.Sort the
devices starting with device #1 being the VFIO control group device and
the next the actuall device(s)
/dev/vfio/<group>

Fixes: #7493

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-31 11:26:27 +00:00
Zvonko Kaiser
1fc715bc65 s390x: Add AP Attach/Detach test
Now that we have propper AP device support add a
unit test for testing the correct Attach/Detach of AP devices.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-23 13:44:19 +00:00
Zvonko Kaiser
545de5042a vfio: Fix tests
Now with more elaborate checking of cold|hot plug ports
we needed to update some of the tests.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 13:42:44 +00:00
Zvonko Kaiser
62aa6750ec vfio: Added better handling of VFIO Control Devices
Depending on the vfio_mode we need to mount the
VFIO control device additionally into the container.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 13:42:42 +00:00
Zvonko Kaiser
dd422ccb69 vfio: Remove obsolete HotplugVFIOonRootBus
Removing HotplugVFIOonRootBus which is obsolete with the latest PCI
topology changes, users can set cold_plug_vfio or hot_plug_vfio either
in the configuration.toml or via annotations.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 07:25:40 +00:00
Zvonko Kaiser
114542e2ba s390x: Fixing device.Bus assignment
The device.Bus was reset if a specific combination of
configuration parameters were not met. With the new
PCIe topology this should not happen anymore

Fixes: #7381

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-07-20 07:24:26 +00:00
Peng Tao
581be92b25
Merge pull request #4492 from zvonkok/pcie-topology
runtime: fix PCIe topology for GPUDirect use-case
2023-07-03 09:17:12 +08:00
Fabiano Fidêncio
6a21e20c63 runtime: Add "none" as a shared_fs option
Currently, even when using devmapper, if the VMM supports virtio-fs /
virtio-9p, that's used to share a few files between the host and the
guest.

This *needed*, as we need to share with the guest contents like secrets,
certificates, and configurations, via Kubernetes objects like configMaps
or secrets, and those are rotated and must be updated into the guest
whenever the rotation happens.

However, there are still use-cases users can live with just copying
those files into the guest at the pod creation time, and for those
there's absolutely no need to have a shared filesystem process running
with no extra obvious benefit, consuming memory and even increasing the
attack surface used by Kata Containers.

For the case mentioned above, we should allow users, making it very
clear which limitations it'll bring, to run Kata Containers with
devmapper without actually having to use a shared file system, which is
already the approach taken when using Firecracker as the VMM.

Fixes: #7207

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-06-30 20:45:00 +02:00
Greg Kurz
a43ea24dfc virtiofsd: Convert legacy -o sub-options to their -- replacement
The `-o` option is the legacy way to configure virtiofsd, inherited
from the C implementation. The rust implementation honours it for
compatibility but it logs deprecation warnings.

Let's use the replacement options in the go shim code. Also drop
references to `-o` from the configuration TOML file.

Fixes #7111

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-06-16 11:42:54 +02:00
Greg Kurz
8e00dc6944 virtiofsd: Drop -o no_posix_lock
The C implementation of virtiofsd had some kind of limited support
for remote POSIX locks that was causing some workflows to fail with
kata. Commit 432f9bea6e hard coded `-o no_posix_lock` in order
to enforce guest local POSIX locks and avoid the issues.

We've switched to the rust implementation of virtiofsd since then,
but it emits a warning about `-o` being deprecated.

According to https://gitlab.com/virtio-fs/virtiofsd/-/issues/53 :

   The C implementation of the daemon has limited support for
   remote POSIX locks, restricted exclusively to non-blocking
   operations. We tried to implement the same level of
   functionality in #2, but we finally decided against it because,
   in practice most applications will fail if non-blocking
   operations aren't supported.

   Implementing support for non-blocking isn't trivial and will
   probably require extending the kernel interface before we can
   even start working on the daemon side.

There is thus no justification to pass `-o no_posix_lock` anymore.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-06-16 11:42:39 +02:00
Greg Kurz
2a15ad9788 virtiofsd: Stop using deprecated -f option
The rust implementation of virtiofsd always runs foreground and
spits a deprecation warning when `-f` is passed.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-06-16 10:30:40 +02:00
Zvonko Kaiser
72f2cb84e6 gpu: Reset cold or hot plug after overriding
If we override the cold, hot plug with an annotation
we need to reset the other plugging mechanism to NoPort
otherwise both will be enabled.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-15 17:51:01 +00:00
Zvonko Kaiser
fbacc09646 gpu: PCIe topology, consider vhost-user-block in Virt
In Virt the vhost-user-block is an PCIe device so
we need to make sure to consider it as well. We're keeping
track of vhost-user-block devices and deduce the correct
amount of PCIe root ports.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-15 17:39:55 +00:00
Zvonko Kaiser
b11246c3aa gpu: Various fixes for virt machine type
The PCI qom path was not deduced correctly added regex for correct
path walking.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:33:57 +00:00
Zvonko Kaiser
40101ea7db vfio: Added annotation for hot(cold) plug
Now it is possible to configure the PCIe topology via annotations
and addded a simple test, checking for Invalid and RootPort

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
8f0d4e2612 vfio: Cleanup of Cold and Hot Plug
Removed the configuration of PCIeRootPort and PCIeSwitchPort, those
values can be deduced in createPCIeTopology

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b5c4677e0e vfio: Rearrange the bus assignemnt
Refactor the bus assignment so that the call to GetAllVFIODevicesFromIOMMUGroup
can be used by any module without affecting the topology.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
b1aa8c8a24 gpu: Moved the PCIe configs to drivers
The hypervisor_state file was the wrong location for the PCIe Port
settings, moved everything under device umbrella, where it can be
consumed more easily and we do not get into circular deps.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
55a66eb7fb gpu: Add config to TOML
Update cold-plug and hot-plug setting to include bridge, root and
switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
da42801c38 gpu: Add config settings tests for hot-plug
Updated all references and config settings for hot-plug to match
cold-plug

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Zvonko Kaiser
de39fb7d38 runtime: Add support for GPUDirect and GPUDirect RDMA PCIe topology
Fixes: #4491

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-06-14 08:20:24 +00:00
Yushuo
aaa96c749b feat(runtime-rs): modify onlineCpuMemRequest
Some vmms, such as dragonball, will actively help us
perform online cpu operations when doing cpu hotplug.
Under the old onlineCpuMem interface, it is difficult
to adapt to this situation.

So we modify the semantics of nb_cpus in onlineCpuMemRequest.
In the original semantics, nb_cpus represents the number of
newly added CPUs that need to be online. The modified
semantics become that the number of online CPUs in the guest
needs to be guaranteed.

Fixes: #5030

Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
2023-06-12 17:53:16 +08:00
Beraldo Leal
0e47cfc4c7 runtime: sending SIGKILL to qemu
There is a race condition when virtiofsd is killed without finishing all
the clients. Because of that, when a pod is stopped, QEMU detects
virtiofsd is gone, which is legitimate.

Sending a SIGTERM first before killing could introduce some latency
during the shutdown.

Fixes #6757.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
2023-05-24 11:31:28 -04:00
Fabiano Fidêncio
9aae333343
Merge pull request #6871 from kmjohansen/bugfix/ptmx
runtime: make debug console work with sandbox_cgroup_only
2023-05-23 22:24:51 +02:00
Archana Shinde
2c9efbe04c
Merge pull request #6907 from likebreath/0519/clh_v32.0
Upgrade to Cloud Hypervisor v32.0
2023-05-22 09:53:05 -07:00
Bo Chen
35c3d7b4bc runtime: clh: Re-generate the client code
This patch re-generates the client code for Cloud Hypervisor v32.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.

Fixes: #6632

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-05-19 12:49:45 -07:00
Krister Johansen
eff6ed2d5f runtime: make debug console work with sandbox_cgroup_only
If a hypervisor debug console is enabled and sandbox_cgroup_only is set,
the hypervisor can fail to open /dev/ptmx, which prevents the sandbox
from launching.

This is caused by the absence of a device cgroup entry to allow access
to /dev/ptmx.  When sandbox_cgroup_only is not set, the hypervisor
inherits the default unrestrcited device cgroup, but with it enabled it
runs into allow / deny list restrictions.

Fix by adding an allowlist entry for /dev/ptmx when debug is enabled,
sandbox_cgroup_only is true, and no /dev/ptmx is already in the list of
devices.

Fixes: #6870

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
2023-05-18 10:36:24 -07:00
Gabriela Cervantes
11a34a72e2 docs: Update container network model url
This PR updates the container network model url that is part of the
virtcontainers documentation.

Fixes #6889

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
2023-05-18 15:08:08 +00:00
Salvador Fuentes
b76058c979
Merge pull request #6721 from nedsouza/virtcontainers-qemu-go-coverage
virtcontainers/qemu_test.go: Improve coverage
2023-05-16 11:11:43 -06:00
James O. D. Hunt
a96fcfd5be
Merge pull request #6735 from nedsouza/258/tests-coverage-compatoci
virtcontainers/pkg/compatoci/: Improved coverage for  for Kata 2.0
2023-05-16 15:36:35 +01:00
Tamas K Lengyel
20cb875087 virtcontainers/qemu_test.go: Improve test coverage
Rework TestQemuCreateVM routine to be a table driven test with
various config variations passed to it. After CreateVM a handful
of additional functions are exercised to improve code-coverage.
Also add partial coverage for StartVM routine.

Currently improving from 19.7% to 35.7%

Credit PR to Hackathon Team3

Fixes: #267

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
2023-05-15 15:26:35 -04:00
LiuWeijie
50cc9c582f tests: Improve coverage for virtcontainers/pkg/compatoci/ for Kata 2.0
Add test cases for ParseConfigJson function and GetContainerSpec function

Fixes: #258

Signed-off-by: LiuWeijie <weijie.liu@intel.com>
2023-05-15 11:58:17 +08:00
Archana Shinde
32b39ee347
Merge pull request #6763 from nedsouza/266/tests_coverage_virtcontainers_fc
virtcontainers: Improved test coverage for fc.go from 4.6% to 18.5%
2023-05-12 11:53:27 -07:00
Feng Wang
4e0dce6802
Merge pull request #6738 from fengwang666/oss-fix-fd-leak
runtime: Fix virtiofs fd leak
2023-05-08 10:52:36 -07:00
Eduardo Berrocal
a4c0303d89 virtcontainers: Fixed static checks for improved test coverage for fc.go
Expanded tests on fc_test.go to cover more lines of code. Coverage went from 4.6% to 18.5%.
Fixed very simple static check fail on line 202.

Fixes: #266

Signed-off-by: Eduardo Berrocal <eduardo.berrocal@intel.com>
2023-05-07 00:17:36 -07:00
Peng Tao
65670e6b0a
Merge pull request #6699 from zvonkok/cold-plug-vfio
gpu: cold plug VFIO devices
2023-05-05 10:04:29 +08:00
Archana Shinde
9443c4aea7
Merge pull request #6729 from nedsouza/259/tests_coverage_virtcontainers_persist
virtcontainers/persist: Improved test coverage 65% to 87.5%
2023-05-04 16:18:55 -07:00
Archana Shinde
09134c30de
Merge pull request #6737 from nedsouza/265/virtcontainers-clh-go-coverage
virtcontainers/clh_test.go: improve unit test coverage
2023-05-04 16:15:43 -07:00
Zvonko Kaiser
13d7f39c71 gpu: Check for VFIO port assignments
Bailing out early if the port is wrong, allowed port settings are
no-port, root-port, switch-port

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-05-03 12:32:33 +00:00
Eduardo Berrocal
03a8cd69c2 virtcontainers: Improved test coverage for fc.go from 4.6% to 18.5%
Expanded tests on fc_test.go to cover more lines of code. Coverage went from 4.6% to 18.5%.

Fixes: #266

Signed-off-by: Eduardo Berrocal <eduardo.berrocal@intel.com>
2023-04-28 15:40:45 -07:00
Eduardo Berrocal
6bf1fc6051 virtcontainers/factory: Improved test coverage
Expanded tests on factory_test.go to cover more lines of code. Coverage went from 34% to 41.5% in the case of user-mode run tests,
and from 77.7% to 84% in the case of priviledge-mode run tests.

Fixes: #260

Signed-off-by: Eduardo Berrocal <eduardo.berrocal@intel.com>
2023-04-27 13:08:35 -07:00
Zvonko Kaiser
f7ad75cb12 gpu: Cold-plug extend the api.md
Make the hypervisorconfig consistent in code and api.md

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-27 09:35:05 +00:00
Zvonko Kaiser
0fec2e6986 gpu: Add cold-plug test
Cold plug setting is now correctly decoded in toml

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-27 09:30:24 +00:00
Feng Wang
205909fbed runtime: Fix virtiofs fd leak
The kata runtime invokes removeStaleVirtiofsShareMounts after
a container is stopped to clean up the stale virtiofs file caches.

Fixes: #6455
Signed-off-by: Feng Wang <fwang@confluent.io>
2023-04-26 15:53:39 -07:00
Tamas K Lengyel
0f45b0faa9 virtcontainers/clh_test.go: improve unit test coverage
Credit PR to Hackathon Team3

Fixes: #265

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
2023-04-26 19:12:51 +00:00
Zvonko Kaiser
dded731db3 gpu: Add OVMF setting for MMIO aperture
The default size of OVMFs aperture is too low to
initialized PCIe devices with huge BARs

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
2a830177ca gpu: Add fwcfg helper function
Added driver util function for easier handling of VFIO
devices outside of the VFIO module. At the sandbox level
we may need to set options depending if we have a VFIO/PCIe
device, like the fwCfg for confiential guests.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
c8cf7ed3bc gpu: Add ColdPlug of VFIO devices with devManager
If we have a VFIO device and cold-plug is enabled
we mark each device as ColdPlug=true and let the VFIO
module do the attaching.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
e2b5e7f73b gpu: Add Rawdevices to hypervisor
RawDevics are used to get PCIe device info early before the sandbox
is started to make better PCIe topology decisions

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Zvonko Kaiser
377ebc2ad1 gpu: Add configuration option for cold-plug VFIO
Users can set cold-plug="root-port" to cold plug a VFIO device in QEMU

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2023-04-26 09:47:37 +00:00
Eduardo Berrocal
9c38204f13 virtcontainers/persist: Improved test coverage 65% to 87.5%
Expanded tests on manager_test.go to cover more lines of code.

Fixes: #259

Signed-off-by: Eduardo Berrocal <eduardo.berrocal@intel.com>
2023-04-25 23:53:46 +00:00
Fabiano Fidêncio
f478b9115e clh: tdx: Update timeouts for confidential guest
Booting up TDX takes more time than booting up a normal VM.  Those
values are being already used as part of the CCv0 branch, and we're just
bringing them to the `main` branch as well.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-04-13 10:18:07 +02:00
Fabiano Fidêncio
3b3656d96d
Merge pull request #6522 from fidencio/topic/add-tdx-artefacts-from-2023ww01-to-main
tdx: Add artefacts from the latest TDX tools release into main
2023-04-11 20:43:02 +02:00
Fabiano Fidêncio
50ce33b02d
Merge pull request #6205 from fengwang666/non-root-clh
runtime: support non-root for clh
2023-04-11 19:34:00 +02:00
Fabiano Fidêncio
ed145365ec runtime/qemu: Drop "kvm-type=tdx"
This is not supported since 22ww49.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-04-11 15:23:42 +02:00
Fabiano Fidêncio
25b3cdd38c virtcontainers: Drop check for the tdx CPU flag
In the recent kernels provided by Intel the `tdx` CPU flag is not
present anymore.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-04-11 15:23:42 +02:00
Fabiano Fidêncio
01bdacb4e4 virtcontainers: Also check /sys/firmwares/tdx for TDX
Let's make sure we also check /sys/firmwares/tdx for TDX guest
protection, as the location may depend on whether TDX Seam is being used
or not.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-04-11 15:23:42 +02:00
Bin Liu
75987aae72
Merge pull request #6408 from jongwu/nydus_rm_hybrid
nydus: upgrad to v2.2.0
2023-03-28 11:07:56 +08:00
Hyounggyu Choi
96baa83895 agent: Bring in VFIO-AP device handling again
This PR is a continuing work for (kata-containers#3679).

This generalizes the previous VFIO device handling which only
focuses on PCI to include AP (IBM Z specific).

Fixes: kata-containers#3678
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2023-03-16 18:14:12 +09:00
Jakob Naucke
f666f8e2df agent: Add VFIO-AP device handling
Initial VFIO-AP support (#578) was simple, but somewhat hacky; a
different code path would be chosen for performing the hotplug, and
agent-side device handling was bound to knowing the assigned queue
numbers (APQNs) through some other means; plus the code for awaiting
them was written for the Go agent and never released. This code also
artificially increased the hotplug timeout to wait for the (relatively
expensive, thus limited to 5 seconds at the quickest) AP rescan, which
is impractical for e.g. common k8s timeouts.

Since then, the general handling logic was improved (#1190), but it
assumed PCI in several places.

In the runtime, introduce and parse AP devices. Annotate them as such
when passing to the agent, and include information about the associated
APQNs.

The agent awaits the passed APQNs through uevents and triggers a
rescan directly.

Fixes: #3678
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 10:07:48 +09:00
Jakob Naucke
b546eca26f runtime: Generalize VFIO devices
Generalize VFIO devices to allow for adding AP in the next patch.
The logic for VFIOPciDeviceMediatedType() has been changed and IsAPVFIOMediatedDevice() has been removed.

The rationale for the revomal is:

- VFIODeviceMediatedType is divided into 2 subtypes for AP and PCI
- Logic of checking a subtype of mediated device is included in GetVFIODeviceType()
- VFIOPciDeviceMediatedType() can simply fulfill the device addition based
on a type categorized by GetVFIODeviceType()

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 10:06:37 +09:00
Jakob Naucke
4c527d00c7 agent: Rename VFIO handling to VFIO PCI handling
e.g., split_vfio_option is PCI-specific and should instead be named
split_vfio_pci_option. This mutually affects the runtime, most notably
how the labels are named for the agent.

Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2023-03-16 07:43:39 +09:00
Sidhartha Mani
a6c67a161e
runtime: add support for ephemeral mounts to occupy entire sandbox memory
On hotplug of memory as containers are started, remount all ephemeral mounts with size option set to the total sandbox memory

Fixes: #6417

Signed-off-by: Sidhartha Mani <sidhartha_mani@apple.com>
2023-03-10 13:36:02 -08:00
Jianyong Wu
395645e1ce runtime: hybrid-mode cause error in the latest nydusd
When update the nydusd to 2.2, the argument "--hybrid-mode" cause
the following error:

thread 'main' panicked at 'ArgAction::SetTrue / ArgAction::SetFalse is defaulted'

Maybe we should remove it to upgrad nydusd

Fixes: #6407
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
2023-03-04 12:58:48 +08:00
Chelsea Mafrica
703589c279
Merge pull request #6369 from XDTG/6082/Fix-path-check-bypassed
runtime: use filepath.Clean() to clean the mount path
2023-02-27 17:24:50 -08:00
Bo Chen
3ac6f29e95 runtime: clh: Re-generate the client code
This patch re-generates the client code for Cloud Hypervisor v30.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.

Fixes: #6375

Signed-off-by: Bo Chen <chen.bo@intel.com>
2023-02-24 10:20:29 -08:00
XDTG
dc86d6dac3 runtime: use filepath.Clean() to clean the mount path
Fix path check bypassed issuse introduced by #6082,
use filepath.Clean() to clean path before check

Fixes: #6082

Signed-off-by: XDTG <click1799@163.com>
2023-02-24 15:48:09 +08:00
Feng Wang
cbe6ad9034 runtime: support non-root for clh
This change enables to run cloud-hypervisor VMM using a non-root user
when rootless flag is set true in the configuration

Fixes: #2567

Signed-off-by: Feng Wang <fwang@confluent.io>
2023-02-22 13:57:09 -08:00
James O. D. Hunt
5f6d747e6d
Merge pull request #6272 from cmaf/tracing-clh-returnctx-startVM
runtime: tracing: Fix missing ctx return
2023-02-14 08:17:45 +00:00
Bin Liu
e812c5ce66
Merge pull request #6076 from zhaojizhuang/reconnect
runtime: add reconnect timeout for vhost user block
2023-02-14 10:39:20 +08:00
Archana Shinde
7b4e5751ca
Merge pull request #5007 from larrydewey/update-rpb-main
SEV: Update ReducedPhysBits
2023-02-13 14:56:38 -08:00
Chelsea Mafrica
c453919911 runtime: tracing: Fix missing ctx return
Normally we return the context when creating a trace span so that the
ordering of spans w.r.t. calls is maintained in tracing output. Add
missing context for StartVM() for Cloud Hypervisor.

Fixes #6271

Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
2023-02-13 12:37:52 -08:00
zhaojizhuang
ca02c9f512 runtime: add reconnect timeout for vhost user block
Fixes: #6075
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-02-13 14:33:46 +08:00
Bin Liu
8a9392fd9d
Merge pull request #6188 from yahaa/Typo-fix
Typo: change tabs in comment to spaces
2023-02-11 11:19:11 +08:00
Larry Dewey
67b8f0773f SEV: Update ReducedPhysBits
Updating this field, as `cpuid` provides host level data, which is not
what a guest would expect for Reduced Phsycial Bits. In almost all
cases, we should be using `1` for the value here.

Amend: Adding unit test change.

Fixes: #5006

Signed-off-by: Larry Dewey <larry.dewey@amd.com>
2023-02-10 13:19:33 -06:00
yaoyinnan
bdf20b5d26 rootfs: support EROFS filesystem
For kata containers, rootfs is used in the read-only way.
EROFS can noticably decrease metadata overhead.

On the basis of supporting the EROFS file system, it supports using the config parameter to switch the file system used by rootfs.

Fixes: #6063

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
2023-02-11 00:44:13 +08:00
Bin Liu
407d3146e6
Merge pull request #6234 from UiPath/fix-clh-timeout
clh: Enforce API timeout only for vm.boot request
2023-02-08 21:33:56 +08:00
Alexandru Matei
ac64b021a6 clh: Enforce API timeout only for vm.boot request
launchClh already has a timeout of 10seconds for launching clh, e.g.
if launchClh or setupVirtiofsDaemon takes a few seconds the context's
deadline will already be expired by the time it reaches bootVM

Fixes #6240
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2023-02-08 11:14:51 +02:00
Bin Liu
56071c6e7b virtiofsd: change cache mod to const
Change cache mod from literal to const and place them in one place.

Also set default cache mode from `none` to `never` in
`pkg/katautils/config-settings.go.in`.

Fixes: #6151

Signed-off-by: Bin Liu <bin@hyper.sh>
2023-02-08 15:06:52 +08:00
joannejchen
9794c52c65 improvement: Fix naming conventions for span name and log subsystem
Normally, the span name should be the same as the function name, and the log subsystem should not contain spaces.

Fixes #6153

Signed-off-by: joannejchen <chenjjoanne@gmail.com>
2023-02-06 08:25:49 -06:00
yahaa
e071d9251f Typo: change tabs in comment to spaces
Fixes: #6150

Signed-off-by: yahaa <1477765176@qq.com>
2023-02-02 12:08:33 +08:00
Greg Kurz
334c4b8bdc runtime: Drop QEMU log file support
The QEMU log file is essentially about fine grain tracing of QEMU
internals and mostly useful for developpers, not production. Notably,
the log file isn't limited in size, nor rotated in any way. It means
that a container running in the VM could possibly flood the log file
with a guest triggerable trace. For example, on openshift, the log
file is supposed to reside on a per-VM 14 GiB tmpfs mount. This means
that each pod running with the kata runtime could potentially consume
this amount of host RAM which is not acceptable.

Error messages are best collected from QEMU's stderr as kata is doing
now since PR #5736 was merged. Drop support for the QEMU log file
because it doesn't bring any value but can certainly do harm.

Fixes #6173

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-31 09:20:29 +01:00
zhaojizhuang
9092c23a2e runtime: Add hmp for qemu
Fixes: #6092
Signed-off-by: zhaojizhuang <571130360@qq.com>
2023-01-29 14:22:04 +08:00
Greg Kurz
af125b1498
Merge pull request #5736 from gkurz/no-qemu-daemonize
runtime: Start QEMU undaemonized and get logs
2023-01-27 16:33:48 +01:00
Greg Kurz
39fe4a4b6f runtime: Collect QEMU's stderr
LaunchQemu now connects a pipe to QEMU's stderr and makes it
usable by callers through a Go io.ReadCloser object. As
explained in [0], all messages should be read from the pipe
before calling cmd.Wait : introduce a LogAndWait helper to handle
that.

Fixes #5780

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-24 23:09:17 +01:00
Greg Kurz
a5319c6be6 runtime: Start QEMU undaemonized
QEMU has always been started daemonized since the beginning. I
could not find any justification for that though, but it certainly
introduces a problem : QEMU stops logging errors when started this
way, which isn't accaptable from a support standpoint. The QEMU
community discourages the use of -daemonize ; mostly because
libvirt, QEMU's primary consummer, doesn't use this option and
prefers getting errors from QEMU's stderr through a pipe in order
to enforce rollover.

Now that virtcontainers knows how to start QEMU with a pre-
established QMP connection, let's start QEMU without -daemonize.
This requires to handle the reaping of QEMU when it terminates.
Since cmd.Wait() is blocking, call it from a goroutine.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-24 23:09:11 +01:00
Greg Kurz
bf4e3a618f runtime: Launch QEMU with cmd.Start()
LaunchCustomQemu() currently starts QEMU with cmd.Run() which is
supposed to block until the child process terminates. This assumes
that QEMU daemonizes itself, otherwise LaunchCustomQemu() would
block forever. The virtcontainers package indeed enables the
Daemonize knob in the configuration but having such an implicit
dependency on a supposedly configurable setting is ugly and fragile.

cmd.Run() is :

func (c *Cmd) Run() error {
	if err := c.Start(); err != nil {
		return err
	}
	return c.Wait()
}

Let's open-code this : govmm calls cmd.Start() and returns the
cmd to virtcontainers which calls cmd.Wait().

If QEMU doesn't start, e.g. missing binary, there won't be any
errors to collect from QEMU output. Just drop these lines in govmm.
Similarily there won't be any log file to read from in virtcontainers.
Drop that as well.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-24 23:09:11 +01:00
Greg Kurz
8a1723a5cb runtime: Pre-establish the QMP connection
Running QEMU daemonized ensures that the QMP socket is ready to
accept connections when LaunchQemu() returns. In order to be
able to run QEMU undaemonized, let's handle that part upfront.
Create a listener socket and connect to it. Pass the listener
to QEMU and pass the connected socket to QMP : this ensures
that we cannot fail to establish QMP connection and that we
can detect if QEMU exits before accepting the connection.
This is basically what libvirt does.

Signed-off-by: Greg Kurz <groug@kaod.org>
2023-01-24 23:09:11 +01:00
Tim Zhang
20196048bf
Merge pull request #6030 from liubin/fix/6029-use-system-hugepagesize
runtime: use system pagesize for hugepage test
2023-01-16 16:57:55 +08:00
Eric Ernst
3d573ba579
Merge pull request #6050 from egernst/goos-the-vc
virtcontainers: split out linux-specific bits for mount, factory
2023-01-13 15:28:42 -08:00
Eric Ernst
458fe865ea
Merge pull request #6052 from egernst/add-darwin-skeletons
Add darwin skeletons
2023-01-13 13:14:16 -08:00
Eric Ernst
923cd3fda1 virtcontainers: split out Linux parts from mount
Mount handling is often unique in Linux. Let's ensure that the common
parts remain in mount.go, while Linux speific parts are within a linux
file.

Fixes: #6049

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-13 11:14:56 -08:00
Eric Ernst
f82918f872
Merge pull request #6045 from egernst/fix-6044
Address issues with the initial vCPU pinning functionality
2023-01-13 11:06:42 -08:00
GabyCT
9c6e90fd55
Merge pull request #6043 from GabyCT/topic/fixerrormsg
virtcontainers: Fix misspelling in error message
2023-01-13 09:16:34 -06:00
Eric Ernst
60ff230d80 virtcontainers: Split the factory package into Linux and Darwin bits
- split template
- split factory
- add stubs for darwin

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-12 16:51:28 -08:00
Samuel Ortiz
ea06fe3afc virtcontainers: Add a Network API skeleton for Darwin
Empty for now.

Fixes: #6051

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-12 15:53:28 -08:00
Eric Ernst
6ee550e9a5 runtime: vCPUs pinning is sandbox specific, not hypervisor
While at it, make sure we persist this and fix a misc typo.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-12 15:44:25 -08:00
Peng Tao
4a4232b851
Merge pull request #6037 from bergwolf/github/no-netns
runtime: fix up disable_netns handling
2023-01-12 09:58:24 +08:00
Eric Ernst
e3d3b72fa2 virtcontainers: use resource control for setting CPU affinity
Let's abstract the CPU affinity, instead of calling linux only code from
sandbox.

Fixes: #6044

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-11 17:55:53 -08:00
Gabriela Cervantes
fc17d7cc41 virtcontainers: Fix misspelling in error message
This PR fixes a misspelling in the error message when it tries to run
a system without Confidential computing support.

Fixes #6042

Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
2023-01-11 21:58:07 +00:00
Peng Tao
12fd6ffc1f runtime: fix up disable_netns handling
With `disable_netns=true`, we should never scan the sandbox netns which
is the host netns in such case.

Fixes: #6021
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-01-11 12:25:24 +00:00
Bin Liu
8551853cfe runtime: use system pagesize for hugepage test
In TestHandleHugepages it will do a mount operation with different pagesizes,
but some systems only support 2M pagesize, test for a 1g pagesize will fail.

This commit try to fix by only mount pagesizes under `/sys/kernel/mm/hugepages`, which are
supported to mount by the OS.

Fixes: #6029

Signed-off-by: Bin Liu <bin@hyper.sh>
2023-01-11 17:02:58 +08:00
Eric Ernst
07e77f5be7
Merge pull request #5994 from dcantah/virtcontainers_tests_darwin
virtcontainers: tests: Ensure Linux specific tests are just run on Linux
2023-01-10 17:13:28 -08:00
Fabiano Fidêncio
147c56bb8d
Merge pull request #6019 from liubin/fix/6018-virtiofsd-cache-mod
Change cache mode from none to never
2023-01-10 23:12:13 +01:00
Bin Liu
8225d8044e
Merge pull request #6003 from dcantah/fs-skeleton
virtcontainers: fs_share: Add Darwin skeleton
2023-01-10 17:48:45 +08:00
Bin Liu
86a82cace9 runtime: change cache mode from none to never
New Rust virtiofsd's `cache` mode doesn't support `none` mode,
we should use `never` to replace it.

Fixes: #6018

Signed-off-by: Bin Liu <bin@hyper.sh>
2023-01-10 17:29:48 +08:00
Eric Ernst
4d53303a7d
Merge pull request #6005 from dcantah/vfw-skeleton
virtcontainers: Add a Virtualization.framework skeleton
2023-01-09 15:50:04 -08:00
Bin Liu
1bae41a4d4
Merge pull request #5996 from dcantah/vfw-initial
virtcontainers: Introduce hypervisor_darwin
2023-01-09 11:37:02 +08:00
Samuel Ortiz
fa9ae9362c virtcontainers: Add a Virtualization.framework skeleton
Fixes: #6004

A Virtualization.framework based Hypervisor implementation.
This is just stubs for now to eventually get this building.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-08 07:40:21 -08:00
Eric Ernst
d48b22bb13 virtcontainers: fs_share: add Darwin skeleton
Fixes: #6002

As a first pass for testing, let's add a skeleton for filesystem
sharing support on Darwin..

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-07 19:56:47 -08:00
Bin Liu
bc8a6423e0
Merge pull request #5986 from dcantah/nydus-nonetns
nydus: net-ns handling needs to be only executed on Linux hosts
2023-01-07 11:19:07 +08:00
Eric Ernst
fafc7a8b1a virtcontainers: tests: Ensure Linux specific tests are just run on Linux
Fixes: #5993

Several tests utilize linux'isms like Mounts, bindmounts, vsock etc.

Let's ensure that these are still tested on Linux, but that we also skip
these tests when on other operating systems (Darwin). This commit just
moves tests; there shouldn't be any functional test changes. While the
tests still won't be runnable on Darwin/other hosts yet, this is a necessary
step forward.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-06 11:09:11 -08:00
Fabiano Fidêncio
efa4fc0b25 clh: Add hotplug support for network devices
This is needed in order to have Moby / Docker working properly with
Cloud Hypervisor, as Moby / Docker relies on hotplugging a network
device to the VM as a preStartHook.

Fixes: #5997

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-01-06 18:59:47 +01:00
Fabiano Fidêncio
1074d2c1d3 clh: Make vmAddNetPutRequest capable of doing hotplugs
THe only bit needed for having the vmAddNetPutRequest() capable of
dealing with hotplugs, instead of only coldplugs, is making sure it
doesn't error out in case a `200` response is returned.

The 200 response means:
"""
The new device was successfully added to the VM instance.
"""

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2023-01-06 18:55:55 +01:00
Fabiano Fidêncio
175794458f
Merge pull request #5972 from bergwolf/github/hook
fix moby prestart hook handling
2023-01-06 14:54:39 +01:00
Eric Ernst
9ec8a13985 virtcontainers: introduce hypervisor_darwin
Fixes: #5995

Placeholder skeleton at this point - implementation will be added after
basic build refactoring lands.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-06 02:03:34 -08:00
Peng Tao
8bb68a9f28 vc/network: skip existing endpoints when scanning for new ones
So that addAllEndpoints() becomes re-entrant and we can use it to scan
netns changes.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-01-06 10:01:19 +00:00
Danny Canter
3886aad199 nydus: net-ns handling needs to be only executed on Linux hosts
Fixes: #5985

With nydus not being its own pkg, it is challenging to implement cleanly
in a virtcontainers package that isn't necesarily Linux-only. The
existing code utilizes network namespace code in order to ensure nydus
is launched in the host netns. This is very Linux specific - so let's
make sure we only carry this out in a linux specific file.

In the Darwin case, to allow for compilation at least, let's add a stub
for doNetNS. Ideally the nydus and vc code can be refactored /
decoupled.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-05 11:48:43 -08:00
Peng Tao
d085389127 vc: fix up UT for CreateSandbox API change
Need to adapt the UT as well.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-01-03 22:30:42 +08:00
Peng Tao
578a9c25f0 vc: rescan network endpoints after running prestart hooks
Moby relies on the prestart hooks to configure network endpoints. We
should rescan the netns after running them so that the newly added
endpoints can be found and plugged to the guest.

Fixes: #5941
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2023-01-03 22:30:41 +08:00
Danny Canter
86ee24b33c Runtime: Clarify mutability of global var
Was about to change `urandomdev` to a constant when I realized it's
intentionally mutable so it can be mocked in tests. There's other
comments to the same effect so clarify here as well.

Fixes: #5965

Signed-off-by: Danny Canter <danny@dcantah.dev>
2023-01-02 01:13:34 -08:00
Fabiano Fidêncio
f1381eb361
Merge pull request #4813 from ManaSugi/fix/add-selinux-agent
runtime,agent: Add SELinux support for containers inside the guest
2022-12-13 11:24:53 +01:00
Alexandru Matei
d04d45ea05 runtime: use pidfd to wait for processes on Linux
Use pidfd_open and poll on newer versions of Linux to wait
for the process to exit. For older versions use existing wait logic

Fixes: #5617

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2022-12-06 16:31:05 +02:00
Alexandru Matei
e9ba0c11d0 runtime: use exponential backoff for process wait
Initial wait period between checks is 1ms, and the
next ones are min(wait_period*5, 50ms)

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2022-12-06 16:30:58 +02:00
Alexandru Matei
71491a69c3 runtime: move process wait logic to another function
extract process wait logic to another function

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2022-12-05 13:32:04 +02:00
Alexandru Matei
92ebe61fea runtime: reap force killed processes
reap child processes after sending SIGKILL

Fixes #5739

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2022-12-05 13:31:58 +02:00
Manabu Sugimoto
c617bbe70d runtime: Pass SELinux policy for containers to the agent
Pass SELinux policy for containers to the agent if `disable_guest_selinux`
is set to `false` in the runtime configuration. The `container_t` type
is applied to the container process inside the guest by default.
Users can also set a custom SELinux policy to the container process using
`guest_selinux_label` in the runtime configuration. This will be an
alternative configuration of Kubernetes' security context for SELinux
because users cannot specify the policy in Kata through Kubernetes's security
context. To apply SELinux policy to the container, the guest rootfs must
be CentOS that is created and built with `SELINUX=yes`.

Fixes: #4812

Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
2022-11-29 19:07:56 +09:00
Bin Liu
6af037d379
Merge pull request #5154 from Yuan-Zhuo/main
agent: support systemd cgroup for kata agent.
2022-11-28 18:40:10 +08:00