Although we don't seem to be affected by
https://nvd.nist.gov/vuln/detail/CVE-2024-21626, we vendor and use the
runc package in a few different places of our code, and we better update
the package to its latest release.
Fixes: #9097
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Last but not least, all placeholders for argument replacement
should be configured to generate a configuration file when `QEMUCMD`
is defined. This enriches those variables.
Additionally, this involves creating a symbolic link to `configuration-qemu.toml`
if QEMU is defined as the default hypervisor.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
There are some variables newly introduced to runtime-rs, such as:
- runtime.name
- runtime.hypervisor_name
- runtime.agent_name
- vm_rootfs_driver
Additionally some of the placeholders for argument replacement are
made hypervisor-specific based on the changes made for dragonball.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For example, Kata CI's k8s-copy-file.bats transfers files between the
Host and the Guest using "kubectl exec", and that results in
CloseStdinRequest being called from the Host.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
1. Remove PullImageRequest because that is not used in the main
branch. It was used in the CCv0 branch.
2. Add default false values for the remaining Kata Agent ttrpc
requests.
These changes don't change the functionality of the auto generated
Policy, but they help with easier understanding the Policy text and
the logging from the Rego rules.
Fixes: #9049
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since cri-o doesn't seem to use address for event publishing as mentioned
in the previous commit it will not send it. However, the exact way of
not sending it is unfortunately different from what is assumed by
runtime-rs. Due to an implementation detail of cri-o which uses containerd
libraries for some low-level tasks, TTRPC_ADDRESS will not be missing from
environment as assumed, instead it will be present with an empty value.
This commit contains a small adjustment to account for that and use
LogForwarder even if TTRPC_ADDRESS is present, but with an empty value.
Fixes#8985
Signed-off-by: Pavel Mores <pmores@redhat.com>
This is needed to fix the bug which is not allowing to create SEV container
on SNP enabled host anymore. This is a regression that was introduced as
part of the following commit:
de39fb7d38Fixes: #9036
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
It makes sense to reuse a configuration template for runtime-golang
as a base. This is simply to copy it into the config directory.
Fixes: #8441
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It appears that under the shim v2 protocol, a shim has no use of its own
for the -address value, it just passes it back to container runtime's
(mostly containerd or cri-o) event-publishing binary. Since the -address
value only flows through the shim, being passed to the shim by a container
runtime and then essentially passed back by shim to the container runtime,
it seems inappropriate for a shim to validate the value that is fully
owned and only used by the container runtime.
This commit removes such validation from runtime-rs. Doing so, it solves
(part of) an interoperability problem between runtime-rs and cri-o. cri-o
seems to intentionally choose not to implement the event-publishing part
of the shim v2 protocol and thus it has no value it could pass to
runtime-rs for -address. As a result, it sends an empty string which has
been failing the excessive validation performed by runtime-rs so far.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The emergent Kata CI tests for Policy use confidential_guest = false
in genpolicy-settings.json. That value is inconsistent with the
following mount settings:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
We need to keep those settings for confidential_guest = true, and
change confidential_guest = false to use:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/rootfs/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
The value of the mount_source field is different.
This change unblocks testing using Kata CI's pod-empty-dir.yaml:
genpolicy -u -y pod-empty-dir.yaml
kubectl apply -f pod-empty-dir.yaml
k get pod sharevol-kata
NAME READY STATUS RESTARTS AGE
sharevol-kata 1/1 Running 0 53s
Fixes: #8887
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Implementing Persist API for cloud-hypervisor was done partially with
initial support for cloud-hypervisor. Store and retrieve additional
fields to/from the hypervisor state.
Fixes: #6202
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The original handling method does not reach user expectations.
When the ClientSocketAddress method stats the corresponding
path of runtime-rs and has not found it yet, we should return
an error message here that includes the reason for the failure
(which should be an error display indicating that both runtime-go
and runtime-rs were not found). Instead of simply displaying the
corresponding path of runtime-rs as the final error message to
users.
It is also necessary to return the error promptly to the caller
for further error handling.
Fixes: #8999
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Now that we have a confidential image / initrd being built, instead of a
specific one for each TEE, let's use it everywhere possible.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With this we can properly generate and the the `-confidential` kernel,
which supports SEV / SNP / TDX as part of our configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Until now, runtime-rs couldn't be compiled on s390x.
We need to lift those restrictions in Makefile first.
Fixes: #8446
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It fails to compile virt_container because Dragonball is only
used in the implementation of the trait method Persist::restore().
As the hypervisor is not compiled on s390x and QEMU implements
the trait method, this commit is to let the method use QEMUi's.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Dragonball and cloud-hypervisor are not supported on s390x.
We need to exclude the plugins for these hypervisors from compilation.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This patch can reduce load on systemd process, and
increase the k8s deployment density when using go runtime.
Fixes: #8758
Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Kata CI's pod-sandbox-vcpus-allocation.yaml ends with "---", so the
empty YAML document following that line should be ignored.
To test this fix:
genpolicy -u -y pod-sandbox-vcpus-allocation.yaml
Fixes: #8895
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Allow users to specify in genpolicy-settings.json a default cluster
namespace other than "default". For example, Kata CI uses as default
namespace: "kata-containers-k8s-tests".
Fixes: #8976
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
There is a race condition in agent HVSOCK_STREAMS hashmap, where a
stream may be taken before it is inserted into the hashmap. This patch
add simple retry logic to the stream consumer to alleviate this issue.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Linux forbids opening an existing socket through /proc/<pid>/fd/<fd>,
making some images relying on the special file /dev/stdout(stderr),
/proc/self/fd/1(2) fail to boot in passfd io mode, where the
stdout/stderr of a container process is a vsock socket.
For back compatibility, a pipe is introduced between the process
and the socket, and its read end is set as stdout/stderr of the
container process instead of the socket. The agent will do the
forwarding between the pipe and the socket.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
In passfd io mode, when not using a terminal, the stdout/stderr vsock
streams are directly used as the stdout/stderr of the child process.
These streams are non-blocking by default.
The stdout/stderr of the process should be blocking, otherwise
the process may encounter EAGAIN error when writing to stdout/stderr.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
We want the io connection keep connected when the containerd closed
the io pipe, thus it can be attached on the io stream.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Support the hybrid fd passthrough mode with passing pipe fd,
which can specify this connection kept even when the pipe
peer closed, and this connection can be reget wich re-opening
the pipe.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
In linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic
when creating containers in detached mode, as the stdin FIFO writer
is closed after the container is created, resulting in this situation.
In passfd io mode, open stdin fifo with O_RDWR|O_NONBLOCK to avoid the
HUP event.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When container exits, the agent should clean up the term master fd,
otherwise the fd will be leaked.
Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When one end of the connection close, the epoll event will be triggered
forever. We should close the connection and kill the connection.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Currently in the kata container, every io read/write operation requires
an RPC request from the runtime to the agent. This process involves
data copying into/from an RPC request/response, which are high overhead.
To solve this issue, this commit utilize the vsock fd passthrough, a
newly introduced feature in the Dragonball hypervisor. This feature
allows other host programs to pass a file descriptor to the Dragonball
process, directly as the backend of an ordinary hybrid vsock connection.
The runtime-rs now utilizes this feature for container process io. It
open the stdin/stdout/stderr fifo from containerd, and pass them to
Dragonball, then don't bother with process io any more, eliminating
the need for an RPC for each io read/write operation.
In passfd io mode, the agent uses the vsock connections as the child
process's stdin/stdout/stderr, eliminating the need for a pipe
to bump data (in non-tty mode).
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Two toml options, `use_passfd_io` and `passfd_listener_port` are introduced
to enable and configure dragonball's vsock fd passthrough io feature.
This commit is a preparation for vsock fd passthrough io feature.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Using custom input paths with -i is counter-intuitive. Simplify path handling with explicit flags for rules.rego and genpolicy-settings.json.
Fixes: #8568
Signed-Off-By: Malte Poll <1780588+malt3@users.noreply.github.com>
When creating a cgroup, add a SingleContainer when obtaining the OCI Spec to apply to ctr, podman, etc.
Fixes: #5240
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Removed the setting of default values for runtime fields. Added explicit checks for missing or empty fields, reporting errors with clear messages.
Fixes: #8838
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
The `noop-method-call` is a rustc lint that has existed since v1.52.0.
This lint has been moved to the warn by default lint level since v1.73.0.
Therefore build is failing with this version and above.
This commit removes the unnecessary call to `<&T as Deref>::deref` on `T: !Deref`.
Fixes: #8586
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
The auto-generated Policy already allows these volumes to be mounted,
regardless if they are:
- Present, or
- Missing and optional
Fixes: #8893
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Qemu stderr monitoring runs in its own asynchronous green thread.
For that, `stderr` is taken out of the Child representing the qemu child
process to avoid partial move and make it possible for the main thread
still to call functions on QemuInner::qemu_process (e.g. kill(), id()).
Fixes#8937
Signed-off-by: Pavel Mores <pmores@redhat.com>
We'll want to capture qemu's stderr in parallel with normal runtime-rs
execution. Tokio's primitives make this much easier than std's. This
also makes child process management more consistent across runtime-rs
(i.e. virtiofsd child process is already launched and managed using tokio).
Some changes were necessary due to tokio functions being slightly different
from their std counterparts. Child::kill() is now async and Child::id()
now returns an Option.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Allow Kata CI's pod-nested-configmap-secret.yaml to work with
genpolicy and current cbl-mariner images:
1. Ignore the optional type field of Secret input YAML files.
It's possible that CoCo will need a more sophisticated Policy
for Secrets, but this change at least unblocks CI testing for
already-existing genpolicy features.
2. Adapt the value of the settings field below to fit current CI
images for testing on cbl-mariner Hosts:
"kata_config": {
"confidential_guest": false
},
Switching this value from true to false instructs genpolicy to
expect ConfigMap volume mounts similar to:
"configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "^$(cpath)/watchable/$(bundle-id)-[a-z0-9]{16}-",
"driver": "watchable-bind",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
},
instead of:
"confidential_configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "$(sfprefix)",
"driver": "local",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
}
},
This settings change unblocks CI testing for ConfigMaps.
Simple sanity testing for these changes:
genpolicy -u -y pod-nested-configmap-secret.yaml
kubectl apply -f pod-nested-configmap-secret.yaml
kubectl get pods | grep config
nested-configmap-secret-pod 1/1 Running 0 26s
Fixes: #8892
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This fixes a flaw pointed out in review of PR #8185. Creation of the
directory semantically fits better into VM preparation than VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Validating the node name is currently outside the scope of the CoCo
policy.
This change unblocks testing using Kata CI's test-pod-file-volume.yaml
and pv-pod.yaml.
Fixes: #8888
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the unused DriverInfo declaration or integrate it into the codebase where applicable.
Fixes: #8927
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Add metadata containing the Policy annotation if the user didn't
provide any metadata in the input yaml file.
For a simple sanity test using a Kata CI YAML file:
genpolicy -u -y job.yaml
kubectl apply -f job.yaml
kubectl get pods | grep job
job-pi-test-64dxs 0/1 Completed 0 14s
Fixes: #8891
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
SharedVersion fiel add a versiontable property that isn't supported by upstream QEMU.
This is dead code since virtcontainers isn't setting SharedVersions to true.
Fixes: #7720
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
Ignore pod DNS settings because policing the network traffic is
currently outside the scope of the Agent Policy.
Example from Kata CI: pod-custom-dns.yaml
Fixes: #8832
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Deploy the framework added by the previous commit to generate qemu
command line and launch the VM.
We now properly store the child process object which allows us to
implement remaining Hypervisor functions necessary for a simple but
successful VM lifecycle, get_vmm_master_tid() and stop_vm().
Fixes#8184
Signed-off-by: Pavel Mores <pmores@redhat.com>
- test_volume_capacity_stats: verify the file block size against the fetched size via statfs()
- test_reseed_rng: Correct the request codes for RNDADDTOENTCNT and RNDRESEEDCRNG when platform is ppc64le
- test list_routes: Add the route only if destination is not empty
- test_new_fs_manager: skip the test if cgroups v2 is used by default
- skip test cases rpc::tests::test_do_write_stream, sandbox::tests::test_find_process, sandbox::t
ests::test_find_container_process and sandbox::tests::add_and_get_container on ppc64le as they are fl
aky
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
kata-ctl currently fails to build on ppc64le. Skip it for running static checks and the issues will be fixed and tracked in a seperate issue.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
A few CPU related test cases were failing as the version was being verified against Power8 while the CI machine is Power9.
Fixes: #5531
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
At the moment, a project `dragonball` and `runtime-rs` does not support
for s390x. During the enablement, some errors due to the misconfiguration
of Makefile for `make check` and `make vendor` were identified.
This is to skip the build for the affected target of the projects.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Some linting errors were identified during the enablement of `make check`.
These have not been found by the Jenkins CI job because `make test` was
only triggered.
The errors for the `agent` occurs under the s390x specific tests while
the other ones for the `kata-ctl` are the architecture-specific code.
This commit is to fix those errors.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The idea of most of these is just to prevent running into todo!()s where
we can at the moment, while implementing the fundamental functionality of
VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
DirectVolume/Rawblock doesn't work well when device's block driver
is virtio-blk-pci and the storage handler is DRIVER_BLK_PCI_TYPE.
Fixes: #8707
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Temporarily disable the allow_storages() rules, because they are based
on the tarfs snapshotter + container image integrity information that
are not available yet in the main branch - see #8833.
Fixes: #8834
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Adjust genpolicy-settings.json to match the container root path from
the main branch + cbl-mariner Guest VMs.
This configuration might have to be adjusted again when other types of
Guest VMs will be tested during CI using genpolicy, in the future.
Also, improve logging from allow_root_path(), to easier debug these
issues in the future.
Fixes: #8835
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Replace the `todo!()` calls with a minimal NOP implementation to return
the CH driver to working order since the `todo!()`'s forcibly crash the
driver at runtime. Full implementations for these APIs will be added on
issues #8800, #8801, and #8802.
Fixes: #8784.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the `todo!()` macro which would cause a runtime crash and replace
with a implementation that returns an error as a stop-gap until #8800 is
implemented.
Fixes: #8785.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
It is a little bit heavy for the runtime-rs to forwards events via
containerd CLI, contrast to the ttrpc way. Plus, for runtimes that haven't
this mechanism, e.g. CRI-O, we can't get those events anywhere.
This patch introduces two types of forwarders:
- `ContainerdForwarder`: Acquire ttrpc address from environment variables
and forward events via ttrpc connection.
- `LogForwarder`: Write event info into logs.
Fixes: #7881
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This is not necessarily meant to work, just to stub out unimplemented
functionality while focusing on more fundamental things.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The agent registers an event fd in `memory.oom_control`. An OOM event is
forwarded to containerd when the event is emitted, regardless of the
content in that file.
I observed content indicating that events should not be forwarded, as shown
below. When `oom_kill` is set to 0, it means no OOM has occurred. Therefore,
it is important to check the content to avoid mistakenly forwarding OOM
events.
```
oom_kill_disable 0
under_oom 0
oom_kill 0
```
Fixes: #8715
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Don't release the lock between is_allowed and set_policy calls,
because the policy might change in between these calls.
Also, move more policy code into policy.rs.
Fixes: #8734
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
- `ttrpc` from `0.7.1` to `0.8`.
- `containerd-shim-protos` from `0.3.0` to `0.6.0`.
Fixes: #8756
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
In order to avoid rust-vmm upstream change breaks Dragonball
compilation, we introduce Cargo.lock to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
In order to avoid rust-vmm upstream change breaks Dragonball
compilation, we introduce Cargo.lock to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
As reported in #8767, we have found that the root cause is that rust-vmm's vmm-sys-utils
introduce a new release 0.12.1 and dbs-pci rely on rust-vmm's vfio-ioctls which uses >=
to declare vmm-sys-utils so it automatically upgrade vmm-sys-utils to 0.12.1.
That's how two different versions of vmm-sys-utils is introduced and this breaks the compilation.
In order to fix this and also avoid future problems, we introduce Cargo.lock file to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Previously, Dragonball did not support PCI device hot-plugging or
VFIO device passthrough. Therefore, the runtime-rs support for
Dragonball was incomplete. it is time to complete it so that users
can use Dragonball's PCI hot-plugging and VFIO passthrough capabilities.
Fixes: #8748
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
vfio commits introduce quite a lot change in runtime-rs, this commit is
for all the changes related to ci, including compilation errors and so on.
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce two new vmm action to implement pci hotplug
and pci hot-unplug: PrepareRemoveHostDevice and RemoveHostDevice.
PrepareRemoveHostDevice is to call upcall to unregister the pci device
in the guest kernel.
RemoveHostDevice should be called after PrepareRemoveHostDevice, it is used
to clean the PCI resource in the Dragonball side.
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce a new vmm action InsertHostDevice to passthrough
host pci devices like NIC or GPU devices into guest so that
users could have high performance usage of those devices.
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Add a pcie_topology field to DeviceManager and initialize
pcie_topology when ResourceManager calls DeviceManager's new()
with TopologyConfigInfo.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Before calling the device driver to attach a device, register
the device to PCIe topology and allocate a PciPath for it.
However, for some hypervisor such as CLH, the allocation is invalid
when plugging devices to VM, they have the ability to return
DeviceInfo containing PciPath. It'll update the PciPath with the
returned pci path in the PCIe topology for them to prevent the
inferred pcipath from being different from the actual value returned.
But the update will not be executed if the pcipath value doesn't change.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce helper macros to simplify PCIe device register/unregister
and update, which provides a convenient way to handle devices in
topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add one more argument with type &mut Option<&mut PCIeTopology>
in attach and detach to inroduce methods within PCIe Topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Implement Trait PCIeDevice register/unregister for pcie/pci
device, such as vfio device which needs set/get device's pci
path for kata agent's device handler.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce Trait PCIeDevice with register/unregister, which are
used to register or unregister pcie device within the PCIe topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Due to different ways that different VMMs handle PCI devices,
we expect to provide a general PCIe topology processing framework
that is as compatible as possible with VMMs such as dragonball,
qemu, clh(Though it has its own management method, no conflict).
Currently,it's mainly developed for kinds of PCIe/PCI devices in
dragonball/clh which are attached on the pci/pcie root bus directly.
More will be added when Qemu is ready in runtime-rs.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
A TopologyConfigInfo added to store device config info for PCIe/PCI
devices in the VM from Hypervisor DeviceInfo.
And TopologyConfigInfo::new will be the entry to initialize PCIe
Topology for each VM.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
vfio mod collects lots of information related to the vfio operations, including VfioMsi and VfioMsix capability & state,
vfio interrupt info, pci region infor and vfio pci device info & state.
fixes: #8722
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Bridge the gap between user requirements for direct block device access
and the DirectVolume capabilities provided by Kata runtimes
(kata-runtime/runtime-rs), and facilitate seamless integration with CSI
to improve user experience.
It aims to integrate DirectVolume CSI support into Kata, enabling users
to benefit from its performance and flexibility advantages.
Fixes: #8602
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch introduces a feature of supporting vhost-user-blk device.
This device needs to be defined before the VM instance is started,
which can be done through the dbs-cli tool with --virblks option:
--virblks '{
"drive_id": "8623",
"device_type": "Spdk",
"path_on_host": "spdk:///var/tmp/vhost.sock",
"is_root_device": false,
"is_read_only": false,
"is_direct": false,
"no_drop": false,
"num_queues": 1,
"queue_size": 256
}'
Fixes: #8631
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: fupan <fupan.lfp@antgroup.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
The compiler will give a warning if a developer forget to add an arm for
a new variants defined.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
DAN reads vhost-user-net device from JSON config. It only supports VMM
running as server right now.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes involve:
- Expose VhostUserConfig struct to runtime-rs.
- Set a default value while num_queues or queue_size are 0.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This commit introduces VhostUserEndpoint and supports relative to
vhost-user-net devices for device manager. For now, Dragonball is able to
attach vhost-user-net devices.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Config space of network device is shared and accord with virtio 1.1 spec.
It is a good way to abstract the common part into one function.
`set_config_space()` implements this.
Plus, this patch removes `vq_pairs` from vhost-net devices, since there is
a possibility of data inconsistency. For example, some places read that
from `self.vq_pairs`, others read from `queue_sizes.len() / 2`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Also support alternative media type and update samples
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Add application that infers K8s user's intentions based on user's
K8s YAML file, and generates a Rego/OPA based policy for that YAML.
Just Pod YAML files are supported as input using this initial source
code. Support for other types of YAML files will come with upcoming
commits.
Fixes: #7673
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
introduce msi/msix mod to maintain information for PCI Message Signalled
Interrupt Extended Capability. It will be initialized when parsing pci
configuration space and used when getting interrupt capabilities.
fixes: #8661
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This PR introduces vhost-user-net devices to Dragonball. The devices are
allowed to run as server on the VMM side.
Fixes: #8502
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Vhost-user-net has a dependency on address space from `MmioV2DeviceState`.
The addition of the address space is introduced in this patch. Plus, it
makes sure all unit tests have the according parameter as well.
Fixes: #8502
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
It's important to ensure that these tasks which setup vfio
devices are completed before add_device.
So Moving vfio device setup code to a dedicated method at device
building time which does not affect the behavior of other code.
And this change makes it easier to understand the difference
between create and attach, and also makes the boundaries
clearer.
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Make VhostUserConfig pci_path's type more specific, change it
from Option<String> to Option<PciPath>.
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add implementations for the following `Hypervisor` trait methods which
simply return the same details as the `get_vmm_master_tid()` method:
- `get_thread_ids()`
- `get_pids()`
Fixes: #6438.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This is to reintroduce a configuration rule for IBM Z Secure Execution,
where no initrd path should be configured. For the TEE of interest,
only a kernel image should be specified with `confidential_guest=true`.
Fixes: #8692
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`make SUPPORT_VIRTUALIZATION=1 test` iterates through all subcrates and
does test.
Plus, this patch fixes some issues about unit tests:
- Feed too much parameters to `I8042Device::new()`.
- Virtqueue checks have been introduced since `virtio-queue v0.7.0`.
- GHA might have no access to `/var/tmp` dir on runner.
Fixes: #8690
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
As a follow-up for #8516, guest_cid and vhost_fd are not necessarily initialised
via new(). Instead, the fields should be initialised later when they are really
used to construct hypervisor's parameters.
This commit is to separate init_config() from new() to initialise guest_cid
and vhost_fd and leave only the assignment of id for the existing function.
Fixes: #8671
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We use a matching direct-volume path to determine whether an OCI mount
is a DirectVolume. However, we should handle the case where no match is
found appropriately.
This error will be defined as a non-DirectVolume type when judging the
OCI mount but not failed.
Fixes: #8619
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
DirectVolume structure in runtime-rs is different from it in kata-runtime,
which causes they has no unified handling method for DirectVolumeMountInfo
and MountInfo.
We should align the two by simply adding the attribute #[serde(rename="x")
to each field in DirectVolumeMountInfo
Fixes: #8619
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Dragonball sets a default queue config in the case of `None`. The
queue_size and num_queues of vhost-net are set to `Some(0)` by default.
Therefore, we might get an invalid queue config. This patch fixes this
issue.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This patch set vhost-net as default backend of networking. It allows users
to set `disable_vhost_net` to `true` to reenable virtio-net backend.
Plus, which backend to use is a matter of hypervisor, runtime-rs will no
longer need to know that.
Fixes: #8608
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Make the CH (Cloud Hypervisor) `stop_vm()` method check the VM state before
attempting to stop the VM, and update the state once the VM has stopped.
This avoids the method failing if called multiple times which will
happen if the workload exits before the container manager requests that
the container stop.
This change ensures the CH driver finishes cleanly.
Fixes: #8629.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Add a `--show-default-config-paths` command line option for parity with
`kata-runtime`.
Note that this requires the `KataCtlCli.command` to be optional so that
the user can run simply:
```bash
$ kata-ctl --show-default-config-paths
```
... without also specifying a (sub-)command.
Fixes: #8640.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The layout of packed virtqueue isn't supported by `Endpoint::negotiate()`.
Communication between device and driver will be failed due to the failure
of parsing virtqueue if we don't disable the packed feature. This patch
fixes this issue.
Fixes: #8633
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
In order to follow up the PCI implementation in Dragonball, we need to
add PCI root device and root bus support.
root device is a pseudo PCI root device to manage accessing to PCI
configuration space.
root bus is mainly for emulating PCI root bridge and also create the PCI
root bus with the given bus ID with the PCI root bridge.
fixes: #8563
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Right now, cargo fmt check in Dragonball only test with the default
features but not all features. This will cause some code being untested
by the fmt tool.
This PR adds --all option for the Dragonball CI and also fix some code
that forgets to do cargo fmt --all.
fixes: #8598
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
If a wrong configuration.toml file is used by accidentally, runtime-rs
binary could run into panic because of unwrap().
This fixes the panic by returning errors instead of unwrap().
fixes: #8565
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Log-parser-rs was always intended to become a sub-functionality of
kata-ctl, but it was useful to develop it and initaly merge it as a
standalone program, and migrate it to a subcommand later.
Fixes#6797
Signed-off-by: Gabe Venberg <gabevenberg@gmail.com>
Since cloud-hypervisor is no longer built as an optional feature,
lets mention cloud-hypervisor in the list of hypervisors supported by
runtime-rs.
Fixes: #8587
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This patch implements the virtio-fs device used for filesystem sharing
and heavily based on the vhost-user protocol.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
Implement try_from trait function to convert runtime-rs BlockConfig
to cloud-hypervisor DiskConfig. This can allow for code reuse in the
future.
Fixes: #8581
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Add some test cases for vhost-user-fs function.
Signed-off-by: Beiyue <beiyue@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
This patch implements the virtio-fs device used for filesystem sharing
and heavily based on the vhost-user protocol.
This vhost-user-fs device defines 5 parameters:
- path: vhost-user socket path
- tag: mount tag used from the guest to mount the filesystem
- req_num_queues: number of request virtqueues
- queue_size: depth of each virtqueue
- cache_size: cache window size for dax
This device needs to be defined before the VM instance is started,
which can be done through the dbs-cli tool with --fs option:
--fs '{
"sock_path":"/path/to/virtiofs.socket",
"tag":"myfs",
"num_queues":1,
"queue_size":1024,
"cache_size":0,
"thread_pool_size":1,
"cache_policy":"auto",
"writeback_cache":true,
"no_open":true,
"xattr":true,
"drop_sys_resource":false,
"mode":"vhostuser",
"fuse_killpriv_v2":true,
"no_readdir":false,
}'
Fixes: #8428
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
This is required for clh to work with nerdtcl and docker.
This fixes the issues seen with nerdctl while starting a container.
Hoewever, container exit with docker is still broken due to an unrelated
issue.
Fixes: #8579
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
PR #8483 changed the location of the rust runtime config files to
`/etc/kata-containers/runtime-rs/`. However, if you haven't updated your
system to create that directory, attempting to create a container using
the rust runtime was giving the following cryptic message
(formatted for easier reading):
```
failed to handler message try init runtime instance
Caused by:
0: load config
1: load toml config
2: entity not found
```
Now, the message is as follows (again, reformatted for easier reading):
```
failed to handle message try init runtime instance
Caused by:
0: load config
1: load TOML config failed (tried [
\"/etc/kata-containers/runtime-rs/configuration.toml\",
\"/usr/share/defaults/kata-containers/runtime-rs/configuration.toml\",
\"/opt/kata/share/defaults/kata-containers/runtime-rs/configuration.toml\"
])
```
Fixes: #8557.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Bring support for legacy vsock and add Vsock to the ResourceConfig
enum type, and add the processing flow of the Vsock device to the
prepare_before_start_vm function.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Instroduce prepare_vm_socket_config to VirtSandbox for vm
socket config, including Vsock and Hybrid Vsock.
Use the capabilities() trait of the hypervisor to get the
vm socket supported in VMM.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add Cap of HybridVsockSupport for hypervisors CLH and Dragonball
which use hybrid-vsock, default for Qemu, which uses legacy vsock.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce HybridVsock Cap to judge which kind of vm socket will
be supported by the Hypervisor.
Use `is_hybrid_vsock_supported` to tell if an hypervisor supports
hybrid-vsock, if not, it supports legacy vsock.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Remove the redundant `kata-ctl` `root` check when running the `env`
command. This check duplicated the `GuestProtection` check, and that
check is now no longer necessary anyway.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
It is no longer necessary to be `root` to query the guest protection
(TDX) on `x86_64` systems, so drop the requirement.
> **Note:**
>
> This change drops the `nix` `Uid` import required for the `root` check.
> But at the same time it adds it for PPC64le since that implementation of
> `available_guest_protection()` needs it and it was previously missing.
Fixes: #8548.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
PR #8311 inadvertently broke the logging since no log messages below the
`Info` level are logged now, regardless of the requested log level.
Resolve the issue by storing the requested log level in the
`RuntimeComponentLevelFilter` and using that level in the `log()`
function, rather than hard-coding `Info` as the default where no entry
is found in the `FILTER_RULE` hashmap.
Fixes: #8546.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Device mapper is the only supported block device driver so far,
which seems limiting. Kata Containers can work well with other
block devices. It is necessary to enhance supporting of multiple
kinds of host block device.
Fixes#4714
Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
This commit inits dbs-pci lib for Dragonball to use.
It contains several implementation now:
1. PCI configuration space
2. PCI bus
More info of the design & behavior of those two features could be found
in the README of dbs-pci.
fixes: #8479
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This reverts commit b0157ad73a.
```
commit b0157ad73a
Refs: 3.3.0-alpha0-124-gb0157ad73
Author: Fabiano Fidêncio <fabiano.fidencio@intel.com>
AuthorDate: Fri Aug 11 14:55:11 2023 +0200
Commit: Fabiano Fidêncio <fabiano.fidencio@intel.com>
CommitDate: Fri Nov 10 12:58:20 2023 +0100
runtime: confidential: Do not set the max_vcpu to cpu
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
```
This commit was removing a requirement that was made previously, but due
to the SMP issue we're facing with the QEMU used for TDX (see commit
d1b54ede290e95762099fff4e0bcdad10f816126*), QEMU will fail to start due
to:
```
Invalid CPU topology: product of the hierarchy must match maxcpus:
sockets (1) * dies (1) * cores (1) * threads (1) != maxcpus (240)"
```
This has no affect on the SEV / SNP workflow and hopefully we'll be able
to re-revet this soon enough, when this gets solved on te QEMU side.
Last but not least, this is not a "clean" revert as we're using
conf.NumVCPUs() instead of conf.NumVCPUs, to ensure we're dealing with
uint32.
Fixes: #8532
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Update the remote hypervisor code to match the re-genned code for
the ttrpc Hypervisor Service
Fixes: #8519
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The new clean-generated-files make target allows for removing the
generated files (including the configuration.toml files).
The tools/packaging/static-build/shim-v2/build.sh script now uses that
target to always force the re-generation of those files.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
As the configuration files are different, we can safely remove those as
any new installation of the binary should also bring in the new
configurations.
This makes things less error-prone in the future, as we're ensuring that
the rust runtime will only be reading the rust configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Update the `DEFAULT_RUNTIME_CONFIGURATIONS` list to include a number of
rust runtime specific paths to try to load before checking the
"traditional" (golang) runtime configuration paths.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Install the rust runtime configuration files to a `runtime-rs/`
directory to distinguish them from the golang config files (which may
have a different syntax).
The default values mean that the rust config files are now installed to
`/opt/kata/share/defaults/kata-containers/runtime-rs/` rather than
`/opt/kata/share/defaults/kata-containers/`.
See: https://github.com/kata-containers/kata-containers/issues/6020Fixes: #8444.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
(1) Add enum DirectVolumeType for direct volumes.
(2) Reimplement spdk volume into direct_volume and
do alignment of rawblock volume.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The current infra(K8S, CSI, CRI, Containerd) for Kata containers is
unable to properly handle direct volumes, resulting in the need for
workarounds like searching/comparision and then patch up volume type.
In this commit, reimplement of handling method is added to support
raw block volume which backends may be rawdisk or other format file.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
As vsock device will be used in Qemu or other VMMs, the Vsoock
is reintroduced to DeviceType enum.
Fixes: #8474
Signed-off-by: Pavel Mores <pmores@redhat.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Currently encounters difficulty in utilizing the clone operation
on VsockConfig due to the implicit management of the vhost fd
within the runtime-rs. This responsibility should be delegated to
the VMM(especially QEMU) child process, as it's not runtime-rs core
responsibilities. We'll remove the member vhost_fd from VsockConfig
and make the VsockConfig/VsockDevice Cloneable.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce a new function generate_vhost_vsock_cid to generate
a guest CID and set guest CID for vsock fd.
Also this commit wouldn't introduce functional change and it's
just splited from the previous VsockDevice::new().
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
(1) rawblock volume is directvol mount type.
(2) block volume is based on the bind mount type.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
1) Creating storage for all `io.katacontainers.volume=` messages in rootFs.Options,
and then aggregates all storages into `containerStorages`.
2) Creating storage for other data volumes and push them into `volumeStorages`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
To enhance the construction and administration of `Katavirtualvolume` storages,
this commit expands the 'sharedFile' structure to manage both
rootfs storages(`containerStorages`) including `Katavirtualvolume` and other data volumes storages(`volumeStorages`).
NOTE: `volumeStorages` is intended for future extensions to support Kubernetes data volumes.
Currently, `KataVirtualVolume` is exclusively employed for container rootfs, hence only `containerStorages` is actively utilized.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The snapshotter will place `KataVirtualVolume` information
into 'rootfs.options' and commence with the prefix 'io.katacontainers.volume='.
The purpose of this commit is to transform the encapsulated KataVirtualVolume data into device information.
Fixes: #8495
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Feng Wang <feng.wang@databricks.com>
Co-authored-by: Samuel Ortiz <sameo@linux.intel.com>
Co-authored-by: Wedson Almeida Filho <walmeida@microsoft.com>
Update cloud hypervisor implementation to allow hybrid vsock device to
be handled.
Fixes#6692
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
In the case of Cloud Hypervisor running on arm64 architecture,
only arm AMBA UART (pl011) is supported as the TTY. Consequently,
when enabling Hypervisor debug mode, it's essential to configure
the console as "ttyAMA0" rather than "ttyS0
Fixes: #8381
Signed-off-by: briwan01 <brian.wang@arm.com>
The test utils will be used by the upcoming feature tests: vhost-user-net,
vhost-user-blk and vhost-user-fs.
Signed-off-by: Beiyue <beiyue@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
The vhost-user connection management logic will be used by
the upcoming features: vhost-user-net, vhost-user-blk and
vhost-user-fs.
Fixes: #8448
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
It mainly focus on the two parts:
(1) redesign the ShareFsConfig with ShareFsMountConfig
The device mount operation must depend on the fact that sharefs
device exists, and re-design the structure of SharesFsConfig and
move the ShareFsMountConfig into it with Option type, which is to
describe the relation between ShareFsConfig and ShareFsMountConfig.
(2) move virtiofs into device manager
Currently, virtio-fs is still outside of the device manager.
To do Enhancement of device manager, it will bring virtio-fs
device in device-manager for unified management
Fixes: #7915
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In order to support PCI VFIO functionality in Dragonball, we should
first add PCI bus and PCI device Interrupt information in Dragonball
mptable setup process.
This patch add :
1. pci_legacy_irqs transfered to setup_mptable function.
2. pci bus support in mptable mem
3. pci interrupt support in mptable mem
fixes: #8449
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Add the corresponding data structure in the runtime part according to
kata-containers/kata-containers/pull/7698.
Fixes: #8472
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In order to support different pod VM instance type via
remote hypervisor implementation (cloud-api-adaptor),
we need to pass machine_type, default_vcpus
and default_memory annotations to cloud-api-adaptor.
The cloud-api-adaptor then uses these annotations to spin
up the appropriate cloud instance.
Reference PR for cloud-api-adaptor
https://github.com/confidential-containers/cloud-api-adaptor/pull/1088Fixes: #7140
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
(based on commit 004f07f076)
This patch updates the template configuration file for
the remote hypervisor to set static_sandbox_resource_mgmt
to be true. The remote hypervisor uses the peer pod config
to determine the sandbox size, so requires this to be set to
true by default.
Fixes: #6616
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit 938447803b)
Add the SELinux setting to ensure it is passed through to the remote
hypervisor
Fixes: #5936
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 3ef2fd1784)
This patch adds the support of the remote hypervisor type.
Shim opens a Unix domain socket specified in the config file,
and sends TTPRC requests to a external process to control
sandbox VMs.
Fixes#4482
Co-authored-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit f9278f22c3)
This patch adds a protobuf definiton of the remote hypervisor type.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 150e8aba6d)
This PR fixes k8's configmap/secrets etc update propagation when filesystem sharing is disabled.
The commit introduces below changes with some limitations:
- creates new timestamped directory in guest
- updates the '..data' symlink
- creates user visible symlinks to newly created secrets.
- Limitation: The older timestamped directory and stale user visible symlinks exist in guest
due to missing DELETE api in agent.
Fixes: #7398
Signed-off-by: Sumedh Alok Sharma <sumsharma@microsoft.com>
Improve the `CODEOWNERS` file by specifying more groups.
Since GitHub automatically checks the `CODEOWNERS` file when a PR is
created and adds all matching groups as reviewers for the PR, this may
help reduce the PR backlog since the right people will be alerted and
requested to review the PR. That should improve the quality of reviews
(and thus the quality of the landed code). It may also have a positive
effect on PR velocity.
> **Note:**
>
> This PR combines the other `CODEOWNERS` files so we have
> a single, visible, top-level file.
See: https://github.com/kata-containers/community/issues/253Fixes: #3804.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This commit enables StratoVirt hypervisor to be tested in kata GHA,
incluing k8s, metrics, cri-containerd, nydus and so on.
Meanwhile, adding some unit tests for StratoVirt to make sure it works.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Add configuration-stratovirt.toml.in to generate the StratoVirt configuration,
and parser to deliver config to StratoVirt.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Initial support of the MicroVM machine type of StratoVirt
hypervisor for the kata go runtime.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
When multiple containers in a kata pod share one direct volume,
it's important to make sure that the corresponding block device
is only mounted once in the guest. This means that there should
be only one mount entry for the device in the mount information.
Fixes: #8328
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When a direct volume is used by multiple containers in Kata,
Generating many shared paths with cids will cause IO error
as the result of one direct volume mounts more than once.
To correct it, use the device_id instead of cid which
ensures that the guest only mounts the FS once.
Fixes: #8328
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch is to remove vhost-net dependency on virtio-net for
dbs-virtio-devices crate. Then, the feature of vhost-net is able to enable
without enabling virtio-net device, error, etc.
Fixes: #8423
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Introduce the `update_device` trait in Hypervisor to enable
device updates for VMMs.This trait will initially be utilized
for virtiofs Mount operations.
Fixes: #7915
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
PR #8311 inadvertently broke the runtime-rs / Cloud Hypervisor TDX
handling. It also introduced unrecoverable failure scenarios. Hence,
replace slow, fallible regex matching in logging fast path with single pass
non-failing multi-string log level matching.
Also, added a unit test for `parse_ch_log_level()`.
Fixes: #8418.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
- Remove two panic statements from InsertNetworkDevice test.
- Rename `NUM_QUEUES` to `DEFAULT_NUM_QUEUES`, `QUEUE_SIZE` to
`DEFAULT_QUEUE_SIZE` for vhost-net and virtio-net.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
`test_networkconfig_to_netconfig` from clh depends on `NetworkConfig` which
has some new fields in this PR. Therefore, this commit gives the test
missing fields.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Dragonball's vhost-net feature not depends on virtio-net feature.
- Remove `TapError` from dbs-virtio-devices's Error, and add `VirtioNet`
and `VhostNet` two fields.
- Downgrade visiblity of two fields of `VhostNetDeviceMgr` from
`pub(crate)`.
- File an issue to record a todo for network rate limiter.
- Print internal errors with `{0:?}.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
With the change done in the last commit, instead of calculating milli
cpus, we're actually converting the CPUs to a fraction number, a float.
Let's update the function name (and associated vars) to represent that
change.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
First of all, this is a controversial piece, and I know that.
In this commit we're trying to make a less greedy approach regards the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.
The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs
The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs. However, it leads us to, in several cases, be
wasting one vCPU.
Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.
In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs
With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
describe above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPUs
* Starts / Updates the VMM to use 1 vCPUs
In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).
There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else. Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.
One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.
Fixes: #6909
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
TestCheckHostIsVMContainerCapable removes sysModuleDir to simulate a
case that the kernel modules are not loaded. However,
checkKernelModules() executes modprobe <module> if a module not
found in that directory. Loading those modules is required to be denied
temporarily.
Fixes: #8390
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Disable device cgroup restriction while pod cgroup is not available.
- Remove balcklist-related names and change whitelist-related names to
allowed_all.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
FSManager of systemd cgroup manager is responsible for setting up cgroup
path. The container launching will be failed if the FSManager is in
read-only mode.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes include:
- Change to debug logging level for resources after processed.
- Remove a todo for pod cgroup cleanup.
- Add an anyhow context to `get_paths_and_mounts()`.
- Remove code which denys access to VMROOTFS since it won't take effect. If
blackmode is in use, the VMROOTFS will be denyed as default. Otherwise,
device cgroups won't be updated in whitelist mode.
- Add a unit test for `default_allowed_devices()`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The runk is a standard OCI runtime that isnt' aware of concept of sandbox.
Therefore, the `devcg_info` argument of `LinuxContainer::new()` is
unneccessary to be provided.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The target is to guarantee that containers couldn't escape to access extra
devices, like vm rootfs, etc.
Assume that there is a cgroup, such as `/A/B`. The `B` is container cgroup,
and the `A` is what we called pod cgroup. No matter what permissions are
set for the container (`B`), the `A`'s permission is always `a *:* rwm`. It
leads that containers could acquire permission to access to other devices
in VM that not belongs to themselves.
In order to set devices cgroup properly, the order of setting cgroups is
that the pod cgroup comes first and the container cgroup comes after.
The `Sandbox` has a new field, `devcg_info`, to save cgroup states. To
avoid setting container cgroup too early, an initialization should be done
carefully. `inited`, one of the states, is a boolean to indicate if the pod
cgroup is initialized. If no, the pod cgroup should be created firstly, and
set default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.
If whitelist mode which allows containers to access all devices in VM is
enabled, then device resources from OCI spec are ignored.
This feature not supports systemd cgroup and cgroup v2, since:
- Systemd cgroup implemented on Agent hasn't supported devices subsystem so
far, see: https://github.com/kata-containers/kata-containers/issues/7506.
- Cgroup v2's device controller depends on eBPF programs, which is out of
scope of cgroup.
Fixes: #7507
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Since moving from network coldplug to hotplug, the only case verified
was veth endpoints. Support for network hotplug for ipvlan and macvlan was
broken/not added. Fix it.
Fixes: #8391
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Remove the redundant `VmConfigError::EmptyVsockSocketPath` error from
the Cloud Hypervisor config crate since this scenario is already handled
by the `VsockConfigError::NoVsockSocketPath` error.
Fixes: #8385.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the redundant `parse_mac()` function: this was never used and we
already have an implementation in `crates/resource/src/network/utils/mod.rs`.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Virtio-net and vhost-net share a common virtio config, and vhost-user-net
uses another config, named `VhostUserConfig`. Thus, the virtio config could
be added into `NetworkConfig` instead of `Backend`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Moving Dragonball structs convertions out of device drivers to keep driver
neutral. The convertions include `NetworkBackend` to
`DragonballNetworkBackend` and `NetworkConfig` to
`DragonballNetworkConfig`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Changes include:
- Merge `VhostNetDeviceError` import item.
- Replace if with match in `add_vhost_net_device()`
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Network backends determine the virtio dataplane implementations. Common
protocols include virtio-net, vhost-net and vhost-user-net, etc. Network
config has a new field named `backend` to specify which protocol to use.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
PLEASE NOTE THAT this pull request just implements vhost-net support for
Dragonball, and adaptation for the Runtime-rs. And this pull request
DOESN'T provide an item to config which backend to use. To sum up,
virtio-net as a default backend is only choice for the user so far.
This pull request introduces vhost-net device for the Dragonball. In
addition, this pull request includes changes of Runtime-rs to improve
network configuration abilities.
The Dragonball part implements a vhost-net device and a vhost-net device
manager, named `VhostNetDeviceMgr`, to manage vhost-net device.
`NetworkInterfaceConfig` is introduced as a high-level abstract for network
config. Then, the Dragonball is able to distinguish network backends, e.g.
virtio-net, vhost-net, vhost-user-net(WIP), etc.
The Runtime-rs part adds support of multiple network backends as well.
`NetworkConfig` has a couple of new fields, like `backend`,
`use_shared_irq`, etc. And Dragonball's network config structs are
implmented `From` trait which allow to be converted from the Runtime-rs's
network config conveniently.
Fixes: #7674
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: wllenyj <wllenyj@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
TestCheckHostIsVMContainerCapable is failing on AMD machines.
kata-check_amd64_test.go:96 has no AMD modules, also getCPUType is
missing.
Fixes#8384.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
This update includes necessary changes due to the version bump of
containerd and its dependencies. It's part of a broader initiative to
phase out gogo protobuf, which has been deprecated, and to align with
the current supported libraries.
Fixes#7420.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
The +fieldpath option, specific to gogoprotobuf, enabled dynamic field
access in protobuf messages, allowing nested fields to be accessed via
string paths.
This change is part of a larger effort to transition to the official Go
protobuf library for better maintainability and community support.
Upon review, no instances of dynamic field access were found in the
codebase, confirming that the feature is not in use.
By removing this unused feature, we simplify the build process and make
it easier to complete the transition away from gogoprotobuf.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Those mappings are not used by our .proto files and there is no
difference between .pb.go files generated.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Remove earlier functionality that tries to assign PCI path to vfio
devices from the host assuming pci slots to start from 1.
Get this from the hypervisor instead.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
If PCI path for block device is not empty for a block device, use
that as identifier for agent instead of virt path which is valid only
for mmio devices.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Block(virtio-blk) and vfio devices are currently not handled correctly
by the agent as the agent is not provided with correct PCI paths for
these devices.
The PCI paths for these devices can be inferred from the PCI information
provided by the hypervisor when the device is added.
Hence changing the add_device trait function to return a device copy
with PCI info potentially provided by the hypervisor. This can then be
provided to the agent to correctly detect devices within the VM.
This commit includes implementation for PCI info update for
cloud-hupervisor for virtio-blk devices with stubs provided for other
hypervisors.
Removing Vsock from the DeviceType enum as Vsock currently does not
implement the Device Trait, it has no attach and detach trait functions
among others. Part of the reason is because these functions require Vsock
to implement Clone trait as these functions need cloned copies to be
passed down the hypervisor.
The change introduced for returning a device copy from the add_device
hypervisor trait explicitly requires a device to implement
Copy trait. Hence removing Vsock from the DeviceType enum for now, as
its implementation is incomplete and not currently used.
Note, one of the blockers for adding the Clone trait to Vsock is that it
currently includes a file handle which cannot be cloned. For Clone and
Device Traits to be implemented for Vsock, it requires an implementation
change in the future for it to be cloneable.
Fixes: #8283
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8378
Signed-off-by: Bo Chen <chen.bo@intel.com>
make check is giving us the following error:
error: this expression creates a reference which is immediately
dereferenced by the compiler.
Fixes#8344
Signed-off-by: Beraldo Leal <bleal@redhat.com>
By modifying RuntimeLevelFilter drain to improve logging control,
enabling isolation of change effect of the loggers between components,
tuning clh logs to be logged according to their log levels
given by cloud-hypervisor.
Fixes: #8310
Signed-off-by: Ruoqing He <linuxwatcher@outlook.com>
This PR adds the tracing capability for dragonball and it depends on the tracing::Subscriber of the upper layer.
Fixes: #7249
Signed-off-by: Songqian Li <mail@lisongqian.cn>
We used the approach of cold-plugging network interface for pre-shimv2
support for docker.Since the hotplug approach was not required,
we never really got to implementing hotplug support for certain network
endpoints, ipvlan and macvlan being among them.
Since moving to shimv2 interface as the default for
runtime, we switched to hotplugging the network interface for supporting
docker and nerdctl. This was done for veth endpoints only.
Implement the hot-attach apis for ipvlan and macvlan as well to support
ipvlan and macvlan networks with docker and nerdctl.
Fixes: #8333
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Legacy device manager adds device metrics to METRICS when a device is created and removes metrics when a device is dropped.
Fixes: #7248
Signed-off-by: Songqian Li <mail@lisongqian.cn>
Add the hypervisor security details to the output of the `kata-runtime
env` and `kata-ctl env` commands so the user can see, amongst other
things, the value of `confidential_guest`.
Fixes: #8313.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The config template file for clh is in the new format for runtime-rs.
It is a result of merging the new format file and options supportted by
cloud-hypervisor.
Some config options from the golang runtime are missing as they may not
be currently supported by the rust runtime. An example of this is the
selinux options, rate limiting options as these are not currently
supported or verified with the rust runtime.
Fixes: #8249
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Balloon device manager adds balloon device metrics to METRICS when a device is created and remove metrics when a device is dropped.
Fixes: #7248
Signed-off-by: Songqian Li <mail@lisongqian.cn>
This is to skip a flaky test `create_tmpfs()` on s390x until a root cause is identified and fixed.
Fixes: #4248
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Remove the ability to block access to kata agent endpoints by using
agent-config.toml. That functionality is now implemented using the
Agent Policy feature (#7573).
The CCv0 branch relied on blocking endpoints using agent-config.toml
but will set-up an equivalent default policy file instead (#8219).
Fixes: #8228
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Add the missing closing bracket to the output of the TDX details,
so rather than:
```bash
$ sudo kata-ctl env 2>/dev/null | grep available_guest_protection
available_guest_protection = "tdx (major_version: 1, minor_version: 0"
: ^
: Missing ')' !
```
... we now have:
```bash
$ sudo kata-ctl env 2>/dev/null | grep available_guest_protection
available_guest_protection = "tdx (major_version: 1, minor_version: 0)"
: ^
: Aha!
```
Added a unit test for this scenario.
Fixes: #8257.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
If you attempt to create a container (a TD) on a TDX system using a
custom build of Cloud Hypervisor (CH) that was not built with the `tdx`
CH feature, Kata will report the following, somewhat cryptic, CH error:
```
ApiError(VmBoot(InvalidPayload))
```
Newer versions of CH now report their build-time features in the ping
API response message so we now use that, if available, to detect this
scenario and generate a user-friendly error message instead.
This changes improves the readability of `handle_guest_protection()` and
adds a couple of additional tests for that method.
Fixes: #8152.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Improve the way `handle_guest_protection()` is structured by inverting
the logic and checking the value of the `confidential_guest` setting
before checking the guest protection. This makes the code easier to
understand.
> **Notes:**
>
> - This change also unconditionally saves the available guest protection
> (where previously it was only saved when `confidential_guest=true`).
> This explains the minor unit test fix.
>
> - This changes also errors if the CH driver finds an unexpected
> protection (since only Intel TDX is currently tested).
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
ACPI PCI device hotplug on qemu virt is not supported. The only way to
hotplug pci device is pcie native way. Thus we need create pcie root
port as default.
Pcie root port number depends on following:
1. reserved one for network device as default;
2. virtio-mem dev;
3. add enough port for vhost user blk dev;
Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Add GetEndpointsNum API for Network Interface to get the number of
network endpoints. This is used for caculate the number of pcie root
port for QemuVirt.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
1. enable virtio-fs-pro in Dragonball to have the ability to process nydus backend registry
2. change passthrough for rw layer's readonly config to false to have the accurate read write ability.
Fixes:#8013
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Since Nydus snapshotter has been updated in previous commits, there is a
problem that the config passthrough to Dragonball during mount_rafs is
RafsConfig instead of ConfigV2, but Dragonball could only serde ConfigV2
so it will panic.
We need to add the support for RafsConfig
Fixes:#8013
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Allow access to the ReseedRandomDev endpoint by default. Using false
for ReseedRandomDevRequest was unintended.
Fixes: #8225
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Previously, if you accidentally modified the name of the hypervisor
section in the config file, the default golang runtime gives a cryptic
error message ("`VM memory cannot be zero`"). This can be demonstrated
using the `kata-runtime` utility program which uses the same golang
config package as the actual runtime (`containerd-shim-kata-v2`):
```bash
$ kata-runtime env >/dev/null; echo $?
0
$ sudo sed -i 's!^\[hypervisor\.qemu\]!\[hypervisor\.foo\]!g' /etc/kata-containers/configuration.toml
$ kata-runtime env >/dev/null; echo $?
VM memory cannot be zero
1
```
The hypervisor name is now validated so that the behaviour becomes:
```bash
$ kata-runtime env >/dev/null; echo $?
0
$ sudo sed -i 's!^\[hypervisor\.qemu\]!\[hypervisor\.foo\]!g' /etc/kata-containers/configuration.toml
$ ./kata-runtime env >/dev/null; echo $?
/etc/kata-containers/configuration.toml: configuration file contains invalid hypervisor section: "foo"
1
```
Fixes: #8212.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Improve the `GuestProtection` handling to detect the version of
Intel TDX available.
The TDX version is now logged by the Cloud Hypervisor driver.
Fixes: #8147.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Direct-volume needs to use the same base64 character set as
kata-runtime/direct-volume does.
Fixes: #8175
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This change adds support for adding and removing vfio devices for
cloud-hypervisor.
Fixes: #6691
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
To pick up fix for the following issue:
A maliciously crafted HTTP/2 stream could cause excessive CPU
consumption in the HPACK decoder, sufficient to cause a denial of
service from a small number of small requests.
Fixes: #8190
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Enable the Cloud Hypervisor driver (the `cloud-hypervisor` build feature) for the rust runtime.
Fixes: #6264.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This will help to build the agent binary as part of the kata-deploy
localbuild, as we need to pass the DESTDIR to where the agent will be
installed, and also whether we're building the agent with policy support
enabled or not.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We need to do proper sandbox sizing when we're doing cold-plug introduce CDI,
the de-facto standard for enabling devices in containers. containerd
will pass-through annotations for accumulated CPU,Memory and now CDI
devices. With that information sandbox sizing can be derived correctly.
Fixes: #7331
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Pause and resume task do not currently update the status of the
container to paused or running, so fix this. This is specifically for
pausing the task and not the VM.
Fixes#6434
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Allow Cloud Hypervisor to create a confidential guest (a TD or
"Trust Domain") rather than a VM (Virtual Machine) on Intel systems
that provide TDX functionality.
> **Notes:**
>
> - At least currently, when built with the `tdx` feature, Cloud Hypervisor
> cannot create a standard VM on a TDX capable system: it can only create
> a TD. This implies that on TDX capable systems, the Kata Configuration
> option `confidential_guest=` must be set to `true`. If it is not, Kata
> will detect this and display the following error:
>
> ```
> TDX guest protection available and must be used with Cloud Hypervisor (set 'confidential_guest=true')
> ```
>
> - This change expands the scope of the protection code, changing
> Intel TDX specific booleans to more generic "available guest protection"
> code that could be "none" or "TDX", or some other form of guest
> protection.
Fixes: #6448.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Introduce a few new constants (for PCI segment count and FS queues) and
move the disk queue constants to `convert.rs` to allow them to be used
there too.
> **Note:**
>
> This change gives the `ShareFs` code it's own set of values rather
> than relying on the disk queue constants.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Modify the Cloud Hypervisor `add_device()` method to add `ShareFs` and
`Network` devices to the list of pending devices since only these two
device types need to be cached before VM startup. Full details in the
comments.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the `VIRTIO_BLK_MMIO` check which appears to have been added
erroneously in the first place.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8057
Signed-off-by: Bo Chen <chen.bo@intel.com>
The cgroup stats come from resourcecontrol package in the form of pointers
to structs. The sandbox Stat() method incorrectly was expecting structs.
This caused the cpu and memory stats to always be 0, which in turn caused
incorrect pod overhead metrics.
Fixes#8035
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
Firecracker supports noflush semantic via Unsafe cache type.
There is no support for direct i/o, remove it from config file
Fixes: #7823
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Clh suports direct i/o for disks. It doesn't
offer any support for noflush, removed passing
of option to cloud-hypervisor internal config
Fixes: #7798
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Support quoted kernel command line parameters that include space
characters. Example:
dm-mod.create="dm-verity,,,ro,0 736328 verity 1
/dev/vda1 /dev/vda2 4096 4096 92041 0 sha256
f211b9f1921ef726d57a72bf82be23a510076639fa8549ade10f85e214e0ddb4
065c13dfb5b4e0af034685aa5442bddda47b17c182ee44ba55a373835d18a038"
Fixes: #8003
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This is part of a bigger effort to drop gogoprotobuff from our code
base. IIUC, those options are basically used by *pb_test.go, and since
we are dropping gogoprotobuff and those are auto generated tests, let's
just remove it.
Fixes#7978.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Users have noticed that this is needed, as CLH does not yet implement a
way to hotplug resources on aarh64.
With this patch, when building for x86_64, I can see the this is the
resulting config:
```
$ ARCH=amd64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=false
```
And when building for aarch64:
```
$ ARCH=arm64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=true
```
Fixes: #7941
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.
The HMP monitor allows to temper with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.
The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.
A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".
An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.
While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.
This feature is only made visible in the base and GPU configurations
of QEMU for now.
Fixes#7952
Signed-off-by: Greg Kurz <groug@kaod.org>
Otherwise `make test` will simply fail with:
```
error[E0583]: file not found for module `config`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This makes it pssible to run the tests in the cost free runners, which
are not KVM capable.
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Otherwise `make test` will simply fail with:
```
error[E0583]: file not found for module `version`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Otherwise `make test` will fail with:
```
error[E0583]: file not found for module `version`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This PR adds support for adding a network device before starting the
cloud-hypervisor VM.
Support for adding and removing network devices is not really added to
the resource manager, so supporting this for cloud-hypervisor is not
scoped in this PR.
This also changes "pending_devices" for clh implementation from an
Option of vector to simply a vector. This simplifies the structure a bit
as we can simple iterate over the pending devices instead of having to
check for a "Some" value as this is not really required.
Fixes: #6333
Signed-off-by: Shuaiyi Zhang <zhang_syi@qq.com>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
cloud hypervisor on arm64 only support arm AMBA UART(pl011) as
tty. So, the console should be set to "ttyAMA0" instead of "ttyS0"
when enable hypervisor debug mode.
Fixes: #5080
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
by enabling IOMMU on the default PCI segment. For hotplug to work we need a
virtualized iommu and clh exposes one if there is some device or PCI segment
that requests it. I would have preferred to add a separate PCI segment for
hotplugging vfio devices but unfortunately kata assumes there is only one
segment all over the place. See create_pci_root_bus_path(),
split_vfio_pci_option() and grep for '0000'.
Enabling the IOMMU on the default PCI segment requires passing enabling IOMMU on
every device that is attached to it, which is why it is sprinkled all over the
place.
CLH does not support IOMMU for VirtioFs, so I've added a non IOMMU segment for
that device.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
There is no way for this branch to be hit, as port is only set when it is
different than config.NoPort.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
These test cases shows which options are valid for CLH/Qemu, and test that we
correctly catch unsupported combinations.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The only supported options are hot_plug_vfio=root-port or no-port.
cold_plug_vfio not supported yet.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
hot_plug_vfio needs to be set to root-port, otherwise attaching vfio devices to
CLH VMs fails. Either cold_plug_vfio or hot_plug_vfio is required, and we have
not implemented support for cold_plug_vfio in CLH yet.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
In the RemoveEndpoints(), when the endpoints paramete isn't empty,
using idx may result in wrong endpoint removals. To improve,
directly passing the endpoint parameter helps
locate the correct elements within n.eps.
Fixes: #7732
Signed-off-by: shixuanqing <1356292400@qq.com>
Fixes: #7732
Signed-off-by: shixuanqing <1356292400@qq.com>
Update src/runtime/virtcontainers/network_linux.go
Co-authored-by: Xuewei Niu <justxuewei@apache.org>
- In rust 1.72, clippy warned clippy::non-minimal-cfg
as the cfg has only one condition, so doesn't
need to be wrapped in the any combinator.
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Allow `clippy::redundant-closure-call`
which has issues with the guard function passed into
the `run_if_auto_values` macro
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The bindgen generated code is triggering lots of
ambiguous-glob-reexports warnings in rust 1.70+
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- In rust 1.72, clippy warned clippy::non-minimal-cfg
as the cfg has only one condition, so doesn't
need to be wrapped in the all combinators.
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Allow `clippy::redundant-closure-call` in `from_cmdline`
which has issues with the guard function passed into
the `parse_cmdline_param` macro
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When creating a new endpoint, we check existing endpoint names and automatically adjust the naming of the new endpoint to ensure uniqueness.
Fixes: #7876
Signed-off-by: shixuanqing <1356292400@qq.com>
This commit allows us to specify the huge page backend when enabling huge
page. Currently, we support two backends: thp and hugetlbfs, the default
is hugetlbfs.
To ensure backward compatibility, we introduce another configuration item
"hugepage_type" to select the memory backend, which is available only when
"enable_hugepages" is true. Besides, we add an annotation
"io.katacontainers.config.hypervisor.hugepage_type" to configure huge page
type per pod.
Fixes: #6703
Signed-off-by: Guixiong Wei <weiguixiong@bytedance.com>
Signed-off-by: Yipeng Yin <yinyipeng@bytedance.com>
To support the removal of the `initcall_debug` and `earlyprintk=`
options from the default guest kernel cmdline, add `kernel_params` to the list
of enabled annotations to allow those kernel options (or others) to be
set using `kata-deploy` for either runtime.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Removed the following kernel command line options:
- `earlyprintk=ttyS0`
- `initcall_debug`
Both these options are only useful when debugging a guest kernel failure
which is not a common occurrence.
Further, the `earlyprintk=` option can have a large negative performance
impact (it can increase the VM boot time significantly).
If the user wishes to use either of these options, they can add them to the
`kernel_params=` setting in the Kata configuration file's hypervisor
stanza.
Fixes: #7886.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
1. Directly support CgroupManager::freeze through systemd API.
2. Avoid always passing unit_name by storing it into DBusClient.
3. Realize CgroupManager::destroy more accurately by killing systemd unit rather than stop it.
4. Ignore no such unit error when destroying systemd unit.
5. Update zbus version and corresponding interface file.
Acknowledgement: error handling for no such systemd unit error refers to
Fixes: #7080, #7142, #7143, #7166
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
This syntax belongs to the legacy C virtiofsd implementation that
we don't support anymore since kata-containers 3.1.3 because
of other API breaking changes.
People have been warned to switch from "none" to "never" since
kata-containers 2.5.2. Let's officially do that.
The compat code that would convert "none" to "never" isn't
needed anymore. Just drop it.
Fixes#7864
Signed-off-by: Greg Kurz <groug@kaod.org>
gogo.nullable is the main gogo.protobuf' feature used here. Since we are
trying to remove gogo.protobuf, the first reasonable step seems to be
remove this feature. This is a core update, and it will change how the
structs are defined. I could spot only a few places using those structs,
based on make check/build.
Fixes#7723.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
There is no reference to PROTO_FILE and this is not working. Also we are
not inside a Makefile, so makes sense to adapt the usage to reflect the
script instead of a make command.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
import_path is used as the default package when no input files specify
go_package. However, all the files we are currently building already
have a go_package definition, making this behavior both redundant and
error-prone.
Additionally, one of our files (types.pb.go) resides outside the grpc
directory, indicating that it's indeed ignored but also inconsistent.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Currently, the script searches for .proto files within $GOPATH/.
Consequently, modifications to a definition file in the current working
directory won't influence the output .pb.go if the directory is outside
of $GOPATH. For developers, it's more intuitive to alter the local
codebase than the version stored in $GOPATH.
With this modification, the generated .pb.go files will be relative to
the current working directory, removing the need to clone this project
under $GOPATH/src/github.com/kata-containers.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
The definitions are already specified in the .proto files using the
go_package option. Centralizing them in one location reduces the
potential for errors and simplifies the script.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Currently, the dbs-upcall features have 2 problems that are needed to be
fixed :
There are redundant dbs-upcall features that are needed to be removed.
Some place should be controlled by dbs-upcall but not being implemented.
This commit will fix those two problems.
fixes: #6878
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Some use cases may just require passing extra arguments to virtiofsd,
and having this disabled by default makes it impossible to set when
using kata-deploy, as changes in the configuration file would be
overwritten by the daemon-set.
With this in mind, let's allow users to pass whatever thet need (and
here I'm specifically looking at `--xattr`) as a virtio_fs_extra_arg.
Fixes: #7853
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Currently, virtio_vsock are still outside of the device
manager. This causes some management issues,such as the
inability to unify PCI address management.
Just do some work for hybrid vsock.
Fixes: #7655
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When a storage device is used by more than one container, the second
and forth instances will cause storage device reference count leakage,
thus cause storage device leakage. The reason is:
add_storages() will increase reference count of existing storage device,
but forget to add the device to the `mount_list` array, thus leak the
reference count.
Fixes: #7820
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Use AGENT_POLICY=yes when building the Guest images, and add a
permissive test policy to the k8s tests for:
- CBL-Mariner
- SEV
- SNP
- TDX
Also, add an example of policy rejecting ExecProcessRequest.
Fixes: #7667
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Let's make sure we use the TDX image as part of the QEMU TDX
configuration, which will help us to have the policies tested here.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Update the protection checking code to detect newer versions of Intel
TDX (whose userland interface has now stabilised).
> **Note:** that we don't need to retain the existing behaviour since:
>
> - We haven't yet landed the TDX feature (#6448).
> - Systems wishing to use TDX will need to use the latest available
> system components (such as firmware and host kernel).
Also added an explicit TDX unit test.
Fixes: #7384.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
IoCopy is a tricky function (I don't claim to fully understand its contract),
but here is what I see: The goroutine that runs it spawns 3 goroutines - one
for each stream to handle (stdin/stdout/stderr). The goroutine then waits for
the stream goroutines to exit. The idea is that when the process exits and is
closed, the stdout goroutine will be unblocked and close stdin - this should
unblock the stdin goroutine. The stderr goroutine will exit at the same time as
the stdout goroutine. The iocopy routine then closes all tty.io streams.
The problem is that the stdout goroutine decrements the WaitGroup before
closing the stdin stream, which causes the iocopy goroutine to race to close
the streams. Move the wg.Done() of the stdout routine past the close so that
*this* race becomes impossible. I can't guarantee that this doesn't affect some
unspecified behavior.
Fixes: #5031
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
If we are running FC hypervisor, it is not started when prestart hooks
are executed. So we should just ignore such error and just go ahead and
run the hooks.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
FC does not support network device hotplug. Let's add a check to fail
early when starting containers created by docker.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Add a new hypervisor capability to tell if it supports device hotplug.
If not, we should run prestart hooks before starting new VMs as nerdctl
is using the prestart hooks to set up netns. To make nerdctl + FC
to work, we need to run the prestart hooks before starting new VMs.
Fixes: #6384
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
When running on amd machines, those tests will fail because there is no
vmx flag. Following other tests that checks for cpuType, let's adapt
them to restrict vmx only on Intel machines.
Fixes#7788.
Related #5066
Signed-off-by: Beraldo Leal <bleal@redhat.com>
QEMU for TDX 1.5 makes use of private memory map/unmap.
Make changes to govmm to support this. Support for private backing fd
for memory is added as knob to the qemu config.
Userspace's map/unmap operations are done by fallocate() ioctl on the
backing store fd.
Reference:
https://lore.kernel.org/linux-mm/20220519153713.819591-1-chao.p.peng@linux.intel.com/Fixes: #7770
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The log_forwarder task does not check if the peer has closed, causing a
meaningless loop during the period of “kata vm exit”, when the peer
closed, and “ShutdownContainer RPC received” that aborts the log forwarder.
This patch fixes the problem.
Fixes: #7741
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
There are several processes for container exit:
- Non-detach mode: `Wait` request is sent by containerd, then
`wait_process()` will be called eventually.
- Detach mode: `Wait` request is not sent, the `wait_process()` won’t be
called.
- Killed by ctr: For example, a container runs `tail -f /dev/null`, and
is killed by `sudo ctr t kill -a -s SIGTERM <CID>`. Kill request is
sent, then `kill_process()` will be called. User executes `sudo ctr c
rm <CID>`, `Delete` request is sent, then `delete_process()` will be
called.
- Exited on its own: For example, a container runs `sleep 1s`. The
container’s state goes to `Stopped` after 1 second. User executes
the delete command as below.
Where do we do container cleanup things?
- `wait_process()`: No, because it won’t be called in detach mode.
- `delete_process()`: No, because it depends on when the user executes the
delete command.
- `run_io_wait()`: Yes. A container is considered exited once its IO ended.
And this always be called once a container is launched.
Fixes: #7713
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Refine storage related code by:
- remove the STORAGE_HANDLER_LIST
- define type alias
- move code near to its caller
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Introduce StorageDevice and StorageHandlerManager, which will be used
to refine storage device management for kata-agent.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Simplify the way to manage storage objects, and introduce
StorageStateCommon structures for coming extensions.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.
Make sure annotations are preferred over config options in image and initrd
path handling.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
We should make sure annotations are preferred over
config options in image and initrd path handling.
Fixes: #7705
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.
Add a helper function ImageOrInitrdAssetPath to make sure annotations
are preferred over config options in image and initrd path handling.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
When the FileMode field for the device is unset (0), use a default value instead
to allow the use of the device from the container.
This behaviour is seen from cri-o typically.
Note: this is what runc is doing, which is why regular containers don't have an
issue. This change makes sure kata behaves the same as runc.
Fixes: #7717
Signed-off-by: Julien Ropé <jrope@redhat.com>
Introduce structure KataVirtualVolume to to encapsulate information
for extra mount options and direct volumes, so we could build a common
infrastructure to handle these cases.
Fixes: #7699
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
When building with AGENT_POLICY=yes and AGENT_INIT=yes:
1. Include OPA and the Policy settings in rootfs.
2. Start OPA from the kata agent.
Before these changes, building with both AGENT_POLICY=yes and
AGENT_INIT=yes was unsupported.
Starting OPA from systemd (when AGENT_INIT=no) was already supported.
Fixes: #7615
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The error message when the kill command is executed with the container's
state == Stopped should be "container not running" because the containerd
tests expect that OCI runtimes return the error message and compare it.
If the error message is different from the expected one, the tests fail.
Fixes: #7650
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
We extend the `Result` and `Option` types with associated types that
allows converting a `Result<T, E>` and `Option<T>` into
`ttrpc::Result<T>`.
This allows the elimination of many `match` statements in favor of
calling the map function plus the `?` operator. This transformation
simplifies the code.
Fixes: #7624
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Fixes: #7573
To enable this feature, build your rootfs using AGENT_POLICY=yes. The
default is AGENT_POLICY=no.
Building rootfs using AGENT_POLICY=yes has the following effects:
1. The kata-opa service gets included in the Guest image.
2. The agent gets built using AGENT_POLICY=yes.
After this patch, the shim calls SetPolicy if and only if a Policy
annotation is attached to the sandbox/pod. When creating a sandbox/pod
that doesn't have an attached Policy annotation:
1. If the agent was built using AGENT_POLICY=yes, the new sandbox uses
the default agent settings, that might include a default Policy too.
2. If the agent was built using AGENT_POLICY=no, the new sandbox is
executed the same way as before this patch.
Any SetPolicy calls from the shim to the agent fail if the agent was
built using AGENT_POLICY=no.
If the agent was built using AGENT_POLICY=yes:
1. The agent reads the contents of a default policy file during sandbox
start-up.
2. The agent then connects to the OPA service on localhost and sends
the default policy to OPA.
3. If the shim calls SetPolicy:
a. The agent checks if SetPolicy is allowed by the current
policy (the current policy is typically the default policy
mentioned above).
b. If SetPolicy is allowed, the agent deletes the current policy
from OPA and replaces it with the new policy it received from
the shim.
A typical new policy from the shim doesn't allow any future SetPolicy
calls.
4. For every agent rpc API call, the agent asks OPA if that call
should be allowed. OPA allows or not a call based on the current
policy, the name of the agent API, and the API call's inputs. The
agent rejects any calls that are rejected by OPA.
When building using AGENT_POLICY_DEBUG=yes, additional Policy logging
gets enabled in the agent. In particular, information about the inputs
for agent rpc API calls is logged in /tmp/policy.txt, on the Guest VM.
These inputs can be useful for investigating API calls that might have
been rejected by the Policy. Examples:
1. Load a failing policy file test1.rego on a different machine:
opa run --server --addr 127.0.0.1:8181 test1.rego
2. Collect the API inputs from Guest's /tmp/policy.txt and test on the
machine where the failing policy has been loaded:
curl -X POST http://localhost:8181/v1/data/agent_policy/CreateContainerRequest \
--data-binary @test1-inputs.json
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the installation step in the virtcontainers doc
because the virtcontainers install/uninstall targets have
been removed by 86723b51ae
and they are not used anymore.
Fixes: #7637
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Remove configuration file shared_fs = none warnings
now that there is a solution to updating configMaps, secrets etc
Fixes: #7210
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
This patch allows copying of directories and symlinks when
static file copying is used between host and guest. This change is
necessary to support recursive file copying between shim and agent.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(cherry picked from commit de232b8030)
For remote hypervisor, the configmap, secrets, downward-api or project-volumes are
copied from host to guest. This patch watches for changes to the host files
and copies the changes to the guest.
Note that configmap updates takes significantly longer than updates via downward-api.
This is similar across runc and Kata runtimes.
Fixes: #7210
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: Julien Ropé <jrope@redhat.com>
(cherry picked from commit 3081cd5f8e)
(cherry picked from commit 68ec673bc4d9cd853eee51b21a0e91fcec149aad)
This patch upgrades Firecracker version from v1.1.0 to v1.4.0.
* Generate swagger models for v1.4.0 (from `firecracker.yaml`)
- The version of go-swagger used is v0.30.0
* The firecracker v1.4.0 includes the following changes.
- Added
* Added support for custom CPU templates allowing users to adjust vCPU features
exposed to the guest via CPUID, MSRs and ARM registers.
* Introduced V1N1 static CPU template for ARM to represent Neoverse V1 CPU
as Neoverse N1.
* Added support for the virtio-rng entropy device. The device is optional. A
single device can be enabled per VM using the /entropy endpoint.
* Added a cpu-template-helper tool for assisting with creating and managing
custom CPU templates.
- Changed
* Set FDP_EXCPTN_ONLY bit (CPUID.7h.0:EBX[6]) and ZERO_FCS_FDS bit
(CPUID.7h.0:EBX[13]) in Intel's CPUID normalization process.
- Fixed
* Fixed feature flags in T2S CPU template on Intel Ice Lake.
* Fixed CPUID leaf 0xb to be exposed to guests running on AMD host.
* Fixed a performance regression in the jailer logic for closing open file
descriptors.
* A race condition that has been identified between the API thread and the VMM
thread due to a misconfiguration of the api_event_fd.
* Fixed CPUID leaf 0x1 to disable perfmon and debug feature on x86 host.
* Fixed passing through cache information from host in CPUID leaf 0x80000006.
* Fixed the T2S CPU template to set the RRSBA bit of the IA32_ARCH_CAPABILITIES
MSR to 1 in accordance with an Intel microcode update.
* Fixed the T2CL CPU template to pass through the RSBA and RRSBA bits of the
IA32_ARCH_CAPABILITIES MSR from the host in accordance with an Intel microcode
update.
* Fixed passing through cache information from host in CPUID leaf 0x80000005.
* Fixed the T2A CPU template to disable SVM (nested virtualization).
* Fixed the T2A CPU template to set EferLmsleUnsupported bit
(CPUID.80000008h:EBX[20]), which indicates that EFER[LMSLE] is not supported.
Fixes: #7610
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Since the passed fd through unix socket would be any
stream fd such as pipe/fifo fd or any other socket
fd, thus we should deal with it as a normal hybrid
stream instead of a unix stream.
Fixes:#7584
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
There are many places where the code currently creates new `Vec`
instances when it's not really needed. The result is a perf hit because
it allocates memory, copies all elements, then frees the memory; in some
cases, copying elements also involves extra allocations (e.g., when
elements are strings, or structs containing strings).
This patch addresses a number of these cases.
Fixes: #7203
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Refine implementation of mount by:
- log message with `path.display()` instead of `{:?}`
- add prefix "_" to unused variables
- pass by reference instead of by value to avoid creating redundant
array
- exactly matching prefix "fsgid=" instead of "fsgid"
- avoid redundant clone() operations
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
There's a bug in function update_ephemeral_mounts() which only handles
the first storage object and ignores all other storage objects.
Fixes: #7551
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Simplify function online_cpu_memory() by on calling update_cpuset_path()
for containers with cpuset configured.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Refine style of code related to sandbox by:
- remove unnecessary comments for caller to take lock, we have already taken
`&mut self`.
- change "*count < 1 " to "*count == 0", `count` is type of u32.
- make remove_sandbox_storage() to take `&mut self` instead of `&self`.
- group related function to each others
- avoid search the map twice in function find_process()
- avoid unwrap() in function run_oom_event_monitor()
- avoid unwrap() in online_resources()
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Avoid unwrap() in function do_remove_container(), and also make
implmementation symmetric for both timeout and non-timeout cases.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Optimize agent rpc implementation by:
- avoid clone objects when possible
- avoid unwrap() when possible
- explictly drop object to ensure order
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
This pull request is mainly for updating vm-memory and vmm-sys-util.
The affacted crates include:
- vm-memory: from 0.9.0 to 0.10.0
- vmm-sys-util: from 0.10.0 to 0.11.0
- virtio-queue: from 0.6.0 to 0.7.0
- fuse-backend-rs: from 0.10.4 to 0.10.5
- linux-loader: from 0.6.0 to 0.8.0
- nydus-api: from 0.3.0 to 0.3.1
- nydus-rafs: from 0.3.1 to 0.3.2
- nydus-storage: from 0.6.3 to 0.6.4
Fixes: #0000
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
These calls cause two extra atomic instructions each time they're used,
one to increment and another one to decrement the refcount.
Since we don't need them because the referred value is guaranteed to
outlive the function, remove the calls.
Fixes: #7190
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
When the mounted block device isn't a layer, we want to mount it into
containers, but since it's already mounted with the correct fs (e.g.,
tar, ext4, etc.) in the pod, we just bind-mount it into the container.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
When at least one `io.katacontainers.fs-opt.layer` option is added to
the rootfs, it gets inserted into the VM as a layer, and the file system
is mounted as an overlay of all layers using the overlayfs driver.
Additionally, if the `io.katacontainers.fs-opt.block_device=file` option
is present in a layer, it is mounted as a block device backed by a file
on the host.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This causes the overlay-fs driver to add the `upperdir` and `workdir`
options to an overlay-fs mount so that the mount becomes writable using
a discardable directory under the container id.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This is so that file systems don't fail when we pass kata-specific
options from the snapshotter to kata.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Version 0.10.5, which was just released, breaks `nydus-storage`.
This is a workaround to fix the CI which is blocking other PRs.
Fixes: #7541
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Allow `clippy::redundant_clone` in the agent's unit tests
because rustc>=1.70 shows the errors as false-negatives.
These `clone()` are required because the following codes
refer to the variable, but the clippy analyzes them by mistake,
using the conservative and limited approach.
Ref. https://rust-lang.github.io/rust-clippy/master/index.html#/redundant_cloneFixes: #7534
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Kata containers as VM-based containers are allowed to run in the host
netns. That is, the network is able to isolate in the L2. The network
performance will benefit from this architecture, which eliminates as many
hops as possible. We called it a Directly Attachable Network (DAN for
short).
The network devices are placed at the host netns by the CNI plugins. The
configs are saved at {dan_conf}/{sandbox_id}.json in the format of JSON,
including device name, type, and network info. At the very beginning stage,
the DAN only supports host tap devices. More devices, like the DPDK, will
be supported in later versions.
The format of file looks like as below:
```json
{
"netns": "/path/to/netns",
"devices": [{
"name": "eth0",
"guest_mac": "xx:xx:xx:xx:xx",
"device": {
"type": "vhost-user",
"path": "/tmp/test",
"queue_num": 1,
"queue_size": 1
},
"network_info": {
"interface": {
"ip_addresses": ["192.168.0.1/24"],
"mtu": 1500,
"ntype": "tuntap",
"flags": 0
},
"routes": [{
"dest": "172.18.0.0/16",
"source": "172.18.0.1",
"gateway": "172.18.31.1",
"scope": 0,
"flags": 0
}],
"neighbors": [{
"ip_address": "192.168.0.3/16",
"device": "",
"state": 0,
"flags": 0,
"hardware_addr": "xx:xx:xx:xx:xx"
}]
}
}]
}
```
Fixes: #1922
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
If modeVFIO is enabled we need 1st to attach the VFIO control group
device /dev/vfio/vfio an 2nd the actuall device(s) afterwards.Sort the
devices starting with device #1 being the VFIO control group device and
the next the actuall device(s)
/dev/vfio/<group>
Fixes: #7493
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Multiple instances of task service may get registered by
ServiceManager::run(), fix it by making operation symmetric.
Fixes: #7479
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
The previous kata-monitor in golang could not communicate with runtime-rs
to gather metrics due to different sandbox addresses.
This PR adds the subcommand monitor in kata-ctl to gather metrics from
runtime-rs and monitor itself.
Fixes: #5017
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
Several functions in kata-ctl need to establish a connection with runtime-rs through MgmtClient.
This PR provides a global TIMEOUT to avoid multiple definitions.
Fixes: #5017
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
1. Implemented metrics collection for runtime-rs shim and dragonball hypervisor.
2. Described the current supported metrics in runtime-rs.(docs/design/kata-metrics-in-runtime-rs.md)
Fixes: #5017
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
use device manager to handle vm rootfs, after attach the block device of
vm rootfs, we need to increase index number
Fixes: #7119
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Change block index in SharedInfo to 0 for vda.
Fixes#7119
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove unused `mut` because the agent compilation fails
when the rust compiler is >= 1.71. This is related to #7425Fixes: #7438
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Now that we have propper AP device support add a
unit test for testing the correct Attach/Detach of AP devices.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Removing HotplugVFIOonRootBus which is obsolete with the latest PCI
topology changes, users can set cold_plug_vfio or hot_plug_vfio either
in the configuration.toml or via annotations.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The device.Bus was reset if a specific combination of
configuration parameters were not met. With the new
PCIe topology this should not happen anymore
Fixes: #7381
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
currently when fsGroup is used with direct-assign, kata agent
recursively changes ownership and permission for each file including
symlinks. However the problem with symlinks is, the permission of
the symlink itself may not be same as the underlying file. So while
doing recursive ownership and permission changes we should skip
symlinks.
Fixes: #7364
Signed-off-by: Alakesh Haloi <a_haloi@apple.com>
In order to make it easier for developers to contribute to Dragonball,
we decide to migrate all dragonball-sandbox crates to Kata.
fixes: #7262
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Allow runk to launch a container even though users don't specify the
pid namespace in `config.json` because general container runtimes
such as runc also can launch a container without the namespace.
On the other hand, Kata Containers doesn't allow it due to security issue
so this feature should be enabled in only runk.
Fixes: #7168
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Issue #4747 and pull request #4748 fix exec hang issues where the exec
command hangs when a process's stdout is not closed. However, the PR might
cause the exec command not to work as expected, leading to CI failure. The
PR was reverted in #7042. This PR resolves the exec hang issues and has
undergone 1000 rounds of testing to verify that it would not cause any CI
failures.
Fixes: #4747
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Add an extra parameter in `bind_mount_unchecked` to specify
the propagation type: "shared" or "slave".
Fixes: #7017
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
Since these have been added to kata-sys-util, remove these from
kata-ctl. Change all invocations to get platform protection to make use
of kata-sys-util.
Fixes: #7144
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Remove cpu related functions which have been moved to kata-sys-util.
Change invocations in kata-ctl to make use of functions now moved to
kata-sys-util.
Signed-off-by: Nathan Whyte <nathanwhyte35@gmail.com>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Make certain imports architecture specific as these are not used on all
architectures.
Move additional constants and functionality to cpu.rs.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Move get_single_cpu_info and get_cpu_flags into kata-sys-util.
Add new functions that get a list of flags and check if a flag
exists in that list.
Fixes#6383
Signed-off-by: Nathan Whyte <nathanwhyte35@gmail.com>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Currently, network endpoints are separate from the device manager
and need to be included for proper management. In order to do so,
we need to refactor the implementation of the network endpoints.
The first step is to restructure the NetworkConfig and NetworkDevice
structures.
Next, we will implement the virtio-net driver and add the Network
device to the Device Manager.
Finally, we'll unify entries with do_handle_device for each endpoint.
Fixes: #7215
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When running cargo test in container, test_mknod_dev may fail sometimes
because of "Operation not permitted". Change the device path to
"/dev/fifo-test" to avoid this case.
Fixes: #7284
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
1. Update memory end assert because address space layout differs between
x86 and arm.
2. Set guest_addr for aarch64 in test_handler_insert_region case.
Fixes: #7284
TODO: #7290
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
We've noticed this caused regressions with the k8s-oom tests, and then
decided to take a step back and do this in the same way it was done
before 67972ec48a.
Moreover, this step back is also more reasonable in terms of the
controlling logic.
And by doing this we can re-enable the k8s-oom.bats tests, which is done
as part of this PR.
Fixes: #7271
Depends-on: github.com/kata-containers/tests#5705
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Introduce tracing into runtime-rs, only some functions are instrumented.
Fixes: #5239
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Let's take the same approach of the go runtime, instead, and allocate
the maximum allowed number of vcpus instead.
Fixes: #7270
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 25d2fb0fde.
The reason we're reverting the commit is because it to check whether
it's the cause for the regression on devmapper tests.
Fixes: #7253
Depends-on: github.com/kata-containers/tests#5705
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Remove shadowed get_mounts(), added slog-term as a new crate,
slog can directly log to stdout and we can capture output
in the test-cases that are created in the function to be tested.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Using an initrd and setting KATA_INIT=yes meaning we're using the kata-agent
as the init process we need to make sure that the agent is not segfaulting
if mounts are already happened. Some workloads need to configure several
things in the initrd before the kata-agent starts which involves having
/proc or /sys already mounted.
Fixes: #6992
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Currently, even when using devmapper, if the VMM supports virtio-fs /
virtio-9p, that's used to share a few files between the host and the
guest.
This *needed*, as we need to share with the guest contents like secrets,
certificates, and configurations, via Kubernetes objects like configMaps
or secrets, and those are rotated and must be updated into the guest
whenever the rotation happens.
However, there are still use-cases users can live with just copying
those files into the guest at the pod creation time, and for those
there's absolutely no need to have a shared filesystem process running
with no extra obvious benefit, consuming memory and even increasing the
attack surface used by Kata Containers.
For the case mentioned above, we should allow users, making it very
clear which limitations it'll bring, to run Kata Containers with
devmapper without actually having to use a shared file system, which is
already the approach taken when using Firecracker as the VMM.
Fixes: #7207
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
There is nothing in them that requires them to be macros. Converting
them to functions allows for better error messages.
Fixes: #7201
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
There is nothing in it that requires it to be a macro. Converting it to
a function allows for better error messages.
Fixes: #7201
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Having a function allows for better error messages from the type checker
and it makes it clearer to callers what can happen. For example:
is_allowed!(req);
Gives no indication that it may result in an early return, and no simple
way for callers to modify the behaviour. It also makes it look like
ownership of `req` is being transferred.
On the other hand,
is_allowed(&req)?;
Indicates that `req` is being borrowed (immutably) and may fail. The
question mark indicates that the caller wants an early return on
failure.
Fixes: #7201
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Since it is never modified, it doesn't really need a lock of any kind.
Removing the `RwLock` wrapper allows us to remove all `.read().await`
calls when accessing it.
Additionally, `AGENT_CONFIG` already has a static lifetime, so there is
no need to wrap it in a ref-counted heap allocation.
Fixes: #5409
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Vfio support introduce build error on AArch64. Remove arch related
annotation can avoid this error.
Fixes: #7187
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
The failure mainly caused by the encoded volume path and
the mount/src. As the src will be validated with stat,but
it's not a full path and encoded, which causes the stat
mount source failed.
Fixes: #7186
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Unlike the previous usage which requires creating
/dev/xxx by mknod on the host, the new approach will
fully utilize the DirectVolume-related usage method,
and pass the spdk controller to vmm.
And a user guide about using the spdk volume when run
a kata-containers. it can be found in docs/how-to.
Fixes: #6526
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When running on a VM, the kernel parameter "unrestricted_guest" for
kernel module "kvm_intel" is not required. So, return success when running
on a VM without checking value of this kernel parameter.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Implement functionality to add to the env output if the host is capable
of running a VM.
Fixes: #6727
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
A new choice of using vfio devic based volume for kata-containers.
With the help of kata-ctl direct-volume, users are able to add a
specified device which is BDF or IOMMU group ID.
To help users to use it smoothly, A doc about howto added in
docs/how-to/how-to-run-kata-containers-with-kinds-of-Block-Volumes.
Fixes: #6525
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Limitations:
As no ready rust vmm's vfio manager is ready, it only supports
part of vfio in runtime-rs. And the left part is to call vmm
interfaces related to vfio add/remove.
So when vmm/vfio manager ready, a new PR will be pushed to
narrow the gap.
Fixes: #6525
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The `-o` option is the legacy way to configure virtiofsd, inherited
from the C implementation. The rust implementation honours it for
compatibility but it logs deprecation warnings.
Let's use the replacement options in the go shim code. Also drop
references to `-o` from the configuration TOML file.
Fixes#7111
Signed-off-by: Greg Kurz <groug@kaod.org>
The C implementation of virtiofsd had some kind of limited support
for remote POSIX locks that was causing some workflows to fail with
kata. Commit 432f9bea6e hard coded `-o no_posix_lock` in order
to enforce guest local POSIX locks and avoid the issues.
We've switched to the rust implementation of virtiofsd since then,
but it emits a warning about `-o` being deprecated.
According to https://gitlab.com/virtio-fs/virtiofsd/-/issues/53 :
The C implementation of the daemon has limited support for
remote POSIX locks, restricted exclusively to non-blocking
operations. We tried to implement the same level of
functionality in #2, but we finally decided against it because,
in practice most applications will fail if non-blocking
operations aren't supported.
Implementing support for non-blocking isn't trivial and will
probably require extending the kernel interface before we can
even start working on the daemon side.
There is thus no justification to pass `-o no_posix_lock` anymore.
Signed-off-by: Greg Kurz <groug@kaod.org>
The rust implementation of virtiofsd always runs foreground and
spits a deprecation warning when `-f` is passed.
Signed-off-by: Greg Kurz <groug@kaod.org>
If we override the cold, hot plug with an annotation
we need to reset the other plugging mechanism to NoPort
otherwise both will be enabled.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In Virt the vhost-user-block is an PCIe device so
we need to make sure to consider it as well. We're keeping
track of vhost-user-block devices and deduce the correct
amount of PCIe root ports.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Now it is possible to configure the PCIe topology via annotations
and addded a simple test, checking for Invalid and RootPort
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Removed the configuration of PCIeRootPort and PCIeSwitchPort, those
values can be deduced in createPCIeTopology
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Refactor the bus assignment so that the call to GetAllVFIODevicesFromIOMMUGroup
can be used by any module without affecting the topology.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The hypervisor_state file was the wrong location for the PCIe Port
settings, moved everything under device umbrella, where it can be
consumed more easily and we do not get into circular deps.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Kubernetes and Containerd will help calculate the Sandbox Size and pass it to
Kata Containers through annotations.
In order to accommodate this favorable change and be compatible with the past,
we have implemented the handling of the number of vCPUs in runtime-rs. This is
This is slightly different from the original runtime-go design.
This doc introduce how we handle vCPU size in runtime-rs.
Fixes: #5030
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
In this commit, we refactored the logic of static resource management.
We defined the sandbox size calculated from PodSandbox's annotation and
SingleContainer's spec as initial size, which will always be the sandbox
size when booting the VM.
The configuration static_sandbox_resource_mgmt controls whether we will
modify the sandbox size in the following container operation.
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Some vmms, such as dragonball, will actively help us
perform online cpu operations when doing cpu hotplug.
Under the old onlineCpuMem interface, it is difficult
to adapt to this situation.
So we modify the semantics of nb_cpus in onlineCpuMemRequest.
In the original semantics, nb_cpus represents the number of
newly added CPUs that need to be online. The modified
semantics become that the number of online CPUs in the guest
needs to be guaranteed.
Fixes: #5030
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
The declaration of the cpu number in the cpuset is greater
than the actual number of vcpus, which will cause an error when
updating the cgroup in the guest.
This problem is difficult to solve, so we temporarily clean up
the cpuset in the container spec before passing in the agent.
Fixes: #5030
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Update the resource when delete container, which is in
stop_process in runtime-rs.
Fixes: #5030
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Updating vCPU resources and memory resources of the sandbox and
updating cgroups on the host will always happening together, and
they are all updated based on the linux resources declarations of
all the containers.
So we merge update_cgroups into the update_linux_resources, so we
can better manage the resources allocated to one pod in the host.
Fixes: #5030
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Support vcpu resizing on runtime side:
1. Calculate vcpu numbers in resource_manager using all the containers'
linux_resources in the spec.
2. Call the hypervisor(vmm) to do the vcpu resize.
3. Call the agent to online vcpus.
Fixes: #5030
Signed-off-by: Ji-Xinyou <jerryji0414@outlook.com>
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Nobody has volunteered to maintain the (currently broken) snap build, so
remove it.
Fixes: #6769.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Adding vhost and vhost-net to the kernel modules. These do not require
any kernel module parameters to be checked. Currently, kernel params is
a required field. Make this as optional. Could make this as <Option>,
but making this a slice instead, as a module could have multiple kernel
params. Refactor the function that checks are for kernel modules into
two with one specifically checking if the module is loaded and other
checking for module parameters.
Refactor some of the tests to take into account these changes.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
We introduce virtio-balloon device to support memory resize.
virtio-balloon device could reclaim memory from guest to host.
Fixes: #6719
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
We introduce virtio-mem device to support memory resize. virtio-mem
device could hot-plug more memory blocks to guest and could also
hot-unplug them from guest.
Fixes: #6719
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
As block/direct volume use similar steps of device adding,
so making full use of block volume code is a better way to
handle direct volume.
the only different point is that direct volume will use
DirectVolume and get_volume_mount_info to parse mountinfo.json
from the direct volume path. That's to say, direct volume needs
the help of `kata-ctl direct-volume ...`.
Details seen at Advanced Topics:
[How to run Kata Containers with kinds of Block Volumes]
docs/how-to/how-to-run-kata-containers-with-kinds-of-Block-Volumes.md
Fixes: #5656
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In order to support virtio-mem and virtio-balloon devices, we need to
extend DeviceOpContext with VmConfigInfo and InstanceInfo.
Fixes: #6719
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
The key aspects of the DM implementation refactoring as below:
1. reduce duplicated code
Many scenarios have similar steps when adding devices. so to reduce
duplicated code, we should create a common method abstracted and use
it in various scenarios.
do_handle_device:
(1) new_device with DeviceConfig and return device_id;
(2) try_add_device with device_id and do really add device;
(3) return device info of device's info;
2. return full info of Device Trait get_device_info
replace the original type DeviceConfig with full info DeviceType.
3. refactor find_device method.
Fixes: #5656
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
After we have a guest kernel with builtin initramfs which
provide the rootfs measurement capability and Kata rootfs
image with hash device, we need set related root hash value
and measure config to the kernel params in kata configuration file.
Fixes: #6674
Signed-off-by: Wang, Arron <arron.wang@intel.com>
This PR updates the link to the correspondent Developer Guide at the
enabling full containerd debug that we have for kata 2.0 documentation.
Fixes#7034
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
When the version of libc is upgraded to 0.2.145, older getrandom could not adapt
to new API, and this will make agent-ctl fail to compile.
We upgrade the version of `rand`, so the low version of getrandom will no longer
need.
Fixes: #7032
Signed-off-by: Yushuo <y-shuo@linux.alibaba.com>
Fixes: #5401, #6654
- Switch kata-ctl from eprintln!()/println!() to structured logging via
the logging library which uses slog.
- Adds a new create_term_logger() library call which enables printing
log messages to the terminal via a less verbose / more human readable
terminal format with colors.
- Adds --log-level argument to select the minimum log level of printed messages.
- Adds --json-logging argument to switch to logging in JSON format.
Co-authored-by: Byron Marohn <byron.marohn@intel.com>
Co-authored-by: Luke Phillips <lucas.phillips@intel.com>
Signed-off-by: Jayant Singh <jayant.singh@intel.com>
Signed-off-by: Byron Marohn <byron.marohn@intel.com>
Signed-off-by: Luke Phillips <lucas.phillips@intel.com>
Signed-off-by: Kelby Madal-Hellmuth <kelby.madal-hellmuth@intel.com>
Signed-off-by: Liz Lawrens <liz.lawrens@intel.com>
In hypervisors that do not support virtiofs we have to copy files in
the VM sandbox to properly setup the network (resolv.conf, hosts, and hostname).
To do that, we construct the volume as before, with the addition of an extra
variable that designates the path where the file will reside in the sandbox.
In this case, we issue a `copy_file` agent request *and* we patch the spec
to account for this change.
Fixes: #6978
Signed-off-by: Anastassios Nanos <ananos@nubificus.co.uk>
Signed-off-by: George Pyrros <gpyrros@nubificus.co.uk>
When dragonball update dbs-boot crate in commit
64c764c147, the Cargo.lock in runtime-rs
should also be updated.
Fixes: #6969
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
This commit implements the vcpu_boot_onlined vector in get_fdt_vm_info.
"boot_enabled" means whether this vcpu should be onlined at first boot.
It will be used by fdt, which write an attribute called boot_enabled,
and will be handled by guest kernel to pass the correct cpu number to
function "bringup_nonboot_cpus".
Fixes: #6010
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
This commit add support of resize_vcpu on aarch64. As kvm will check
whether vgic is initialized when calling KVM_CREATE_VCPU ioctl, all the
vcpu fds should be created before vm is booted.
To support resizing vcpu scenario, we use max_vcpu_count for
create_vcpus and setup_interrupt_controller interfaces. The
SetVmConfiguration API will ensure max_vcpu_count >= boot_vcpu_count.
Fixes: #6010
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
dbs-boot-v0.4.0 refectors the create_fdt interface. It simplifies the
parameters needed to be passed and abstracts them into three structs.
By the way, it also reserves some interfaces for future feature: numa
passthrough and cache passthrough.
Fixes: #6969
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
Rewrite the comment of Vm::init_microvm method for aarch64.
Fixes cargo test warnings on aarch64.
Fixes: #6969
Signed-off-by: xuejun-xj <jiyunxue@linux.alibaba.com>
Move the get_volume_mount_info to kata-types/src/mount.rs.
If so, it becomes a common method of DirectVolumeMountInfo
and reduces duplicated code.
Fixes: #6701
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When run a exec process in backgroud without tty, the
exec will hang and didn't terminated.
For example:
crictl -i <container id> sh -c 'nohup tail -f /dev/null &'
Fixes: #4747
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
sandbox_bind_mounts supports kinds of mount patterns, for example:
(1) "/path/to", default readonly mode.
(2) "/path/to:ro", same as (1).
(3) "/path/to:rw", readwrite mode.
Both support configuration and annotation:
(1)[runtime]
sandbox_bind_mounts=["/path/to", "/path/to:rw", "/mnt/to:ro"]
(2) annotation will alse be supported, restricted as below:
io.katacontainers.config.runtime.sandbox_bind_mounts
= "/path/to /path/to:rw /mnt/to:ro"
Fixes: #6597
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
There is a race condition when virtiofsd is killed without finishing all
the clients. Because of that, when a pod is stopped, QEMU detects
virtiofsd is gone, which is legitimate.
Sending a SIGTERM first before killing could introduce some latency
during the shutdown.
Fixes#6757.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
use device type to store the config information for different kind of
devices
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Support remove device after container stop
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
support linux device in runtime-rs
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
support block volume in runtime-rs
Fixes: #5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
support devmapper for block rootfs
Fixes: #5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
As dragonball support hotplug virtio-mmio device, we should handle it in agent
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
add the trait implementation for vfio device,
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Support device manager for runtime-rs, add block device handler for
device manager
Fixes:#5375
Signed-off-by: Zhongtao Hu <zhongtaohu.tim@linux.alibaba.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch re-generates the client code for Cloud Hypervisor v32.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #6632
Signed-off-by: Bo Chen <chen.bo@intel.com>
If a hypervisor debug console is enabled and sandbox_cgroup_only is set,
the hypervisor can fail to open /dev/ptmx, which prevents the sandbox
from launching.
This is caused by the absence of a device cgroup entry to allow access
to /dev/ptmx. When sandbox_cgroup_only is not set, the hypervisor
inherits the default unrestrcited device cgroup, but with it enabled it
runs into allow / deny list restrictions.
Fix by adding an allowlist entry for /dev/ptmx when debug is enabled,
sandbox_cgroup_only is true, and no /dev/ptmx is already in the list of
devices.
Fixes: #6870
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
This PR updates the container network model url that is part of the
virtcontainers documentation.
Fixes#6889
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
When this option is enabled the runtime will attempt to determine the
appropriate sandbox size (memory, CPU) before booting the virtual
machine.
As TEEs do not support memory and CPU hotplug, this approach must be
used.
Fixes: #6818
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The AmdSev firmware package should be used with
measured direct boot. If the expected hashes are not
injected into the firmware binary by the VMM, the
guest will not boot. This is required for security.
Currently the main branch does not have the extended
shim support for SEV, which tells the VMM to inject
the expected hashes.
We ship the standard OVMF package to use with SNP,
so let's switch SEV to that for now. This will need
to be changed back when shim support for SEV(-ES)
is added to main.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
When updating an interface, there's maybe an existed
interface whose name would be the same with the updated
required name, thus it would update failed with interface
name existed error. Thus we should rename the existed interface
with an temporary name and swap it with the previouse interface
name last.
Fixes: #6842
Signed-off-by: fupan <fupan.lfp@antgroup.com>
We have been using the C version of virtiofsd on ppc64le. Now that the issue with
rust virtiofsd have been fixed, let's switch to it.
Fixes: #4259
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
Supports both online and offline modes of interaction with simple-kbs
for SEV/SEV-ES confidential guests.
Fixes: #6795
Signed-off-by: Dov Murik <dovmurik@linux.ibm.com>
The sev package provides utilities for launching AMD SEV and SEV-ES
confidential guests.
Fixes: #6795
Signed-off-by: Dov Murik <dovmurik@linux.ibm.com>
Let's specifically name the `gpu` runtime class as `nvidia-gpu`. By
doing this we keep the door open and ease the life of the next vendor
adding GPU support for Kata Containers.
Fixes: #6553
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Update the kata-ctl install rule to allow it to be installed to a given directory
The Makefile was updated to use an INSTALL_PATH variable to track where the
kata-ctl binary should be installed. If the user doesn't specify anything,
then it uses the default path that cargo uses. Otherwise, it will install it
in the directory that the user specified. The README.md file was also updated
to show how to use the new option.
Fixes#5403
Co-authored-by: Cesar Tamayo <cesar.tamayo@intel.com>
Co-authored-by: Kevin Mora Jimenez <kevin.mora.jimenez@intel.com>
Co-authored-by: Narendra Patel <narendra.g.patel@intel.com>
Co-authored-by: Ray Karrenbauer <ray.karrenbauer@intel.com>
Co-authored-by: Srinath Duraisamy <srinath.duraisamy@intel.com>
Signed-off-by: Narendra Patel <narendra.g.patel@intel.com>
Rework TestQemuCreateVM routine to be a table driven test with
various config variations passed to it. After CreateVM a handful
of additional functions are exercised to improve code-coverage.
Also add partial coverage for StartVM routine.
Currently improving from 19.7% to 35.7%
Credit PR to Hackathon Team3
Fixes: #267
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
With this fix the vCPU pinning feature chooses the correct
physical cores to pin the vCPU threads on rather than always using core 0.
Fixes#6831
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>