add_network_device() doesn't need to be passed NetworkInfo since it
already has access to the full HypervisorConfig.
Also, one of the goals of QemuCmdLine interface's design is to avoid
coupling between QemuCmdLine and the hypervisor crate's device module,
if at all possible. That's why add_network_device() shouldn't take
device module's NetworkConfig but just parts that are useful in
add_network_device()'s implementation.
Signed-off-by: Pavel Mores <pmores@redhat.com>
is_running_in_vm() is enough to figure out whether to disable_modern but
it's clumsy and verbose to use. should_disable_modern() streamlines the
usage by encapsulating the verbosity.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit replaces the existing NetDevice-based implementation with one
using Netdev and DeviceVirtioNet.
Signed-off-by: Pavel Mores <pmores@redhat.com>
In keeping with architecture of QemuCmdLine implementation we split the
functionality into two objects: Netdev to represent and generate the
-netdev part and DeviceVirtioNet for the -device virtio-net-<transport>
part.
This change is a pure refactor, existing functionality does not change.
However, we do remove some stub generalizations and govmm-isms, notably:
- we remove the NetDev enum since the only network interface types that
kata seems to use with qemu are tuntap and macvtap, both of which are
implemented by the same -netdev tap
- enum DeviceDriver is also left out since it doesn't seem reasonable to
try to represent VFIO NICs (which are completely different from
virtio-net ones) with the same struct as virtio-net
- we also remove VirtioTransport because there's no use for it so far, but
with the expectation that it will be added soon.
We also make struct Netdev the owner of any vhost-net and queue file
descriptors so that their lifetime is tied ultimately to the lifetime of
QemuCmdLine automatically, instead of returning the fds to the caller and
forcing it to achieve the equivalent functionality but manually.
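
The ownership idea can be sketched roughly as follows. This is only an illustration, not the actual runtime-rs code: the struct layout, field names and `qemu_params()` are hypothetical, and only the `fds=`/`vhost=on`/`vhostfds=` options of QEMU's `-netdev tap` are taken as given.

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

/// Hypothetical, simplified stand-in for the Netdev described above.
struct Netdev {
    id: String,
    // Owning the File objects keeps the descriptors open for as long as
    // this struct (and whatever owns it) is alive.
    queue_fds: Vec<File>,
    vhost_fds: Vec<File>,
}

/// Render a list of owned files as a colon-separated list of raw fds.
fn join_fds(files: &[File]) -> String {
    files
        .iter()
        .map(|f| f.as_raw_fd().to_string())
        .collect::<Vec<_>>()
        .join(":")
}

impl Netdev {
    /// Build the value of a `-netdev` argument from the owned fds.
    fn qemu_params(&self) -> String {
        let mut params = format!("tap,id={},fds={}", self.id, join_fds(&self.queue_fds));
        if !self.vhost_fds.is_empty() {
            params.push_str(&format!(",vhost=on,vhostfds={}", join_fds(&self.vhost_fds)));
        }
        params
    }
}
```

Since the files are dropped together with the struct, no caller has to remember to close them.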
Signed-off-by: Pavel Mores <pmores@redhat.com>
generate_netdev_fds() takes NetworkConfig from which it however only needs
a host-side network device name. This commit makes it take the device name
directly, making the function useful to callers who don't have the whole
NetworkConfig but do have the requisite device name.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The idea of this function is to make sure O_CLOEXEC is not set on file
descriptors that should be inherited by a child (=hypervisor) process.
The approach so far is however rather heavy-handed - clearing *all* flags
is unjustifiably aggressive for a low-level function with no knowledge of
context whatsoever.
This commit refactors the function so that it only does what's expected
and renames it accordingly. It also clarifies some of its call sites.
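
The difference between the two approaches, reduced to the flag arithmetic, might look like this. It is a sketch only: the function names are made up, and FD_CLOEXEC is hard-coded to its Linux value so that no external crate is needed. Real code would read the flags with fcntl(F_GETFD) and write the result back with fcntl(F_SETFD).

```rust
/// Value of FD_CLOEXEC on Linux (hard-coded here as an assumption so the
/// sketch does not depend on the libc or nix crates).
const FD_CLOEXEC: i32 = 1;

/// Heavy-handed variant: whatever flags were set, all of them are dropped.
fn clear_all_flags(_flags: i32) -> i32 {
    0
}

/// Targeted variant: only the close-on-exec bit is cleared, every other
/// bit is passed through unchanged.
fn clear_cloexec(flags: i32) -> i32 {
    flags & !FD_CLOEXEC
}
```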
Signed-off-by: Pavel Mores <pmores@redhat.com>
Kata CI has full debug output enabled for the cbl-mariner k8s tests,
and the test AKS node is relatively slow. So debug prints from policy
are expensive during CI.
Fixes: #9296
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change scripts and source that uses files in the tests repo to use the
corresponding file in the current repo.
Fixes: #9165
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Currently, `.lock().await.clone()` results in `Option<ImageService>` being duplicated in memory with each call to `singleton()`.
Consequently, if kata-agent receives numerous image pulling requests simultaneously,
it will lead to the allocation of multiple `Option<ImageService>` instances in memory, thereby consuming additional memory resources.
In image.rs, we introduce two public functions:
`merge_bundle_oci()` and `init_image_service()`. These functions will encapsulate
the operations on `IMAGE_SERVICE`, ensuring that its internal details remain
hidden from external modules such as `rpc.rs`.
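
A minimal sketch of the encapsulation, assuming a std `Mutex` in place of the async one the agent uses. `ImageService` here is a hypothetical stand-in, and `record_pull()` is an invented example of an operation that works on the service in place instead of cloning it.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the agent's ImageService.
struct ImageService {
    pulled: Vec<String>,
}

// The static stays private to this module; callers never see or clone it.
static IMAGE_SERVICE: Mutex<Option<ImageService>> = Mutex::new(None);

/// Create the service once; later calls leave the existing one in place.
pub fn init_image_service() {
    let mut guard = IMAGE_SERVICE.lock().unwrap();
    if guard.is_none() {
        *guard = Some(ImageService { pulled: Vec::new() });
    }
}

/// Invented example operation: mutate the service under the lock and
/// return how many images have been recorded so far.
pub fn record_pull(image: &str) -> Result<usize, String> {
    let mut guard = IMAGE_SERVICE.lock().unwrap();
    let service = guard.as_mut().ok_or("image service is not initialized")?;
    service.pulled.push(image.to_string());
    Ok(service.pulled.len())
}
```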
Fixes: #9225 -- part II
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
It is observed that virtiofsd exits immediately on s390x
if there are no attached console devices.
This commit resolves the issue by migrating `appendConsole()`
from the runtime and triggering it in `start_vm()`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
s390x requires an additional `memory-backend` option for `-machine`.
Otherwise, virtiofsd exits with HandleRequest(InvalidParam).
This commit adds a field `memory_backend` to `struct Machine`
and turns it on for s390x.
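
Roughly, the optional field could feed into the generated argument like this. The struct below is a simplified, hypothetical version of `struct Machine`, and the backend id `mem0` used in the test is made up.

```rust
/// Hypothetical, simplified version of `struct Machine`.
struct Machine {
    machine_type: String,
    // None means the option is left out of the command line entirely.
    memory_backend: Option<String>,
}

impl Machine {
    /// Produce the `-machine` flag and its comma-separated value.
    fn qemu_params(&self) -> Vec<String> {
        let mut value = self.machine_type.clone();
        if let Some(backend) = &self.memory_backend {
            value.push_str(&format!(",memory-backend={}", backend));
        }
        vec!["-machine".to_string(), value]
    }
}
```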
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Like nvdimm for x86_64, a block device for s390x should be
treated differently with `virtio-blk-ccw`.
This is to generate a QEMU command line parameter for a block
device by using `-blockdev` and `-device` if the `vm_rootfs_driver`
is set to `virtio-blk-ccw`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add in the full details once cloud-hypervisor/cloud-hypervisor#6103
has been implemented, and the feature is available in a Cloud Hypervisor
release.
Fixes: #8799
Signed-off-by: David Esparza <david.esparza.borquez@intel.com>
Currently, `*-pci` is used as an argument for the device config.
It is not true for a case where a different type of bus is used.
s390x uses `ccw`.
This commit makes it flexible to generate the device argument
based on the bus type. The structures `DeviceVhostUserFsPci` and
`VhostVsockPci` are renamed to `DeviceVhostUserFs` and `VhostVsock`
because the structure names are no longer bound to a certain bus type.
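
The bus-dependent naming can be illustrated with a small sketch. The `Bus` enum and `device_name()` helper are hypothetical names, not the ones used in the hypervisor crate.

```rust
/// Hypothetical bus type selector.
#[derive(Clone, Copy)]
enum Bus {
    Pci,
    Ccw,
}

impl Bus {
    /// Suffix QEMU expects at the end of the device name.
    fn suffix(self) -> &'static str {
        match self {
            Bus::Pci => "pci",
            Bus::Ccw => "ccw",
        }
    }
}

/// Combine a bus-independent base name with the bus suffix.
fn device_name(base: &str, bus: Bus) -> String {
    format!("{}-{}", base, bus.suffix())
}
```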
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It has been observed that the runtime stops running around
`sysinfo::total_memory()` while adjusting a config on s390x.
This is to update the crate to the latest version which happened
to resolve the issue. (No explicit release note for this)
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
MTRRs, or Memory-Type Range Registers, are a group of x86 MSRs that provide a way to control access
to and cacheability of physical memory regions.
During our tests with runtime-rs + Dragonball, we found that support for these registers is a must
for a passthrough GPU running CUDA applications: the GPU needs that information to use GPU memory properly.
fixes: #9310
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
When the https_proxy/no_proxy settings are configured alongside agent-policy enabled, the process of pulling image in the guest will hang.
This issue could stem from the instantiation of `reqwest`’s HTTP client at the time of agent-policy initialization,
potentially impacting the effectiveness of the proxy settings during image guest pulling.
Given that both functionalities use `reqwest`, it is advisable to set https_proxy/no_proxy prior to the initialization of agent-policy.
Fixes: #9212
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Enable building kata-agent with the PULL_TYPE feature.
We build kata-agent with guest-pull feature by default, with PULL_TYPE set to default.
This doesn't affect how kata shares images by virtio-fs. The snapshotter controls the image pulling in the guest.
Only the nydus snapshotter with proxy mode can activate this feature.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Support handling image-guest-pull block volumes from different CRIs, including cri-o and containerd.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
By default, the pause image and runtime config are provided
by the host side. This poses a potential security risk if the
host configures a malicious pause image, so we use the pause
image packaged in the rootfs instead.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Julien Ropé <jrope@redhat.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Add "guest-pull" feature option to determine that the related dependencies
would be compiled if the feature is enabled.
By default, agent would be built with default-pull feature, which would
support all pull types, including sharing images by virtio-fs and
pulling images in the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Support passing image information to the guest via KataVirtualVolumeImageGuestPullType
in KataVirtualVolume, which will be used to pull the image in the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
As we do not employ a forked containerd in confidential-containers, we utilize the KataVirtualVolume,
which stores the image information, as an integral part of `CreateContainer`.
Within this process, we store the image information in rootfs.storage and pass this image url through `CreateContainerRequest`.
This approach distinguishes itself from the use of `PullImageRequest`, as rootfs.storage is already set and initialized at this stage.
To maintain clarity and avoid any need for modification to the `OverlayfsHandler`, we introduce the `ImagePullHandler`.
This dedicated handler is responsible for orchestrating the image-pulling logic within the guest environment.
This logic encompasses tasks such as calling the image-rs to download and unpack the image into `/run/kata-containers/{container_id}/images`,
followed by a bind mount to `/run/kata-containers/{container_id}`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
When being passed an image name through a container annotation,
merge its corresponding bundle OCI specification and process into the passed container creation one.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Co-authored-by: jordan9500 <jordan.jackson@ibm.com>
Containerd supports setting a proxy for image downloads through an environment variable.
In the CC stack, image download is offloaded to the kata agent, so we need to support a similar feature.
Currently we add https_proxy and no_proxy; http_proxy is not added since it is insecure.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
With the image-rs pull_image API, the downloaded container image layers
are stored in IMAGE_RS_WORK_DIR, and the generated bundle dir with rootfs
and config.json is saved under the CONTAINER_BASE/cid directory.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Introduce structure ImageService, which will be used to pull images
inside the guest.
Fixes: #8103
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
co-authored-by: stevenhorsman <steven@uk.ibm.com>
File descriptors that are passed to QEMU need some special care.
We want them to be closed when the QEMU process is started. But
at the same time, it is required that the associated rust File
structures, either coming from the `std::fs` or the `tokio::fs`
crates, are still in scope when the QEMU process is forked. This
is currently achieved by keeping File structures in variables
at the outer scope of `start_vm()`. This scheme is currently
duplicated, with similar justifications in the corresponding
comments.
Consolidate all this handling in one place with a more generic
explanation.
Fixes: #9281
Signed-off-by: Greg Kurz <groug@kaod.org>
The agent now has a number of optional build-time features that can be
enabled.
Add details of these features to the following areas:
- Version output (`kata-agent --version`)
- Announce message (so that the details are always added to the journal
at agent startup).
- The response message returned by the ttRPC `GetGuestDetails()` API.
Fixes: #9285.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Some previous contributions were made without running cargo clippy.
Fix the dependency now so that it doesn't cause noise
in future contributions.
Signed-off-by: Greg Kurz <groug@kaod.org>
Fixes: #9269
From https://github.com/opencontainers/runtime-spec/blob/main/config.md#mounts
type (string, OPTIONAL) The type of the filesystem to be mounted.
bind may be only specified in the oci spec options -> flags update r#type
The agent will ignore bind mounts if they are only specified in the OCI spec options and not in the flags.
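
One way to read the fix is sketched below, under the assumption that a `bind` or `rbind` entry in the options should be enough to treat the mount as a bind mount even when `type` is absent. The helper name and signature are hypothetical.

```rust
/// Return the effective mount type: if the OCI options request a bind
/// mount, report "bind" even when the optional `type` field is unset.
fn effective_mount_type(mount_type: Option<&str>, options: &[&str]) -> String {
    let wants_bind = options.iter().any(|o| *o == "bind" || *o == "rbind");
    if wants_bind {
        "bind".to_string()
    } else {
        // `type` is OPTIONAL in the spec, so fall back to an empty string.
        mount_type.unwrap_or("").to_string()
    }
}
```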
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
s390x supports a different machine type `s390-ccw-virtio` and it is
not required to configure cpu features by default for the platform.
A hypervisor `dragonball` is not supported on s390x so that `DBCMD`
is not necessary. `vm_rootfs_driver` should be set to `virtio-blk-ccw`.
This commit is to set the architecture-specific flags for Makefile.
Fixes: #9158
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The guest_hook_path item in configuration.toml allows OCI hook scripts
to be executed within Kata's guest environment. Traditionally, these
guest hook programs are pre-built and included in Kata's guest rootfs
image at a fixed location.
While setting guest_hook_path = "/usr/share/oci/hooks" in configuration.toml
works, it lacks flexibility. Not all guest hooks reside in the path
/usr/share/oci/hooks, and users might have custom locations.
To address this, a more flexible and configurable approach is to be proposed
that allows users to specify their desired path. This could include using a
sandbox bind mount path for hooks specific to that particular container.
However, the current implementation of guest hooks and bind mounts in kata-agent
has a reversed order of execution compared to the desired behavior.
To achieve the intended functionality, we simply need to swap the order of their
implementation.
Fixes: #9274
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Fixes: #9267
The doc states we have support for all lifecycle hooks. There are still some missing.
This is the first issue regarding the CreateContainer hook which is run before pivot_root but after prestart and createruntime
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The open_named_tuntap function is designed as a public function to
open a tuntap device with the specified name. However, in order to
reference existing methods in dbs_utils, we still need to keep the
reference `path = "../../../dragonball/src/dbs_utils"` in dependencies
and cannot hide it.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add network helpers and impl ToQemuParams trait to build
netdev params which are put into cmdline for Qemu VM running.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need to ensure that add_network_device happens in the netns and
move the qemu process into the netns, which keeps the qemu process
running in this net namespace.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add a network device handler in start_vm, specifically
for a Qemu VM running with net params added to the command line.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need to add a new netns field in struct QemuInner, and
initialize it with argument passed down in prepare_vm().
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The enter_netns function is designed as a public method to help
VMMs running as an independent process enter a network namespace,
reducing duplicate code.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This just moves the related code to a public file (utils.rs) and makes
it a common method for both vsock and network, and possibly others.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Pavel Mores <pmores@redhat.com>
In order to better support non-builtin vmm usage of NetnsGuard and
reduce code duplication, we need to move it to a common path that
can be referenced by both hypervisor and resource manager.
This patch just moves the code from network/utils/netns.rs
to kata-sys-utils/src/netns.rs
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The PID needs to be initialized before calling isClhRunning.
waitVMM() uses isClhRunning and is called by launchClh() just
before returning from function.
Fixes: #9230
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Use containerd's default environment for container images that don't
specify the Env field.
Also, re-enable policy env variable verification, now that these
uncommon images are supported too.
Fixes: #9239
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This change updates the module import to use 'util' instead of the deprecated 'io/util'
Fixes: #9166
Signed-off-by: Chungeun Choi <ce.choi@okestro.com>
copyBuffer returns and the streams are closed when an error occurs.
If the error contains "blocked by policy", it means the log output is
disabled by a policy with "ReadStreamRequest" and "WriteStreamRequest" set
to false. But at this moment, we want the real stream to keep working (without
being seen) because we might want to enable logging for debugging purposes,
so we repeat copyBuffer in this case to avoid the streams being closed.
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
Logging/debugging information will probably be disabled in production
for security reasons, but we should provide a way for
customers to get logging information at runtime. This PR implements
a setpolicy function in the kata-runtime tools, although it can set the whole policy,
not just logging.
setpolicy triggers remote attestation, which means that before setting a
policy at runtime, the user has to configure the new policy hash in KBS/AS.
usage: kata-runtime policy set policy.rego --sandbox-id XXXXXXXX
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
Disable env variable verification to unblock CI, until container
images that don't specify the Env variables will be handled correctly
(see #9239).
Also, mark the image config Env field as optional, thus allowing
policy generation for these container images.
Fixes: #9240
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
a) Some unknown syscalls are triggered on the new GitHub virtual machines
that break the make test process with SIGSYS after applying
SeccompFilter. To fix this, we change the allowlist in this
unit test for SeccompFilter into a blocklist to avoid hitting the unknown syscalls.
b) The lazy static METRICS is not fully initialized in the unit test, which may lead to
unstable results for this UT.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
The mmap region start guest address is hard-coded, and later there
is a check whether that address is larger than or equal
to mem_end (which defaults to host_phy_mem >> 1) in order to satisfy the
requirement for DaxMemory. Since the GitHub virtual machines have more physical memory
than the previous CI machines we used, the hard-coded value no longer
works. To fix this, we change the address to mem_end in the unit test to
avoid being affected by host machine changes.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This fixes a panic on tracing on container exit.
The root cause is that global var needs to be set by "=" instead of
":=".
Fixes: #9102
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Add all agent configuration options to README so that users can more easily understand
what these options do and how to configure them at runtime.
Fixes: #9109
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Relax the timeout for calling CLH's CreateVM + BootVM APIs. When
hitting the older 1s timeout, killing a half-booted Guest and
retrying the same boot sequence could have been wasteful and resulted
in unstable CI testing on slower Hosts.
Fixes: #9152
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
For consistency with the go runtime.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
agent network metrics (kata_guest_netdev_stat).
Signed-off-by: Julien Ropé <jrope@redhat.com>
As part of the shim network metrics, the shim is reporting network interfaces
from the host with no namespace isolation - this gives insight in interfaces
not tied to the kata containers, and causes an increase in resource usage for
kata metrics.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
hypervisor network metrics (kata_hypervisor_netdev) and from the agent
network metrics (kata_guest_netdev_stat).
Fixes: #5738
Signed-off-by: Julien Ropé <jrope@redhat.com>
We need to initialize pci_hotplug_enabled to true before we do GPU
passthrough with runtime-rs/dragonball. Otherwise it fails with error
`InvalidOperation`.
Fixes: #9129
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When cgroup v2 is in use, a container should only see its part of the
unified hierarchy in `/sys/fs/cgroup`, not the full hierarchy created
at the OS level. Similarly, `/proc/self/cgroup` inside the container
should display `0::/`, rather than a full path such as:
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podde291f58_8f20_4d44_aa89_c9e538613d85.slice/crio-9e1823d09627f3c2d42f30d76f0d2933abdbc033a630aab732339c90334fbc5f.scope
What is needed here is isolation from the OS. Do that by running the
container in its own cgroup namespace. This matches what runc and
other non VM based runtimes do.
Fixes: #9124
Signed-off-by: Greg Kurz <groug@kaod.org>
In case an error is encountered while removing a network endpoint during
network cleanup, we currently return immediately with the error.
With this change, in case of error we simply log the error and proceed
towards removing the next endpoint. With this, we can clean up the
network changes made by the shim as much as possible.
This is especially important when multiple interfaces are passed to the
network namespace using a network plugin like multus.
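
The log-and-continue pattern, sketched outside of the actual shim code. `remove_endpoint()` is a hypothetical stand-in that fails for names starting with "bad", purely so the behaviour can be observed.

```rust
/// Hypothetical endpoint removal; fails for names starting with "bad".
fn remove_endpoint(name: &str) -> Result<(), String> {
    if name.starts_with("bad") {
        Err(format!("cannot remove {}", name))
    } else {
        Ok(())
    }
}

/// Try every endpoint; log failures and keep going instead of returning
/// on the first error. Returns the names that could not be removed.
fn remove_all_endpoints(names: &[&str]) -> Vec<String> {
    let mut failed = Vec::new();
    for &name in names {
        if let Err(err) = remove_endpoint(name) {
            eprintln!("failed to remove endpoint {}: {}", name, err);
            failed.push(name.to_string());
        }
    }
    failed
}
```

With an early return, `eth1` in a list such as `eth0, bad0, eth1` would never be cleaned up; here it is still processed.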
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Move the defer for cleaning up network before the call to add network.
This way, any change made by add network is reverted in case of
failure. This is particularly important for physical network interfaces,
as with this step we make sure that the driver for the physical interface is
reverted back to the original host driver. Without this, the physical
network interface will remain bound to vfio.
Fixes: #8646
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Although we don't seem to be affected by
https://nvd.nist.gov/vuln/detail/CVE-2024-21626, we vendor and use the
runc package in a few different places of our code, and we better update
the package to its latest release.
Fixes: #9097
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Last but not least, all placeholders for argument replacement
should be configured to generate a configuration file when `QEMUCMD`
is defined. This enriches those variables.
Additionally, this involves creating a symbolic link to `configuration-qemu.toml`
if QEMU is defined as the default hypervisor.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
There are some variables newly introduced to runtime-rs, such as:
- runtime.name
- runtime.hypervisor_name
- runtime.agent_name
- vm_rootfs_driver
Additionally some of the placeholders for argument replacement are
made hypervisor-specific based on the changes made for dragonball.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For example, Kata CI's k8s-copy-file.bats transfers files between the
Host and the Guest using "kubectl exec", and that results in
CloseStdinRequest being called from the Host.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
1. Remove PullImageRequest because that is not used in the main
branch. It was used in the CCv0 branch.
2. Add default false values for the remaining Kata Agent ttrpc
requests.
These changes don't change the functionality of the auto generated
Policy, but they help with easier understanding the Policy text and
the logging from the Rego rules.
Fixes: #9049
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since cri-o doesn't seem to use address for event publishing as mentioned
in the previous commit it will not send it. However, the exact way of
not sending it is unfortunately different from what is assumed by
runtime-rs. Due to an implementation detail of cri-o which uses containerd
libraries for some low-level tasks, TTRPC_ADDRESS will not be missing from
environment as assumed, instead it will be present with an empty value.
This commit contains a small adjustment to account for that and use
LogForwarder even if TTRPC_ADDRESS is present, but with an empty value.
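
The adjusted check could look something like the sketch below. The function name is hypothetical; `lookup` is meant to be the result of something like `std::env::var("TTRPC_ADDRESS")`, and a `None` result corresponds to the case where LogForwarder is used.

```rust
use std::env::VarError;

/// Decide whether there is a usable event-publishing address.
fn publish_address(lookup: Result<String, VarError>) -> Option<String> {
    match lookup {
        Ok(addr) if !addr.is_empty() => Some(addr),
        // Missing, not valid unicode (an assumption of this sketch), or
        // present but empty: all are treated as "no address".
        _ => None,
    }
}
```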
Fixes: #8985
Signed-off-by: Pavel Mores <pmores@redhat.com>
This is needed to fix the bug which is not allowing to create SEV container
on SNP enabled host anymore. This is a regression that was introduced as
part of the following commit:
de39fb7d38
Fixes: #9036
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
It makes sense to reuse a configuration template for runtime-golang
as a base. This is simply to copy it into the config directory.
Fixes: #8441
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It appears that under the shim v2 protocol, a shim has no use of its own
for the -address value, it just passes it back to container runtime's
(mostly containerd or cri-o) event-publishing binary. Since the -address
value only flows through the shim, being passed to the shim by a container
runtime and then essentially passed back by shim to the container runtime,
it seems inappropriate for a shim to validate the value that is fully
owned and only used by the container runtime.
This commit removes such validation from runtime-rs. Doing so, it solves
(part of) an interoperability problem between runtime-rs and cri-o. cri-o
seems to intentionally choose not to implement the event-publishing part
of the shim v2 protocol and thus it has no value it could pass to
runtime-rs for -address. As a result, it sends an empty string which has
been failing the excessive validation performed by runtime-rs so far.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The emergent Kata CI tests for Policy use confidential_guest = false
in genpolicy-settings.json. That value is inconsistent with the
following mount settings:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
We need to keep those settings for confidential_guest = true, and
change confidential_guest = false to use:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/rootfs/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
The value of the mount_source field is different.
This change unblocks testing using Kata CI's pod-empty-dir.yaml:
genpolicy -u -y pod-empty-dir.yaml
kubectl apply -f pod-empty-dir.yaml
k get pod sharevol-kata
NAME READY STATUS RESTARTS AGE
sharevol-kata 1/1 Running 0 53s
Fixes: #8887
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Implementing Persist API for cloud-hypervisor was done partially with
initial support for cloud-hypervisor. Store and retrieve additional
fields to/from the hypervisor state.
Fixes: #6202
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The original handling does not meet user expectations.
When the ClientSocketAddress method stats the corresponding
path of runtime-rs and does not find it either, we should return
an error message that includes the reason for the failure
(which should indicate that neither the runtime-go
nor the runtime-rs path was found), instead of simply displaying the
corresponding path of runtime-rs as the final error message to
users.
It is also necessary to return the error promptly to the caller
for further error handling.
Fixes: #8999
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Now that we have a confidential image / initrd being built, instead of a
specific one for each TEE, let's use it everywhere possible.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debuggability's sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debuggability's sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debuggability's sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With this we can properly generate and use the `-confidential` kernel,
which supports SEV / SNP / TDX as part of our configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Until now, runtime-rs couldn't be compiled on s390x.
We need to lift those restrictions in Makefile first.
Fixes: #8446
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It fails to compile virt_container because Dragonball is only
used in the implementation of the trait method Persist::restore().
As the hypervisor is not compiled on s390x and QEMU implements
the trait method, this commit lets the method use QEMU's.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Dragonball and cloud-hypervisor are not supported on s390x.
We need to exclude the plugins for these hypervisors from compilation.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This patch can reduce load on systemd process, and
increase the k8s deployment density when using go runtime.
Fixes: #8758
Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Kata CI's pod-sandbox-vcpus-allocation.yaml ends with "---", so the
empty YAML document following that line should be ignored.
To test this fix:
genpolicy -u -y pod-sandbox-vcpus-allocation.yaml
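
The intended behaviour can be illustrated with a deliberately naive splitter. This is not how genpolicy parses YAML; it only shows that a trailing `---` yields an empty document that should be skipped. The function name is hypothetical.

```rust
/// Split a multi-document YAML string on `---` separator lines and drop
/// documents that contain nothing but whitespace.
fn non_empty_documents(yaml: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current = String::new();
    for line in yaml.lines() {
        if line.trim_end() == "---" {
            // Separator: close the current document and start a new one.
            docs.push(std::mem::take(&mut current));
        } else {
            current.push_str(line);
            current.push('\n');
        }
    }
    docs.push(current);
    docs.into_iter().filter(|d| !d.trim().is_empty()).collect()
}
```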
Fixes: #8895
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Allow users to specify in genpolicy-settings.json a default cluster
namespace other than "default". For example, Kata CI uses as default
namespace: "kata-containers-k8s-tests".
Fixes: #8976
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
There is a race condition in the agent's HVSOCK_STREAMS hashmap, where a
stream may be taken before it is inserted into the hashmap. This patch
adds simple retry logic to the stream consumer to alleviate this issue.
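
A sketch of such consumer-side retry logic, using a std `Mutex` and `thread::sleep` as stand-ins for the async primitives the agent uses. The function name, key type and retry parameters are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;
use std::time::Duration;

/// Take `key` out of the shared map, retrying a few times in case the
/// producer has not inserted the entry yet.
fn take_with_retry<V>(
    streams: &Mutex<HashMap<u32, V>>,
    key: u32,
    attempts: u32,
    delay: Duration,
) -> Option<V> {
    for attempt in 0..attempts {
        // The lock guard is a temporary and is released at the end of
        // this statement, so the producer can insert while we sleep.
        let taken = streams.lock().unwrap().remove(&key);
        if let Some(value) = taken {
            return Some(value);
        }
        if attempt + 1 < attempts {
            thread::sleep(delay);
        }
    }
    None
}
```

This narrows the race window rather than removing it: if the producer takes longer than `attempts * delay`, the consumer still gives up.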
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Linux forbids opening an existing socket through /proc/<pid>/fd/<fd>,
making some images relying on the special file /dev/stdout(stderr),
/proc/self/fd/1(2) fail to boot in passfd io mode, where the
stdout/stderr of a container process is a vsock socket.
For backward compatibility, a pipe is introduced between the process
and the socket, and its write end is set as stdout/stderr of the
container process instead of the socket. The agent does the
forwarding between the pipe and the socket.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
In passfd io mode, when not using a terminal, the stdout/stderr vsock
streams are directly used as the stdout/stderr of the child process.
These streams are non-blocking by default.
The stdout/stderr of the process should be blocking, otherwise
the process may encounter EAGAIN error when writing to stdout/stderr.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
We want the io connection to stay connected when containerd closes
the io pipe, so that the io stream can be attached to again.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Support the hybrid fd passthrough mode when passing a pipe fd.
The caller can request that the connection be kept even when the pipe
peer closes, and the connection can be regained by re-opening
the pipe.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
On Linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic
when creating containers in detached mode, as the stdin FIFO writer
is closed after the container is created, resulting in this situation.
In passfd io mode, open stdin fifo with O_RDWR|O_NONBLOCK to avoid the
HUP event.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When a container exits, the agent should clean up the term master fd,
otherwise the fd will be leaked.
Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When one end of the connection closes, the epoll event is triggered
forever. We should close the connection and tear it down.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Currently in Kata Containers, every io read/write operation requires
an RPC request from the runtime to the agent. This process involves
copying data into/from an RPC request/response, which has high overhead.
To solve this issue, this commit utilizes vsock fd passthrough, a
newly introduced feature in the Dragonball hypervisor. This feature
allows other host programs to pass a file descriptor to the Dragonball
process, directly as the backend of an ordinary hybrid vsock connection.
The runtime-rs now utilizes this feature for container process io. It
opens the stdin/stdout/stderr fifos from containerd, passes them to
Dragonball, and then no longer handles process io, eliminating
the need for an RPC for each io read/write operation.
In passfd io mode, the agent uses the vsock connections as the child
process's stdin/stdout/stderr, eliminating the need for a pipe
to pump data (in non-tty mode).
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Two toml options, `use_passfd_io` and `passfd_listener_port` are introduced
to enable and configure dragonball's vsock fd passthrough io feature.
This commit is a preparation for vsock fd passthrough io feature.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Using custom input paths with -i is counter-intuitive. Simplify path handling with explicit flags for rules.rego and genpolicy-settings.json.
Fixes: #8568
Signed-Off-By: Malte Poll <1780588+malt3@users.noreply.github.com>
When creating a cgroup, add a SingleContainer when obtaining the OCI Spec to apply to ctr, podman, etc.
Fixes: #5240
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Removed the setting of default values for runtime fields. Added explicit checks for missing or empty fields, reporting errors with clear messages.
Fixes: #8838
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
The `noop-method-call` is a rustc lint that has existed since v1.52.0.
This lint has been moved to the warn by default lint level since v1.73.0.
Therefore build is failing with this version and above.
This commit removes the unnecessary call to `<&T as Deref>::deref` on `T: !Deref`.
Fixes: #8586
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
The auto-generated Policy already allows these volumes to be mounted,
regardless of whether they are:
- Present, or
- Missing and optional
Fixes: #8893
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Qemu stderr monitoring runs in its own asynchronous green thread.
For that, `stderr` is taken out of the Child representing the qemu child
process to avoid partial move and make it possible for the main thread
still to call functions on QemuInner::qemu_process (e.g. kill(), id()).
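The partial-move problem and the `take()` idiom can be sketched as follows. This is an illustration only: it uses `std::process` and an OS thread instead of tokio and a green thread so that it stays dependency-free, and `run_and_capture_stderr` is a hypothetical helper, not a runtime-rs function.

```rust
use std::io::Read;
use std::process::{Command, Stdio};
use std::thread;

/// Spawn `program`, drain its stderr on a separate thread, and return the
/// child's PID together with everything it wrote to stderr.
pub fn run_and_capture_stderr(program: &str, args: &[&str]) -> std::io::Result<(u32, String)> {
    let mut child = Command::new(program)
        .args(args)
        .stderr(Stdio::piped())
        .spawn()?;

    // `take()` moves the pipe out of the Option and leaves `None` behind, so
    // `child` is not partially moved and stays fully usable below.
    let mut stderr = child.stderr.take().expect("stderr was requested as piped");
    let reader = thread::spawn(move || {
        let mut buf = String::new();
        stderr.read_to_string(&mut buf).map(|_| buf)
    });

    let pid = child.id(); // still callable after the take()
    child.wait()?;
    let output = reader.join().expect("stderr reader thread panicked")?;
    Ok((pid, output))
}
```

Moving `child.stderr` directly into the reader would leave `child` partially moved, and later calls such as `child.wait()` would no longer compile.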
Fixes: #8937
Signed-off-by: Pavel Mores <pmores@redhat.com>
We'll want to capture qemu's stderr in parallel with normal runtime-rs
execution. Tokio's primitives make this much easier than std's. This
also makes child process management more consistent across runtime-rs
(i.e. virtiofsd child process is already launched and managed using tokio).
Some changes were necessary due to tokio functions being slightly different
from their std counterparts. Child::kill() is now async and Child::id()
now returns an Option.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Allow Kata CI's pod-nested-configmap-secret.yaml to work with
genpolicy and current cbl-mariner images:
1. Ignore the optional type field of Secret input YAML files.
It's possible that CoCo will need a more sophisticated Policy
for Secrets, but this change at least unblocks CI testing for
already-existing genpolicy features.
2. Adapt the value of the settings field below to fit current CI
images for testing on cbl-mariner Hosts:
"kata_config": {
"confidential_guest": false
},
Switching this value from true to false instructs genpolicy to
expect ConfigMap volume mounts similar to:
"configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "^$(cpath)/watchable/$(bundle-id)-[a-z0-9]{16}-",
"driver": "watchable-bind",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
},
instead of:
"confidential_configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "$(sfprefix)",
"driver": "local",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
}
},
This settings change unblocks CI testing for ConfigMaps.
Simple sanity testing for these changes:
genpolicy -u -y pod-nested-configmap-secret.yaml
kubectl apply -f pod-nested-configmap-secret.yaml
kubectl get pods | grep config
nested-configmap-secret-pod 1/1 Running 0 26s
Fixes: #8892
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This fixes a flaw pointed out in review of PR #8185. Creation of the
directory semantically fits better into VM preparation than VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Validating the node name is currently outside the scope of the CoCo
policy.
This change unblocks testing using Kata CI's test-pod-file-volume.yaml
and pv-pod.yaml.
Fixes: #8888
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the unused DriverInfo declaration or integrate it into the codebase where applicable.
Fixes: #8927
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Add metadata containing the Policy annotation if the user didn't
provide any metadata in the input yaml file.
For a simple sanity test using a Kata CI YAML file:
genpolicy -u -y job.yaml
kubectl apply -f job.yaml
kubectl get pods | grep job
job-pi-test-64dxs 0/1 Completed 0 14s
Fixes: #8891
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The SharedVersions field adds a versiontable property that isn't supported by upstream QEMU.
This is dead code since virtcontainers isn't setting SharedVersions to true.
Fixes: #7720
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
Ignore pod DNS settings because policing the network traffic is
currently outside the scope of the Agent Policy.
Example from Kata CI: pod-custom-dns.yaml
Fixes: #8832
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Deploy the framework added by the previous commit to generate qemu
command line and launch the VM.
We now properly store the child process object which allows us to
implement remaining Hypervisor functions necessary for a simple but
successful VM lifecycle, get_vmm_master_tid() and stop_vm().
Fixes: #8184
Signed-off-by: Pavel Mores <pmores@redhat.com>
- test_volume_capacity_stats: verify the file block size against the fetched size via statfs()
- test_reseed_rng: Correct the request codes for RNDADDTOENTCNT and RNDRESEEDCRNG when platform is ppc64le
- test list_routes: Add the route only if destination is not empty
- test_new_fs_manager: skip the test if cgroups v2 is used by default
- skip test cases rpc::tests::test_do_write_stream, sandbox::tests::test_find_process, sandbox::tests::test_find_container_process and sandbox::tests::add_and_get_container on ppc64le as they are flaky
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
kata-ctl currently fails to build on ppc64le. Skip it when running static checks; the issues will be fixed and tracked in a separate issue.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
A few CPU related test cases were failing as the version was being verified against Power8 while the CI machine is Power9.
Fixes: #5531
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
At the moment, the projects `dragonball` and `runtime-rs` do not support
s390x. During the enablement, some errors due to the misconfiguration
of the Makefile for `make check` and `make vendor` were identified.
This commit skips the build for the affected targets of these projects.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Some linting errors were identified during the enablement of `make check`.
These were not found by the Jenkins CI job because only `make test` was
triggered.
The errors for the `agent` occur in the s390x-specific tests, while
the ones for `kata-ctl` are in architecture-specific code.
This commit fixes those errors.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The idea of most of these is just to prevent running into todo!()s where
we can at the moment, while implementing the fundamental functionality of
VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
DirectVolume/Rawblock doesn't work well when device's block driver
is virtio-blk-pci and the storage handler is DRIVER_BLK_PCI_TYPE.
Fixes: #8707
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Temporarily disable the allow_storages() rules, because they are based
on the tarfs snapshotter + container image integrity information that
are not available yet in the main branch - see #8833.
Fixes: #8834
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Adjust genpolicy-settings.json to match the container root path from
the main branch + cbl-mariner Guest VMs.
This configuration might have to be adjusted again when other types of
Guest VMs will be tested during CI using genpolicy, in the future.
Also, improve logging from allow_root_path() to make these
issues easier to debug in the future.
Fixes: #8835
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Replace the `todo!()` calls with a minimal NOP implementation to return
the CH driver to working order since the `todo!()`'s forcibly crash the
driver at runtime. Full implementations for these APIs will be added on
issues #8800, #8801, and #8802.
Fixes: #8784.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the `todo!()` macro which would cause a runtime crash and replace
with a implementation that returns an error as a stop-gap until #8800 is
implemented.
Fixes: #8785.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Forwarding events via the containerd CLI is somewhat heavy for runtime-rs
compared to the ttrpc way. Plus, for runtimes that don't have
this mechanism, e.g. CRI-O, we can't get those events anywhere.
This patch introduces two types of forwarders:
- `ContainerdForwarder`: Acquire ttrpc address from environment variables
and forward events via ttrpc connection.
- `LogForwarder`: Write event info into logs.
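The split between the two forwarders might look roughly like this. The `Event` struct, the `EventForwarder` trait and `new_forwarder()` are hypothetical names for illustration; the ttrpc address is passed in as a parameter rather than read from the environment, and `forward()` only returns a description instead of doing real I/O.

```rust
/// Hypothetical event type; real events carry more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub topic: String,
    pub container_id: String,
}

/// Common interface shared by both forwarding strategies.
pub trait EventForwarder {
    fn name(&self) -> &'static str;
    /// Returns a description of the action, standing in for the real I/O.
    fn forward(&self, event: &Event) -> String;
}

pub struct ContainerdForwarder {
    address: String,
}

pub struct LogForwarder;

impl EventForwarder for ContainerdForwarder {
    fn name(&self) -> &'static str {
        "containerd"
    }
    fn forward(&self, event: &Event) -> String {
        format!("send {} for {} to {}", event.topic, event.container_id, self.address)
    }
}

impl EventForwarder for LogForwarder {
    fn name(&self) -> &'static str {
        "log"
    }
    fn forward(&self, event: &Event) -> String {
        format!("log {} for {}", event.topic, event.container_id)
    }
}

/// Pick the ttrpc forwarder when an address is available, else fall back to logs.
pub fn new_forwarder(ttrpc_address: Option<&str>) -> Box<dyn EventForwarder> {
    match ttrpc_address {
        Some(addr) if !addr.is_empty() => Box::new(ContainerdForwarder {
            address: addr.to_string(),
        }),
        _ => Box::new(LogForwarder),
    }
}
```

Falling back to the log forwarder means that runtimes without a ttrpc address still leave a record of each event.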
Fixes: #7881
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This is not necessarily meant to work, just to stub out unimplemented
functionality while focusing on more fundamental things.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The agent registers an event fd in `memory.oom_control`. An OOM event is
forwarded to containerd when the event is emitted, regardless of the
content in that file.
I observed content indicating that events should not be forwarded, as shown
below. When `oom_kill` is set to 0, it means no OOM has occurred. Therefore,
it is important to check the content to avoid mistakenly forwarding OOM
events.
```
oom_kill_disable 0
under_oom 0
oom_kill 0
```
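A minimal sketch of such a content check, assuming the file has the `key value` layout shown above. The helper names are hypothetical, and treating a missing `oom_kill` line as "do not forward" is an assumption of this sketch, not necessarily what the agent does.

```rust
/// Extract the `oom_kill` counter from the text of `memory.oom_control`.
pub fn oom_kill_count(content: &str) -> Option<u64> {
    content.lines().find_map(|line| {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next()) {
            // Exact key match, so `oom_kill_disable` is not picked up.
            (Some("oom_kill"), Some(value)) => value.parse().ok(),
            _ => None,
        }
    })
}

/// Forward an OOM event only if the counter shows that a kill happened.
pub fn should_forward_oom(content: &str) -> bool {
    oom_kill_count(content).map_or(false, |count| count > 0)
}
```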
Fixes: #8715
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Don't release the lock between is_allowed and set_policy calls,
because the policy might change in between these calls.
Also, move more policy code into policy.rs.
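The locking pattern can be sketched like this, assuming a toy `AgentPolicy` that is just an allow-list of request names (the real policy engine is far richer; all names here are hypothetical). The point is that a single guard covers both the check and the update.

```rust
use std::sync::Mutex;

/// Toy policy: a list of request names that are allowed.
pub struct AgentPolicy {
    allowed: Vec<String>,
}

impl AgentPolicy {
    pub fn new(allowed: &[&str]) -> Self {
        AgentPolicy { allowed: allowed.iter().map(|s| s.to_string()).collect() }
    }

    pub fn is_allowed(&self, request: &str) -> bool {
        self.allowed.iter().any(|a| a.as_str() == request)
    }

    pub fn set_policy(&mut self, allowed: &[&str]) {
        self.allowed = allowed.iter().map(|s| s.to_string()).collect();
    }
}

/// Check and update under ONE lock acquisition, so that no other thread can
/// change the policy between `is_allowed` and `set_policy`.
pub fn handle_set_policy(policy: &Mutex<AgentPolicy>, new_allowed: &[&str]) -> Result<(), String> {
    let mut guard = policy.lock().map_err(|e| e.to_string())?;
    if !guard.is_allowed("SetPolicyRequest") {
        return Err("SetPolicyRequest is blocked by policy".to_string());
    }
    guard.set_policy(new_allowed);
    Ok(())
}
```

Locking once for the check and again for the update would reopen the window in which another request could replace the policy.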
Fixes: #8734
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
- `ttrpc` from `0.7.1` to `0.8`.
- `containerd-shim-protos` from `0.3.0` to `0.6.0`.
Fixes: #8756
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
To prevent upstream rust-vmm changes from breaking Dragonball
compilation, we introduce Cargo.lock to the dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
As reported in #8767, the root cause is that rust-vmm's vmm-sys-utils
published a new release, 0.12.1, and dbs-pci relies on rust-vmm's vfio-ioctls, which uses >=
to declare its vmm-sys-utils dependency, so vmm-sys-utils is automatically upgraded to 0.12.1.
As a result, two different versions of vmm-sys-utils are pulled in, which breaks the compilation.
To fix this and also avoid future problems, we introduce a Cargo.lock file to the dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Previously, Dragonball did not support PCI device hot-plugging or
VFIO device passthrough. Therefore, the runtime-rs support for
Dragonball was incomplete. It is time to complete it so that users
can use Dragonball's PCI hot-plugging and VFIO passthrough capabilities.
Fixes: #8748
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The vfio commits introduce quite a lot of changes in runtime-rs. This commit
contains all the CI-related changes, including fixes for compilation errors and so on.
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce two new vmm actions to implement pci hotplug
and pci hot-unplug: PrepareRemoveHostDevice and RemoveHostDevice.
PrepareRemoveHostDevice uses an upcall to unregister the pci device
in the guest kernel.
RemoveHostDevice should be called after PrepareRemoveHostDevice; it is used
to clean up the PCI resources on the Dragonball side.
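The required ordering of the two actions can be modelled as a small state machine. This is only a sketch: `HostDevices`, `DeviceState` and the method names are hypothetical, and the upcall and resource cleanup are reduced to state changes.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeviceState {
    Attached,
    PreparedForRemoval,
}

/// Hypothetical registry of passed-through host devices, keyed by device id.
#[derive(Default)]
pub struct HostDevices {
    devices: HashMap<String, DeviceState>,
}

impl HostDevices {
    pub fn insert(&mut self, id: &str) {
        self.devices.insert(id.to_string(), DeviceState::Attached);
    }

    /// Phase 1: stands in for the upcall that unregisters the device in the guest.
    pub fn prepare_remove(&mut self, id: &str) -> Result<(), String> {
        match self.devices.get_mut(id) {
            Some(state) => {
                *state = DeviceState::PreparedForRemoval;
                Ok(())
            }
            None => Err(format!("unknown device {}", id)),
        }
    }

    /// Phase 2: release host-side resources, only valid after phase 1.
    pub fn remove(&mut self, id: &str) -> Result<(), String> {
        match self.devices.get(id).copied() {
            Some(DeviceState::PreparedForRemoval) => {
                self.devices.remove(id);
                Ok(())
            }
            Some(DeviceState::Attached) => {
                Err(format!("device {} was not prepared for removal", id))
            }
            None => Err(format!("unknown device {}", id)),
        }
    }
}
```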
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce a new vmm action InsertHostDevice to passthrough
host pci devices like NIC or GPU devices into guest so that
users could have high performance usage of those devices.
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Add a pcie_topology field to DeviceManager and initialize
pcie_topology when ResourceManager calls DeviceManager's new()
with TopologyConfigInfo.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Before calling the device driver to attach a device, register
the device in the PCIe topology and allocate a PciPath for it.
However, for some hypervisors such as CLH, this allocation is not valid:
when plugging devices into the VM, they return a
DeviceInfo containing the actual PciPath. In that case, the PciPath in the
PCIe topology is updated with the returned pci path, to prevent the
inferred pcipath from differing from the actual value.
The update is skipped if the pcipath value doesn't change.
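A rough sketch of the register-then-update flow, under simplifying assumptions: `PciPath` is reduced to a list of slot numbers, slots are allocated sequentially, and the `PCIeTopology` methods shown here are hypothetical rather than the real runtime-rs API.

```rust
use std::collections::HashMap;

/// Simplified PCI path: a list of slot numbers (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PciPath(pub Vec<u8>);

#[derive(Default)]
pub struct PCIeTopology {
    next_slot: u8,
    devices: HashMap<String, PciPath>,
}

impl PCIeTopology {
    /// Register a device before attach and allocate an inferred path for it.
    pub fn register(&mut self, device_id: &str) -> PciPath {
        let path = PciPath(vec![self.next_slot]);
        self.next_slot += 1;
        self.devices.insert(device_id.to_string(), path.clone());
        path
    }

    /// Replace the inferred path with the one reported by the hypervisor.
    /// Returns `true` only if the stored value actually changed.
    pub fn update(&mut self, device_id: &str, reported: PciPath) -> bool {
        if let Some(current) = self.devices.get_mut(device_id) {
            if *current != reported {
                *current = reported;
                return true;
            }
        }
        false
    }

    pub fn path_of(&self, device_id: &str) -> Option<&PciPath> {
        self.devices.get(device_id)
    }
}
```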
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce helper macros to simplify PCIe device register/unregister
and update, which provides a convenient way to handle devices in
topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add one more argument of type &mut Option<&mut PCIeTopology>
to attach and detach to introduce methods within PCIe Topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Implement the PCIeDevice trait's register/unregister for pcie/pci
devices, such as vfio devices, which need to set/get the device's pci
path for the kata agent's device handler.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce Trait PCIeDevice with register/unregister, which are
used to register or unregister pcie device within the PCIe topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Because different VMMs handle PCI devices in different ways,
we want to provide a general PCIe topology processing framework
that is as compatible as possible with VMMs such as dragonball,
qemu and clh (though clh has its own management method, there is no conflict).
Currently, it's mainly developed for the kinds of PCIe/PCI devices in
dragonball/clh which are attached to the pci/pcie root bus directly.
More will be added when Qemu is ready in runtime-rs.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add TopologyConfigInfo to store device config info for PCIe/PCI
devices in the VM from the Hypervisor DeviceInfo.
TopologyConfigInfo::new will be the entry point for initializing the PCIe
topology for each VM.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The vfio mod collects lots of information related to vfio operations, including VfioMsi and VfioMsix capability & state,
vfio interrupt info, pci region info and vfio pci device info & state.
fixes: #8722
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Bridge the gap between user requirements for direct block device access
and the DirectVolume capabilities provided by Kata runtimes
(kata-runtime/runtime-rs), and facilitate seamless integration with CSI
to improve user experience.
It aims to integrate DirectVolume CSI support into Kata, enabling users
to benefit from its performance and flexibility advantages.
Fixes: #8602
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch introduces a feature of supporting vhost-user-blk device.
This device needs to be defined before the VM instance is started,
which can be done through the dbs-cli tool with --virblks option:
--virblks '{
"drive_id": "8623",
"device_type": "Spdk",
"path_on_host": "spdk:///var/tmp/vhost.sock",
"is_root_device": false,
"is_read_only": false,
"is_direct": false,
"no_drop": false,
"num_queues": 1,
"queue_size": 256
}'
Fixes: #8631
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: fupan <fupan.lfp@antgroup.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
The compiler will give a warning if a developer forgets to add an arm for
a newly defined variant.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
DAN reads vhost-user-net device from JSON config. It only supports VMM
running as server right now.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes involve:
- Expose VhostUserConfig struct to runtime-rs.
- Set a default value when num_queues or queue_size is 0.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This commit introduces VhostUserEndpoint and the related support for
vhost-user-net devices in the device manager. For now, Dragonball is able to
attach vhost-user-net devices.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The config space of network devices is shared and conforms to the virtio 1.1 spec,
so it makes sense to abstract the common part into one function.
`set_config_space()` implements this.
Plus, this patch removes `vq_pairs` from vhost-net devices, since there is
a possibility of data inconsistency. For example, some places read it
from `self.vq_pairs`, others read it from `queue_sizes.len() / 2`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Also support alternative media type and update samples
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Add application that infers K8s user's intentions based on user's
K8s YAML file, and generates a Rego/OPA based policy for that YAML.
Just Pod YAML files are supported as input using this initial source
code. Support for other types of YAML files will come with upcoming
commits.
Fixes: #7673
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Introduce the msi/msix mod to maintain information for the PCI Message Signalled
Interrupt Extended Capability. It will be initialized when parsing pci
configuration space and used when getting interrupt capabilities.
fixes: #8661
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This PR introduces vhost-user-net devices to Dragonball. The devices are
allowed to run as server on the VMM side.
Fixes: #8502
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Vhost-user-net has a dependency on address space from `MmioV2DeviceState`.
This patch adds the address space. Plus, it
makes sure all unit tests pass the corresponding parameter as well.
Fixes: #8502
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
It's important to ensure that the tasks which set up vfio
devices are completed before add_device.
So move the vfio device setup code to a dedicated method called at device
building time, which does not affect the behavior of other code.
This change makes it easier to understand the difference
between create and attach, and also makes the boundaries
clearer.
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Make VhostUserConfig pci_path's type more specific, change it
from Option<String> to Option<PciPath>.
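The benefit of the typed field is that malformed values are rejected once, at parse time. A sketch under stated assumptions: this `PciPath` and its `slot/slot` hexadecimal text format are simplifications made up for illustration, and `parse_pci_path` is a hypothetical helper.

```rust
use std::str::FromStr;

/// Simplified typed PCI path (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PciPath {
    pub slots: Vec<u8>,
}

impl FromStr for PciPath {
    type Err = String;

    /// Parse slash-separated hexadecimal slot numbers, e.g. "02/0a".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let slots = s
            .split('/')
            .map(|part| {
                u8::from_str_radix(part, 16)
                    .map_err(|e| format!("invalid slot {:?}: {}", part, e))
            })
            .collect::<Result<Vec<u8>, String>>()?;
        Ok(PciPath { slots })
    }
}

/// Convert an optional raw string into an optional typed path,
/// surfacing parse errors instead of carrying a bad string around.
pub fn parse_pci_path(raw: Option<&str>) -> Result<Option<PciPath>, String> {
    raw.map(|s| s.parse::<PciPath>()).transpose()
}
```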
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add implementations for the following `Hypervisor` trait methods which
simply return the same details as the `get_vmm_master_tid()` method:
- `get_thread_ids()`
- `get_pids()`
Fixes: #6438.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This is to reintroduce a configuration rule for IBM Z Secure Execution,
where no initrd path should be configured. For the TEE of interest,
only a kernel image should be specified with `confidential_guest=true`.
Fixes: #8692
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`make SUPPORT_VIRTUALIZATION=1 test` iterates through all subcrates and
runs their tests.
Plus, this patch fixes some issues with the unit tests:
- Too many parameters were passed to `I8042Device::new()`.
- Virtqueue checks have been introduced since `virtio-queue v0.7.0`.
- GHA might have no access to the `/var/tmp` dir on the runner.
Fixes: #8690
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
As a follow-up for #8516, guest_cid and vhost_fd are not necessarily initialised
via new(). Instead, the fields should be initialised later when they are really
used to construct hypervisor's parameters.
This commit is to separate init_config() from new() to initialise guest_cid
and vhost_fd and leave only the assignment of id for the existing function.
Fixes: #8671
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We use a matching direct-volume path to determine whether an OCI mount
is a DirectVolume. However, we should handle the case where no match is
found appropriately.
When judging the OCI mount, this case is now treated as a non-DirectVolume
type rather than as a failure.
Fixes: #8619
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The DirectVolume structure in runtime-rs is different from the one in kata-runtime,
so the two have no unified way of handling DirectVolumeMountInfo
and MountInfo.
We should align the two by simply adding the attribute #[serde(rename="x")]
to each field in DirectVolumeMountInfo.
Fixes: #8619
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Dragonball sets a default queue config in the case of `None`. The
queue_size and num_queues of vhost-net are set to `Some(0)` by default.
Therefore, we might get an invalid queue config. This patch fixes this
issue.
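One way to express the fix is to normalise `Some(0)` to `None` before the default is applied. This is a sketch with hypothetical helper names and an arbitrary `u16` width, not the actual Dragonball code.

```rust
/// Treat an explicit zero the same as "not configured".
pub fn normalize_queue_value(value: Option<u16>) -> Option<u16> {
    value.filter(|&v| v != 0)
}

/// Resolve the value to use, falling back to `default` for `None` and `Some(0)`.
pub fn effective_queue_value(configured: Option<u16>, default: u16) -> u16 {
    normalize_queue_value(configured).unwrap_or(default)
}
```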
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This patch sets vhost-net as the default networking backend. It allows users
to set `disable_vhost_net` to `true` to re-enable the virtio-net backend.
Plus, which backend to use is a matter for the hypervisor; runtime-rs
no longer needs to know that.
Fixes: #8608
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Make the CH (Cloud Hypervisor) `stop_vm()` method check the VM state before
attempting to stop the VM, and update the state once the VM has stopped.
This avoids the method failing if called multiple times which will
happen if the workload exits before the container manager requests that
the container stop.
This change ensures the CH driver finishes cleanly.
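The state check can be sketched as below. `Vm`, `VmState` and the request counter are hypothetical stand-ins; the counter replaces the real shutdown call so the idempotence is observable.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum VmState {
    Running,
    Stopped,
}

pub struct Vm {
    state: VmState,
    shutdown_requests: u32,
}

impl Vm {
    pub fn new_running() -> Self {
        Vm { state: VmState::Running, shutdown_requests: 0 }
    }

    /// Stop the VM; calling this again after the VM has stopped is a no-op.
    pub fn stop_vm(&mut self) -> Result<(), String> {
        if self.state != VmState::Running {
            return Ok(());
        }
        // Stand-in for the real shutdown request to the hypervisor.
        self.shutdown_requests += 1;
        self.state = VmState::Stopped;
        Ok(())
    }

    pub fn state(&self) -> VmState {
        self.state
    }

    pub fn shutdown_requests(&self) -> u32 {
        self.shutdown_requests
    }
}
```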
Fixes: #8629.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Add a `--show-default-config-paths` command line option for parity with
`kata-runtime`.
Note that this requires the `KataCtlCli.command` to be optional so that
the user can run simply:
```bash
$ kata-ctl --show-default-config-paths
```
... without also specifying a (sub-)command.
Fixes: #8640.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The packed virtqueue layout isn't supported by `Endpoint::negotiate()`.
Communication between device and driver will fail because the virtqueue
cannot be parsed if we don't disable the packed feature. This patch
fixes this issue.
Fixes: #8633
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
In order to follow up the PCI implementation in Dragonball, we need to
add PCI root device and root bus support.
The root device is a pseudo PCI root device that manages access to the PCI
configuration space.
The root bus mainly emulates the PCI root bridge and also creates the PCI
root bus with the given bus ID on the PCI root bridge.
fixes: #8563
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Right now, the cargo fmt check in Dragonball only tests with the default
features but not all features. This leaves some code unchecked
by the fmt tool.
This PR adds the --all option to the Dragonball CI and also fixes some code
that was not formatted with cargo fmt --all.
fixes: #8598
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
If a wrong configuration.toml file is used accidentally, the runtime-rs
binary could panic because of unwrap().
This fixes the panic by returning errors instead of calling unwrap().
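The general shape of such a change, sketched with hypothetical names: the configuration is reduced to a map from hypervisor name to binary path, and `ConfigError` and `hypervisor_path()` are made up for illustration.

```rust
use std::collections::HashMap;
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    MissingHypervisor(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            ConfigError::MissingHypervisor(name) => {
                write!(f, "no hypervisor section named {:?} in configuration", name)
            }
        }
    }
}

impl std::error::Error for ConfigError {}

/// Look up a hypervisor's path. A missing entry becomes an error value that
/// the caller can report, where `.unwrap()` on the lookup would panic.
pub fn hypervisor_path<'a>(
    hypervisors: &'a HashMap<String, String>,
    name: &str,
) -> Result<&'a str, ConfigError> {
    hypervisors
        .get(name)
        .map(String::as_str)
        .ok_or_else(|| ConfigError::MissingHypervisor(name.to_string()))
}
```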
fixes: #8565
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Log-parser-rs was always intended to become a sub-functionality of
kata-ctl, but it was useful to develop it and initially merge it as a
standalone program, and migrate it to a subcommand later.
Fixes: #6797
Signed-off-by: Gabe Venberg <gabevenberg@gmail.com>
Since cloud-hypervisor is no longer built as an optional feature,
let's mention cloud-hypervisor in the list of hypervisors supported by
runtime-rs.
Fixes: #8587
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This patch implements the virtio-fs device used for filesystem sharing
and heavily based on the vhost-user protocol.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
Implement try_from trait function to convert runtime-rs BlockConfig
to cloud-hypervisor DiskConfig. This can allow for code reuse in the
future.
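A sketch of what such a conversion can look like. The two structs below are hypothetical, heavily reduced versions with made-up fields, not the real runtime-rs `BlockConfig` or cloud-hypervisor `DiskConfig`.

```rust
use std::convert::TryFrom;
use std::path::PathBuf;

/// Hypothetical subset of the runtime-side block device config.
#[derive(Debug)]
pub struct BlockConfig {
    pub path_on_host: String,
    pub is_readonly: bool,
}

/// Hypothetical subset of the hypervisor-side disk config.
#[derive(Debug, PartialEq)]
pub struct DiskConfig {
    pub path: PathBuf,
    pub readonly: bool,
}

impl TryFrom<BlockConfig> for DiskConfig {
    type Error = String;

    fn try_from(blk: BlockConfig) -> Result<Self, Self::Error> {
        // Validation lives in one place and is reused by every caller.
        if blk.path_on_host.is_empty() {
            return Err("block device has no host path".to_string());
        }
        Ok(DiskConfig {
            path: PathBuf::from(blk.path_on_host),
            readonly: blk.is_readonly,
        })
    }
}
```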
Fixes: #8581
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Add some test cases for vhost-user-fs function.
Signed-off-by: Beiyue <beiyue@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
This patch implements the virtio-fs device used for filesystem sharing
and heavily based on the vhost-user protocol.
This vhost-user-fs device defines 5 parameters:
- path: vhost-user socket path
- tag: mount tag used from the guest to mount the filesystem
- req_num_queues: number of request virtqueues
- queue_size: depth of each virtqueue
- cache_size: cache window size for dax
This device needs to be defined before the VM instance is started,
which can be done through the dbs-cli tool with --fs option:
--fs '{
"sock_path":"/path/to/virtiofs.socket",
"tag":"myfs",
"num_queues":1,
"queue_size":1024,
"cache_size":0,
"thread_pool_size":1,
"cache_policy":"auto",
"writeback_cache":true,
"no_open":true,
"xattr":true,
"drop_sys_resource":false,
"mode":"vhostuser",
"fuse_killpriv_v2":true,
"no_readdir":false
}'
Fixes: #8428
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
This is required for clh to work with nerdctl and docker.
This fixes the issues seen with nerdctl while starting a container.
However, container exit with docker is still broken due to an unrelated
issue.
Fixes: #8579
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
PR #8483 changed the location of the rust runtime config files to
`/etc/kata-containers/runtime-rs/`. However, if you haven't updated your
system to create that directory, attempting to create a container using
the rust runtime was giving the following cryptic message
(formatted for easier reading):
```
failed to handler message try init runtime instance
Caused by:
0: load config
1: load toml config
2: entity not found
```
Now, the message is as follows (again, reformatted for easier reading):
```
failed to handle message try init runtime instance
Caused by:
0: load config
1: load TOML config failed (tried [
\"/etc/kata-containers/runtime-rs/configuration.toml\",
\"/usr/share/defaults/kata-containers/runtime-rs/configuration.toml\",
\"/opt/kata/share/defaults/kata-containers/runtime-rs/configuration.toml\"
])
```
Fixes: #8557.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Bring support for legacy vsock and add Vsock to the ResourceConfig
enum type, and add the processing flow of the Vsock device to the
prepare_before_start_vm function.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce prepare_vm_socket_config to VirtSandbox for vm
socket config, including Vsock and Hybrid Vsock.
Use the capabilities() trait of the hypervisor to get the
vm socket supported in VMM.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add the HybridVsockSupport Cap for the hypervisors CLH and Dragonball,
which use hybrid-vsock, and keep the default for Qemu, which uses legacy vsock.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce the HybridVsock Cap to determine which kind of vm socket is
supported by the Hypervisor.
Use `is_hybrid_vsock_supported` to tell whether a hypervisor supports
hybrid-vsock; if not, it supports legacy vsock.
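The capability check can be sketched as a flag test. Only `is_hybrid_vsock_supported` comes from the text above; the `Capabilities` layout, the `CAP_HYBRID_VSOCK` bit, `set_hybrid_vsock()` and `select_vm_socket()` are assumptions made for illustration.

```rust
/// Hypothetical bit assigned to the hybrid-vsock capability.
const CAP_HYBRID_VSOCK: u8 = 1 << 0;

#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Capabilities {
    flags: u8,
}

impl Capabilities {
    pub fn set_hybrid_vsock(&mut self) {
        self.flags |= CAP_HYBRID_VSOCK;
    }

    pub fn is_hybrid_vsock_supported(&self) -> bool {
        (self.flags & CAP_HYBRID_VSOCK) != 0
    }
}

#[derive(Debug, PartialEq)]
pub enum VmSocket {
    Hybrid,
    Legacy,
}

/// Hypervisors that do not advertise hybrid-vsock fall back to legacy vsock.
pub fn select_vm_socket(caps: &Capabilities) -> VmSocket {
    if caps.is_hybrid_vsock_supported() {
        VmSocket::Hybrid
    } else {
        VmSocket::Legacy
    }
}
```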
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Remove the redundant `kata-ctl` `root` check when running the `env`
command. This check duplicated the `GuestProtection` check, and that
check is now no longer necessary anyway.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
It is no longer necessary to be `root` to query the guest protection
(TDX) on `x86_64` systems, so drop the requirement.
> **Note:**
>
> This change drops the `nix` `Uid` import required for the `root` check.
> But at the same time it adds it for PPC64le since that implementation of
> `available_guest_protection()` needs it and it was previously missing.
Fixes: #8548.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
PR #8311 inadvertently broke the logging since no log messages below the
`Info` level are logged now, regardless of the requested log level.
Resolve the issue by storing the requested log level in the
`RuntimeComponentLevelFilter` and using that level in the `log()`
function, rather than hard-coding `Info` as the default where no entry
is found in the `FILTER_RULE` hashmap.
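A minimal sketch of the fallback described above, assuming a simplified filter: the `Level` enum, the `ComponentLevelFilter` struct and its `enabled()` method are hypothetical stand-ins for `RuntimeComponentLevelFilter`, `FILTER_RULE` and `log()`, and do not use the real logging crates.

```rust
use std::collections::HashMap;

/// Simplified log levels, ordered from least to most verbose.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Hypothetical filter: per-component rules plus the requested level.
struct ComponentLevelFilter {
    /// Level requested by the user; used when no rule matches.
    requested: Level,
    /// Per-component overrides (stand-in for the `FILTER_RULE` hashmap).
    rules: HashMap<String, Level>,
}

impl ComponentLevelFilter {
    /// A record passes if it is no more verbose than the applicable level.
    /// The fallback is the requested level, not a hard-coded `Info`.
    fn enabled(&self, component: &str, level: Level) -> bool {
        let max = self.rules.get(component).copied().unwrap_or(self.requested);
        level <= max
    }
}
```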
Fixes: #8546.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Device mapper is the only supported block device driver so far,
which is limiting. Kata Containers can work well with other
block devices, so add support for multiple kinds of host block
device.
Fixes: #4714
Signed-off-by: yuchen.cc <yuchen.cc@alibaba-inc.com>
This commit initializes the dbs-pci library for Dragonball to use.
It currently contains the following implementations:
1. PCI configuration space
2. PCI bus
More information on the design and behavior of these two features can be
found in the README of dbs-pci.
Fixes: #8479
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This reverts commit b0157ad73a.
```
commit b0157ad73a
Refs: 3.3.0-alpha0-124-gb0157ad73
Author: Fabiano Fidêncio <fabiano.fidencio@intel.com>
AuthorDate: Fri Aug 11 14:55:11 2023 +0200
Commit: Fabiano Fidêncio <fabiano.fidencio@intel.com>
CommitDate: Fri Nov 10 12:58:20 2023 +0100
runtime: confidential: Do not set the max_vcpu to cpu
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
```
This commit was removing a requirement that was made previously, but due
to the SMP issue we're facing with the QEMU used for TDX (see commit
d1b54ede290e95762099fff4e0bcdad10f816126*), QEMU will fail to start due
to:
```
Invalid CPU topology: product of the hierarchy must match maxcpus:
sockets (1) * dies (1) * cores (1) * threads (1) != maxcpus (240)"
```
This has no effect on the SEV / SNP workflow and hopefully we'll be able
to re-revert this soon enough, when this gets solved on the QEMU side.
Last but not least, this is not a "clean" revert as we're using
conf.NumVCPUs() instead of conf.NumVCPUs, to ensure we're dealing with
uint32.
Fixes: #8532
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Update the remote hypervisor code to match the regenerated code for
the ttrpc Hypervisor Service
Fixes: #8519
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The new clean-generated-files make target allows for removing the
generated files (including the configuration.toml files).
The tools/packaging/static-build/shim-v2/build.sh script now uses that
target to always force the re-generation of those files.
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
As the configuration files are different, we can safely remove those as
any new installation of the binary should also bring in the new
configurations.
This makes things less error-prone in the future, as we're ensuring that
the rust runtime will only be reading the rust configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Update the `DEFAULT_RUNTIME_CONFIGURATIONS` list to include a number of
rust runtime specific paths to try to load before checking the
"traditional" (golang) runtime configuration paths.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
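The lookup order can be sketched roughly as below. This is an illustration only: `DEFAULT_CONFIG_PATHS` reuses the three runtime-rs paths quoted in the error message earlier in this log (the real `DEFAULT_RUNTIME_CONFIGURATIONS` list is longer), and `find_config` with its injectable `exists` check is a hypothetical helper.

```rust
use std::path::{Path, PathBuf};

/// The runtime-rs paths quoted in the error message above, in lookup order.
const DEFAULT_CONFIG_PATHS: [&str; 3] = [
    "/etc/kata-containers/runtime-rs/configuration.toml",
    "/usr/share/defaults/kata-containers/runtime-rs/configuration.toml",
    "/opt/kata/share/defaults/kata-containers/runtime-rs/configuration.toml",
];

/// Return the first candidate for which `exists` is true. On failure,
/// report every path that was tried instead of a bare "not found".
/// `exists` is injected so the sketch does not touch the real filesystem.
fn find_config<F>(candidates: &[&str], exists: F) -> Result<PathBuf, String>
where
    F: Fn(&Path) -> bool,
{
    for candidate in candidates {
        let path = Path::new(candidate);
        if exists(path) {
            return Ok(path.to_path_buf());
        }
    }
    Err(format!("load TOML config failed (tried {:?})", candidates))
}
```

In real use the check would be something like `|p| p.exists()`; the closure parameter only keeps the example testable.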
Install the rust runtime configuration files to a `runtime-rs/`
directory to distinguish them from the golang config files (which may
have a different syntax).
The default values mean that the rust config files are now installed to
`/opt/kata/share/defaults/kata-containers/runtime-rs/` rather than
`/opt/kata/share/defaults/kata-containers/`.
See: https://github.com/kata-containers/kata-containers/issues/6020
Fixes: #8444.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
(1) Add enum DirectVolumeType for direct volumes.
(2) Reimplement spdk volume into direct_volume and
do alignment of rawblock volume.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
The current infrastructure (K8S, CSI, CRI, Containerd) for Kata Containers
is unable to properly handle direct volumes, resulting in the need for
workarounds such as searching/comparing and then patching up the volume type.
This commit reimplements the handling to support raw block volumes
whose backends may be a raw disk or a file in another format.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
As the vsock device will be used in Qemu or other VMMs, Vsock
is reintroduced to the DeviceType enum.
Fixes: #8474
Signed-off-by: Pavel Mores <pmores@redhat.com>
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
It is currently difficult to clone VsockConfig because the vhost fd is
managed implicitly within runtime-rs. That responsibility should be
delegated to the VMM (especially QEMU) child process, as it is not a core
responsibility of runtime-rs. Remove the vhost_fd member from VsockConfig
and make VsockConfig/VsockDevice Cloneable.
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce a new function generate_vhost_vsock_cid to generate
a guest CID and set guest CID for vsock fd.
This commit introduces no functional change; the code is just
split out of the previous VsockDevice::new().
Fixes: #8474
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
(1) rawblock volume is directvol mount type.
(2) block volume is based on the bind mount type.
Fixes: #8300
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
1) Create storage for all `io.katacontainers.volume=` messages in rootFs.Options,
and then aggregate all storages into `containerStorages`.
2) Create storage for other data volumes and push them into `volumeStorages`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
To enhance the construction and administration of `KataVirtualVolume` storages,
this commit expands the 'sharedFile' structure to manage both
rootfs storages (`containerStorages`), including `KataVirtualVolume`, and other data volume storages (`volumeStorages`).
NOTE: `volumeStorages` is intended for future extensions to support Kubernetes data volumes.
Currently, `KataVirtualVolume` is exclusively employed for container rootfs, hence only `containerStorages` is actively utilized.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The snapshotter places `KataVirtualVolume` information
into 'rootfs.options', prefixed with 'io.katacontainers.volume='.
The purpose of this commit is to transform the encapsulated KataVirtualVolume data into device information.
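As a rough sketch of the first step only (finding the relevant entries), assuming the options arrive as a list of strings: `extract_volume_payloads` is a hypothetical helper, and decoding the payload into device information is deliberately left out because its encoding is not described here.

```rust
/// Prefix used by the snapshotter, as stated in this message.
const VOLUME_OPTION_PREFIX: &str = "io.katacontainers.volume=";

/// Return the payload of every option that carries the volume prefix,
/// in the order the options appear. Other options are ignored.
/// Hypothetical helper: the real code goes on to decode each payload.
fn extract_volume_payloads(options: &[String]) -> Vec<&str> {
    options
        .iter()
        .filter_map(|opt| opt.strip_prefix(VOLUME_OPTION_PREFIX))
        .collect()
}
```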
Fixes: #8495
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Feng Wang <feng.wang@databricks.com>
Co-authored-by: Samuel Ortiz <sameo@linux.intel.com>
Co-authored-by: Wedson Almeida Filho <walmeida@microsoft.com>
Update cloud hypervisor implementation to allow hybrid vsock device to
be handled.
Fixes: #6692
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
In the case of Cloud Hypervisor running on arm64 architecture,
only arm AMBA UART (pl011) is supported as the TTY. Consequently,
when enabling Hypervisor debug mode, it's essential to configure
the console as "ttyAMA0" rather than "ttyS0".
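A minimal sketch of that choice, assuming the architecture is available as a string such as `std::env::consts::ARCH`; the `debug_console_for` helper is a hypothetical name, not the function used in the runtime.

```rust
/// Pick the debug console name for Cloud Hypervisor by architecture.
/// On arm64 ("aarch64") only the AMBA UART (pl011) is available as a TTY,
/// so the console is "ttyAMA0"; every other architecture keeps "ttyS0".
fn debug_console_for(arch: &str) -> &'static str {
    if arch == "aarch64" {
        "ttyAMA0"
    } else {
        "ttyS0"
    }
}
```

A caller could pass `std::env::consts::ARCH` to select the console for the host the runtime was built for.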
Fixes: #8381
Signed-off-by: briwan01 <brian.wang@arm.com>
The test utils will be used by the upcoming feature tests: vhost-user-net,
vhost-user-blk and vhost-user-fs.
Signed-off-by: Beiyue <beiyue@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
The vhost-user connection management logic will be used by
the upcoming features: vhost-user-net, vhost-user-blk and
vhost-user-fs.
Fixes: #8448
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
Signed-off-by: Huang Jianan <jnhuang@linux.alibaba.com>
This commit focuses on two parts:
(1) Redesign ShareFsConfig with ShareFsMountConfig
The device mount operation depends on the sharefs device existing,
so restructure ShareFsConfig and move ShareFsMountConfig into it
as an Option type, which describes the relation between
ShareFsConfig and ShareFsMountConfig.
(2) Move virtiofs into the device manager
Currently, virtio-fs is still outside of the device manager.
To enhance the device manager, bring the virtio-fs device into
it for unified management.
Fixes: #7915
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In order to support PCI VFIO functionality in Dragonball, we should
first add PCI bus and PCI device interrupt information to the Dragonball
mptable setup process.
This patch adds:
1. pci_legacy_irqs transferred to the setup_mptable function.
2. PCI bus support in mptable mem
3. PCI interrupt support in mptable mem
Fixes: #8449
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Add the corresponding data structure in the runtime part according to
kata-containers/kata-containers/pull/7698.
Fixes: #8472
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In order to support different pod VM instance type via
remote hypervisor implementation (cloud-api-adaptor),
we need to pass machine_type, default_vcpus
and default_memory annotations to cloud-api-adaptor.
The cloud-api-adaptor then uses these annotations to spin
up the appropriate cloud instance.
Reference PR for cloud-api-adaptor:
https://github.com/confidential-containers/cloud-api-adaptor/pull/1088
Fixes: #7140
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
(based on commit 004f07f076)
This patch updates the template configuration file for
the remote hypervisor to set static_sandbox_resource_mgmt
to be true. The remote hypervisor uses the peer pod config
to determine the sandbox size, so requires this to be set to
true by default.
Fixes: #6616
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit 938447803b)
Add the SELinux setting to ensure it is passed through to the remote
hypervisor
Fixes: #5936
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 3ef2fd1784)
This patch adds the support of the remote hypervisor type.
The shim opens a Unix domain socket specified in the config file,
and sends TTRPC requests to an external process to control
sandbox VMs.
Fixes: #4482
Co-authored-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(based on commit f9278f22c3)
This patch adds a protobuf definition of the remote hypervisor type.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
(based on commit 150e8aba6d)
This PR fixes propagation of updates to k8s configmaps, secrets, etc. when filesystem sharing is disabled.
The commit introduces the changes below, with some limitations:
- creates new timestamped directory in guest
- updates the '..data' symlink
- creates user visible symlinks to newly created secrets.
- Limitation: The older timestamped directory and stale user visible symlinks exist in guest
due to missing DELETE api in agent.
Fixes: #7398
Signed-off-by: Sumedh Alok Sharma <sumsharma@microsoft.com>
Improve the `CODEOWNERS` file by specifying more groups.
Since GitHub automatically checks the `CODEOWNERS` file when a PR is
created and adds all matching groups as reviewers for the PR, this may
help reduce the PR backlog since the right people will be alerted and
requested to review the PR. That should improve the quality of reviews
(and thus the quality of the landed code). It may also have a positive
effect on PR velocity.
> **Note:**
>
> This PR combines the other `CODEOWNERS` files so we have
> a single, visible, top-level file.
See: https://github.com/kata-containers/community/issues/253
Fixes: #3804.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This commit enables StratoVirt hypervisor to be tested in kata GHA,
including k8s, metrics, cri-containerd, nydus and so on.
It also adds some unit tests for StratoVirt to make sure it works.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Add configuration-stratovirt.toml.in to generate the StratoVirt configuration,
and parser to deliver config to StratoVirt.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Initial support of the MicroVM machine type of StratoVirt
hypervisor for the kata go runtime.
Fixes: #7794
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
When multiple containers in a kata pod share one direct volume,
it's important to make sure that the corresponding block device
is only mounted once in the guest. This means that there should
be only one mount entry for the device in the mount information.
Fixes: #8328
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When a direct volume is used by multiple containers in Kata,
generating many shared paths keyed by cid causes an IO error
because the one direct volume is mounted more than once.
To correct this, use the device_id instead of the cid, which
ensures that the guest only mounts the FS once.
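The effect of keying on device_id can be sketched as below. The `guest_mount_path` layout (`/guest/storage/<device_id>`) and the `plan_guest_mounts` helper are made up for illustration; only the idea of deriving the path from the device_id rather than the cid comes from this message.

```rust
use std::collections::HashSet;

/// Hypothetical guest path derived from the device id only, so every
/// container sharing the device resolves to the same path.
fn guest_mount_path(device_id: &str) -> String {
    format!("/guest/storage/{}", device_id)
}

/// Given (cid, device_id) pairs, return one mount path per distinct
/// device, in first-seen order. The cid plays no part in the path.
fn plan_guest_mounts(uses: &[(&str, &str)]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut mounts = Vec::new();
    for (_cid, device_id) in uses {
        // `insert` returns false when the device was already planned.
        if seen.insert(*device_id) {
            mounts.push(guest_mount_path(device_id));
        }
    }
    mounts
}
```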
Fixes: #8328
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch removes the vhost-net dependency on virtio-net in the
dbs-virtio-devices crate. The vhost-net feature can then be enabled
without enabling the virtio-net device, errors, etc.
Fixes: #8423
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Introduce the `update_device` trait in Hypervisor to enable
device updates for VMMs. This trait will initially be utilized
for virtiofs Mount operations.
Fixes: #7915
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
PR #8311 inadvertently broke the runtime-rs / Cloud Hypervisor TDX
handling. It also introduced unrecoverable failure scenarios. Hence,
replace slow, fallible regex matching in logging fast path with single pass
non-failing multi-string log level matching.
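One way to picture a single pass, non-failing matcher is the sketch below. It is an assumption-laden illustration, not the code of `parse_ch_log_level()`: the `Level` enum, the token spellings and the `Info` fallback are all hypothetical.

```rust
/// Simplified log levels for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Error,
    Warn,
    Info,
    Debug,
    Trace,
}

/// Hypothetical level tokens searched for in a log line.
const TOKENS: [(&str, Level); 5] = [
    ("ERROR", Level::Error),
    ("WARN", Level::Warn),
    ("INFO", Level::Info),
    ("DEBUG", Level::Debug),
    ("TRACE", Level::Trace),
];

/// Walk the line once and return the level of the first token found.
/// Never fails: a line without a token falls back to `Info`
/// (the fallback value is an assumption of this sketch).
fn parse_level(line: &str) -> Level {
    for (idx, _) in line.char_indices() {
        let rest = &line[idx..];
        for (token, level) in TOKENS.iter() {
            if rest.starts_with(*token) {
                return *level;
            }
        }
    }
    Level::Info
}
```

Unlike a regex, nothing here needs compiling and there is no error path, which is the property the fast path needs.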
Also, added a unit test for `parse_ch_log_level()`.
Fixes: #8418.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
- Remove two panic statements from InsertNetworkDevice test.
- Rename `NUM_QUEUES` to `DEFAULT_NUM_QUEUES`, `QUEUE_SIZE` to
`DEFAULT_QUEUE_SIZE` for vhost-net and virtio-net.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
`test_networkconfig_to_netconfig` from clh depends on `NetworkConfig` which
has some new fields in this PR. Therefore, this commit gives the test
missing fields.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Dragonball's vhost-net feature no longer depends on the virtio-net feature.
- Remove `TapError` from dbs-virtio-devices's Error, and add two fields,
`VirtioNet` and `VhostNet`.
- Downgrade visibility of two fields of `VhostNetDeviceMgr` from
`pub(crate)`.
- File an issue to record a todo for network rate limiter.
- Print internal errors with `{0:?}`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
With the change done in the last commit, instead of calculating milli
cpus, we're actually converting the CPUs to a fraction number, a float.
Let's update the function name (and associated vars) to represent that
change.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
First of all, this is a controversial piece, and I know that.
In this commit we're trying to take a less greedy approach regarding the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.
The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs
The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs. However, it leads us to, in several cases, be
wasting one vCPU.
Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.
In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs
With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
described above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPU
* Starts / Updates the VMM to use 1 vCPU
In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).
There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else. Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.
One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.
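The two flows compared above can be written down as a small sketch. The function names are hypothetical and the actual change lives in the golang runtime; this only reproduces the arithmetic of the 500m + 250m example.

```rust
/// Old flow: each value is rounded up to an integer before summing.
fn vcpus_rounded_then_summed(vmm_vcpus: f64, workload_limit: f64) -> u32 {
    vmm_vcpus.ceil() as u32 + workload_limit.ceil() as u32
}

/// New flow: keep everything as a float and round up only once,
/// just before starting / updating the VMM.
fn vcpus_summed_then_rounded(vmm_vcpus: f64, workload_limit: f64) -> u32 {
    (vmm_vcpus + workload_limit).ceil() as u32
}
```

For a VMM needing 0.5 vCPUs and a workload limit of 0.25, the first function yields 2 and the second yields 1, matching the example above.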
Fixes: #6909
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
We don't have to do this since we're relying on the
`static_sandbox_resource_mgmt` feature, which gives us the correct
amount of memory and CPUs to be allocated.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
TestCheckHostIsVMContainerCapable removes sysModuleDir to simulate a
case where the kernel modules are not loaded. However,
checkKernelModules() executes modprobe <module> if a module is not
found in that directory, so loading those modules must be denied
temporarily.
Fixes: #8390
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Disable device cgroup restriction while pod cgroup is not available.
- Remove blacklist-related names and change whitelist-related names to
allowed_all.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The FSManager of the systemd cgroup manager is responsible for setting up
the cgroup path. Container launch will fail if the FSManager is in
read-only mode.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes include:
- Change the logging level to debug for resources after they are processed.
- Remove a todo for pod cgroup cleanup.
- Add an anyhow context to `get_paths_and_mounts()`.
- Remove code which denies access to VMROOTFS since it won't take effect. If
blacklist mode is in use, the VMROOTFS will be denied by default. Otherwise,
device cgroups won't be updated in whitelist mode.
- Add a unit test for `default_allowed_devices()`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
runk is a standard OCI runtime that isn't aware of the concept of a sandbox.
Therefore, the `devcg_info` argument of `LinuxContainer::new()` does not
need to be provided.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The goal is to guarantee that containers cannot escape to access extra
devices, such as the VM rootfs.
Assume that there is a cgroup, such as `/A/B`. `B` is the container cgroup,
and `A` is what we call the pod cgroup. No matter what permissions are
set for the container (`B`), `A`'s permission is always `a *:* rwm`. As a
result, containers could acquire permission to access other devices
in the VM that do not belong to them.
To set the devices cgroup properly, cgroups are set in order:
the pod cgroup comes first and the container cgroup comes after.
The `Sandbox` has a new field, `devcg_info`, to save cgroup states. To
avoid setting the container cgroup too early, initialization must be done
carefully. `inited`, one of the states, is a boolean indicating whether the
pod cgroup is initialized. If not, the pod cgroup is created first and given
default permissions. After that, the pause container cgroup is created
and inherits the permissions from the pod cgroup.
If whitelist mode, which allows containers to access all devices in the VM,
is enabled, then device resources from the OCI spec are ignored.
This feature does not support systemd cgroup or cgroup v2, since:
- Systemd cgroup as implemented in the Agent does not support the devices
subsystem so far, see: https://github.com/kata-containers/kata-containers/issues/7506.
- Cgroup v2's device controller depends on eBPF programs, which is out of
scope of cgroup.
Fixes: #7507
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Since moving from network coldplug to hotplug, the only case verified
was veth endpoints. Support for network hotplug for ipvlan and macvlan was
broken/not added. Fix it.
Fixes: #8391
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Remove the redundant `VmConfigError::EmptyVsockSocketPath` error from
the Cloud Hypervisor config crate since this scenario is already handled
by the `VsockConfigError::NoVsockSocketPath` error.
Fixes: #8385.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the redundant `parse_mac()` function: this was never used and we
already have an implementation in `crates/resource/src/network/utils/mod.rs`.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Virtio-net and vhost-net share a common virtio config, and vhost-user-net
uses another config, named `VhostUserConfig`. Thus, the virtio config could
be added into `NetworkConfig` instead of `Backend`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Move Dragonball struct conversions out of device drivers to keep the drivers
neutral. The conversions include `NetworkBackend` to
`DragonballNetworkBackend` and `NetworkConfig` to
`DragonballNetworkConfig`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Changes include:
- Merge `VhostNetDeviceError` import item.
- Replace if with match in `add_vhost_net_device()`
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Network backends determine the virtio dataplane implementations. Common
protocols include virtio-net, vhost-net and vhost-user-net, etc. Network
config has a new field named `backend` to specify which protocol to use.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
PLEASE NOTE THAT this pull request just implements vhost-net support for
Dragonball and the adaptation for Runtime-rs. This pull request
DOESN'T provide an option to configure which backend to use. To sum up,
virtio-net, the default backend, is the only choice for the user so far.
This pull request introduces vhost-net device for the Dragonball. In
addition, this pull request includes changes of Runtime-rs to improve
network configuration abilities.
The Dragonball part implements a vhost-net device and a vhost-net device
manager, named `VhostNetDeviceMgr`, to manage vhost-net device.
`NetworkInterfaceConfig` is introduced as a high-level abstract for network
config. Then, the Dragonball is able to distinguish network backends, e.g.
virtio-net, vhost-net, vhost-user-net(WIP), etc.
The Runtime-rs part adds support of multiple network backends as well.
`NetworkConfig` has a couple of new fields, like `backend`,
`use_shared_irq`, etc. Dragonball's network config structs implement
the `From` trait, which allows them to be converted from Runtime-rs's
network config conveniently.
Fixes: #7674
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: wllenyj <wllenyj@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
TestCheckHostIsVMContainerCapable is failing on AMD machines.
kata-check_amd64_test.go:96 has no AMD modules, also getCPUType is
missing.
Fixes: #8384.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
This update includes necessary changes due to the version bump of
containerd and its dependencies. It's part of a broader initiative to
phase out gogo protobuf, which has been deprecated, and to align with
the current supported libraries.
Fixes: #7420.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
The +fieldpath option, specific to gogoprotobuf, enabled dynamic field
access in protobuf messages, allowing nested fields to be accessed via
string paths.
This change is part of a larger effort to transition to the official Go
protobuf library for better maintainability and community support.
Upon review, no instances of dynamic field access were found in the
codebase, confirming that the feature is not in use.
By removing this unused feature, we simplify the build process and make
it easier to complete the transition away from gogoprotobuf.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Those mappings are not used by our .proto files and there is no
difference between .pb.go files generated.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Remove earlier functionality that tries to assign PCI path to vfio
devices from the host assuming pci slots to start from 1.
Get this from the hypervisor instead.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
If the PCI path for a block device is not empty, use
that as the identifier for the agent instead of the virt path, which is
valid only for mmio devices.
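The rule can be sketched as a small helper. `agent_device_id` and its string-based parameters are hypothetical simplifications; the real code works on the block device config types.

```rust
/// Choose the identifier handed to the agent for a block device:
/// a non-empty PCI path wins, otherwise fall back to the virt path
/// (which is only meaningful for mmio devices).
fn agent_device_id<'a>(pci_path: Option<&'a str>, virt_path: &'a str) -> &'a str {
    match pci_path {
        Some(path) if !path.is_empty() => path,
        _ => virt_path,
    }
}
```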
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Block(virtio-blk) and vfio devices are currently not handled correctly
by the agent as the agent is not provided with correct PCI paths for
these devices.
The PCI paths for these devices can be inferred from the PCI information
provided by the hypervisor when the device is added.
Hence changing the add_device trait function to return a device copy
with PCI info potentially provided by the hypervisor. This can then be
provided to the agent to correctly detect devices within the VM.
This commit includes implementation for PCI info update for
cloud-hypervisor for virtio-blk devices with stubs provided for other
hypervisors.
Removing Vsock from the DeviceType enum as Vsock currently does not
implement the Device Trait, it has no attach and detach trait functions
among others. Part of the reason is because these functions require Vsock
to implement Clone trait as these functions need cloned copies to be
passed down the hypervisor.
The change introduced for returning a device copy from the add_device
hypervisor trait explicitly requires a device to implement
Copy trait. Hence removing Vsock from the DeviceType enum for now, as
its implementation is incomplete and not currently used.
Note, one of the blockers for adding the Clone trait to Vsock is that it
currently includes a file handle which cannot be cloned. For Clone and
Device Traits to be implemented for Vsock, it requires an implementation
change in the future for it to be cloneable.
Fixes: #8283
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8378
Signed-off-by: Bo Chen <chen.bo@intel.com>
make check is giving us the following error:
error: this expression creates a reference which is immediately
dereferenced by the compiler.
Fixes: #8344
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Modify the RuntimeLevelFilter drain to improve logging control:
isolate the effect of logger changes between components, and
log clh messages according to the log levels
given by cloud-hypervisor.
Fixes: #8310
Signed-off-by: Ruoqing He <linuxwatcher@outlook.com>
This PR adds the tracing capability for dragonball and it depends on the tracing::Subscriber of the upper layer.
Fixes: #7249
Signed-off-by: Songqian Li <mail@lisongqian.cn>
We used the approach of cold-plugging network interfaces for pre-shimv2
support for docker. Since the hotplug approach was not required,
we never really got to implementing hotplug support for certain network
endpoints, ipvlan and macvlan being among them.
Since moving to shimv2 interface as the default for
runtime, we switched to hotplugging the network interface for supporting
docker and nerdctl. This was done for veth endpoints only.
Implement the hot-attach apis for ipvlan and macvlan as well to support
ipvlan and macvlan networks with docker and nerdctl.
Fixes: #8333
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Legacy device manager adds device metrics to METRICS when a device is created and removes metrics when a device is dropped.
Fixes: #7248
Signed-off-by: Songqian Li <mail@lisongqian.cn>
Add the hypervisor security details to the output of the `kata-runtime
env` and `kata-ctl env` commands so the user can see, amongst other
things, the value of `confidential_guest`.
Fixes: #8313.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The config template file for clh is in the new format for runtime-rs.
It is a result of merging the new format file and options supported by
cloud-hypervisor.
Some config options from the golang runtime are missing as they may not
be currently supported by the rust runtime. An example of this is the
selinux options, rate limiting options as these are not currently
supported or verified with the rust runtime.
Fixes: #8249
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Balloon device manager adds balloon device metrics to METRICS when a device is created and removes metrics when a device is dropped.
Fixes: #7248
Signed-off-by: Songqian Li <mail@lisongqian.cn>
This is to skip a flaky test `create_tmpfs()` on s390x until a root cause is identified and fixed.
Fixes: #4248
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Remove the ability to block access to kata agent endpoints by using
agent-config.toml. That functionality is now implemented using the
Agent Policy feature (#7573).
The CCv0 branch relied on blocking endpoints using agent-config.toml
but will set-up an equivalent default policy file instead (#8219).
Fixes: #8228
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Add the missing closing bracket to the output of the TDX details,
so rather than:
```bash
$ sudo kata-ctl env 2>/dev/null | grep available_guest_protection
available_guest_protection = "tdx (major_version: 1, minor_version: 0"
: ^
: Missing ')' !
```
... we now have:
```bash
$ sudo kata-ctl env 2>/dev/null | grep available_guest_protection
available_guest_protection = "tdx (major_version: 1, minor_version: 0)"
: ^
: Aha!
```
Added a unit test for this scenario.
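The fixed output format can be illustrated with a small `Display` implementation. The `GuestProtection` enum below is a hypothetical, cut-down stand-in for the real type, and the "none" string is an assumption; only the TDX string is taken from the output shown above.

```rust
use std::fmt;

/// Cut-down, hypothetical stand-in for the guest protection type.
#[derive(Debug, PartialEq)]
enum GuestProtection {
    NoProtection,
    Tdx { major_version: u32, minor_version: u32 },
}

impl fmt::Display for GuestProtection {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // The "none" spelling is an assumption of this sketch.
            GuestProtection::NoProtection => write!(f, "none"),
            // Note the closing ')' that was previously missing.
            GuestProtection::Tdx { major_version, minor_version } => write!(
                f,
                "tdx (major_version: {}, minor_version: {})",
                major_version, minor_version
            ),
        }
    }
}
```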
Fixes: #8257.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
If you attempt to create a container (a TD) on a TDX system using a
custom build of Cloud Hypervisor (CH) that was not built with the `tdx`
CH feature, Kata will report the following, somewhat cryptic, CH error:
```
ApiError(VmBoot(InvalidPayload))
```
Newer versions of CH now report their build-time features in the ping
API response message so we now use that, if available, to detect this
scenario and generate a user-friendly error message instead.
This change improves the readability of `handle_guest_protection()` and
adds a couple of additional tests for that method.
Fixes: #8152.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Improve the way `handle_guest_protection()` is structured by inverting
the logic and checking the value of the `confidential_guest` setting
before checking the guest protection. This makes the code easier to
understand.
> **Notes:**
>
> - This change also unconditionally saves the available guest protection
> (where previously it was only saved when `confidential_guest=true`).
> This explains the minor unit test fix.
>
> - This change also errors if the CH driver finds an unexpected
> protection (since only Intel TDX is currently tested).
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
ACPI PCI device hotplug is not supported on the QEMU virt machine. The only
way to hotplug a PCI device is the native PCIe way, so we need to create
PCIe root ports by default.
The number of PCIe root ports depends on the following:
1. one port is reserved for a network device by default;
2. the virtio-mem device;
3. enough ports for the vhost-user-blk devices.
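The port count described above can be sketched roughly as follows. This is only an illustration: `rootPortConfig`, its fields and `pcieRootPorts` are hypothetical names, and the assumption that each vhost-user-blk device needs exactly one port is mine, not taken from the runtime code.

```go
package main

// rootPortConfig is a hypothetical summary of the settings that influence
// how many PCIe root ports are created up front (names are assumptions).
type rootPortConfig struct {
	VirtioMem        bool // true when a virtio-mem device will be attached
	VhostUserBlkDevs int  // number of vhost-user-blk devices expected
}

// pcieRootPorts returns the number of root ports to create by default.
func pcieRootPorts(c rootPortConfig) int {
	ports := 1 // one port is always reserved for a network device
	if c.VirtioMem {
		ports++ // one more port for the virtio-mem device
	}
	if c.VhostUserBlkDevs > 0 {
		// Assumption: one port per vhost-user-blk device.
		ports += c.VhostUserBlkDevs
	}
	return ports
}
```

With no optional devices configured the sketch yields a single port, the one reserved for networking.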
Fixes: #7646
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Add a GetEndpointsNum API to the Network interface to get the number of
network endpoints. This is used to calculate the number of PCIe root
ports for QemuVirt.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
1. enable virtio-fs-pro in Dragonball to have the ability to process nydus backend registry
2. change passthrough for rw layer's readonly config to false to have the accurate read write ability.
Fixes: #8013
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Since the Nydus snapshotter was updated in previous commits, the config
passed through to Dragonball during mount_rafs is a RafsConfig instead of
a ConfigV2. Dragonball can only deserialize ConfigV2, so it panics.
We need to add support for RafsConfig.
Fixes: #8013
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Allow access to the ReseedRandomDev endpoint by default. Using false
for ReseedRandomDevRequest was unintended.
Fixes: #8225
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Previously, if you accidentally modified the name of the hypervisor
section in the config file, the default golang runtime gives a cryptic
error message ("`VM memory cannot be zero`"). This can be demonstrated
using the `kata-runtime` utility program which uses the same golang
config package as the actual runtime (`containerd-shim-kata-v2`):
```bash
$ kata-runtime env >/dev/null; echo $?
0
$ sudo sed -i 's!^\[hypervisor\.qemu\]!\[hypervisor\.foo\]!g' /etc/kata-containers/configuration.toml
$ kata-runtime env >/dev/null; echo $?
VM memory cannot be zero
1
```
The hypervisor name is now validated so that the behaviour becomes:
```bash
$ kata-runtime env >/dev/null; echo $?
0
$ sudo sed -i 's!^\[hypervisor\.qemu\]!\[hypervisor\.foo\]!g' /etc/kata-containers/configuration.toml
$ ./kata-runtime env >/dev/null; echo $?
/etc/kata-containers/configuration.toml: configuration file contains invalid hypervisor section: "foo"
1
```
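A minimal sketch of this kind of validation is shown below. The function name, the `knownHypervisors` set and its contents are assumptions made for illustration; only the error message format is taken from the output above.

```go
package main

import "fmt"

// knownHypervisors is an assumed, illustrative set of valid section names;
// the real runtime defines its own list.
var knownHypervisors = map[string]bool{
	"qemu":        true,
	"clh":         true,
	"firecracker": true,
}

// checkHypervisorSections returns an error for the first section name found
// under [hypervisor.*] in the given config file that is not a known one.
func checkHypervisorSections(file string, sections []string) error {
	for _, name := range sections {
		if !knownHypervisors[name] {
			return fmt.Errorf("%s: configuration file contains invalid hypervisor section: %q", file, name)
		}
	}
	return nil
}
```

Failing early on the section name gives the user the actual cause instead of a follow-on symptom such as a zero memory size.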
Fixes: #8212.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Improve the `GuestProtection` handling to detect the version of
Intel TDX available.
The TDX version is now logged by the Cloud Hypervisor driver.
Fixes: #8147.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Direct-volume needs to use the same base64 character set as
kata-runtime/direct-volume does.
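To illustrate why the character set matters, the sketch below compares Go's standard and URL-safe base64 alphabets. Which alphabet kata-runtime/direct-volume actually uses is not stated here, so the URL-safe choice in the helper names is an assumption; the point is only that encoder and decoder must agree.

```go
package main

import "encoding/base64"

// encodeStd uses the standard alphabet, which contains '+' and '/'.
func encodeStd(b []byte) string {
	return base64.StdEncoding.EncodeToString(b)
}

// encodeURLSafe uses the URL-safe alphabet, which replaces '+' and '/'
// with '-' and '_' so the result is safe inside a file system path.
func encodeURLSafe(b []byte) string {
	return base64.URLEncoding.EncodeToString(b)
}

// decodeURLSafe decodes a string that must have been produced with the
// URL-safe alphabet; standard-alphabet input containing '+' or '/' fails.
func decodeURLSafe(s string) ([]byte, error) {
	return base64.URLEncoding.DecodeString(s)
}
```

For the two bytes `0xfb 0xff`, the standard alphabet yields `+/8=` while the URL-safe one yields `-_8=`, so a value encoded with one set cannot be decoded with the other.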
Fixes: #8175
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This change adds support for adding and removing vfio devices for
cloud-hypervisor.
Fixes: #6691
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
To pick up fix for the following issue:
A maliciously crafted HTTP/2 stream could cause excessive CPU
consumption in the HPACK decoder, sufficient to cause a denial of
service from a small number of small requests.
Fixes: #8190
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Enable the Cloud Hypervisor driver (the `cloud-hypervisor` build feature) for the rust runtime.
Fixes: #6264.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This will help to build the agent binary as part of the kata-deploy
localbuild, as we need to pass the DESTDIR to where the agent will be
installed, and also whether we're building the agent with policy support
enabled or not.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We need to do proper sandbox sizing when we're doing cold-plug. Introduce CDI,
the de-facto standard for enabling devices in containers. containerd
will pass through annotations for accumulated CPU, memory and now CDI
devices. With that information, sandbox sizing can be derived correctly.
Fixes: #7331
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
Pause and resume task do not currently update the status of the
container to paused or running, so fix this. This is specifically for
pausing the task and not the VM.
Fixes: #6434
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Allow Cloud Hypervisor to create a confidential guest (a TD or
"Trust Domain") rather than a VM (Virtual Machine) on Intel systems
that provide TDX functionality.
> **Notes:**
>
> - At least currently, when built with the `tdx` feature, Cloud Hypervisor
> cannot create a standard VM on a TDX capable system: it can only create
> a TD. This implies that on TDX capable systems, the Kata Configuration
> option `confidential_guest=` must be set to `true`. If it is not, Kata
> will detect this and display the following error:
>
> ```
> TDX guest protection available and must be used with Cloud Hypervisor (set 'confidential_guest=true')
> ```
>
> - This change expands the scope of the protection code, changing
> Intel TDX specific booleans to more generic "available guest protection"
> code that could be "none" or "TDX", or some other form of guest
> protection.
Fixes: #6448.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Introduce a few new constants (for PCI segment count and FS queues) and
move the disk queue constants to `convert.rs` to allow them to be used
there too.
> **Note:**
>
> This change gives the `ShareFs` code its own set of values rather
> than relying on the disk queue constants.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Modify the Cloud Hypervisor `add_device()` method to add `ShareFs` and
`Network` devices to the list of pending devices since only these two
device types need to be cached before VM startup. Full details in the
comments.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the `VIRTIO_BLK_MMIO` check which appears to have been added
erroneously in the first place.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This patch re-generates the client code for Cloud Hypervisor v35.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8057
Signed-off-by: Bo Chen <chen.bo@intel.com>
The cgroup stats come from resourcecontrol package in the form of pointers
to structs. The sandbox Stat() method incorrectly was expecting structs.
This caused the cpu and memory stats to always be 0, which in turn caused
incorrect pod overhead metrics.
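The failure mode can be reproduced with a small type switch. The `cpuStats` type and both functions are hypothetical stand-ins, not the resourcecontrol package's real types; the sketch only shows why matching on a struct value silently yields zero when the interface holds a pointer.

```go
package main

// cpuStats is a hypothetical stand-in for a cgroup stats struct.
type cpuStats struct {
	TotalUsage uint64
}

// usageExpectingStruct mirrors the buggy pattern: it only matches a struct
// value, so a *cpuStats falls through and the result is always 0.
func usageExpectingStruct(metrics interface{}) uint64 {
	switch m := metrics.(type) {
	case cpuStats:
		return m.TotalUsage
	}
	return 0
}

// usageExpectingPointer matches the pointer type that is actually returned.
func usageExpectingPointer(metrics interface{}) uint64 {
	switch m := metrics.(type) {
	case *cpuStats:
		if m != nil {
			return m.TotalUsage
		}
	}
	return 0
}
```

Because a non-matching type switch is not an error in Go, the mismatch does not fail loudly; it just reports zeros.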
Fixes: #8035
Signed-off-by: Peteris Rudzusiks <rye@stripe.com>
Firecracker supports noflush semantics via the Unsafe cache type.
There is no support for direct I/O, so remove it from the config file.
Fixes: #7823
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Clh supports direct I/O for disks. It doesn't
offer any support for noflush, so stop passing
that option to the cloud-hypervisor internal config.
Fixes: #7798
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Support quoted kernel command line parameters that include space
characters. Example:
dm-mod.create="dm-verity,,,ro,0 736328 verity 1
/dev/vda1 /dev/vda2 4096 4096 92041 0 sha256
f211b9f1921ef726d57a72bf82be23a510076639fa8549ade10f85e214e0ddb4
065c13dfb5b4e0af034685aa5442bddda47b17c182ee44ba55a373835d18a038"
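One way to split such a command line is sketched below. `splitKernelParams` is a hypothetical helper, not the runtime's actual parser, and it assumes that only double quotes group words and that quotes are kept in the resulting parameter.

```go
package main

import (
	"strings"
	"unicode"
)

// splitKernelParams splits a kernel command line on whitespace, except for
// whitespace that appears between a pair of double quotes.
func splitKernelParams(cmdline string) []string {
	var params []string
	var cur strings.Builder
	inQuotes := false
	for _, r := range cmdline {
		switch {
		case r == '"':
			// Toggle quoted mode and keep the quote in the parameter.
			inQuotes = !inQuotes
			cur.WriteRune(r)
		case unicode.IsSpace(r) && !inQuotes:
			// Unquoted whitespace ends the current parameter.
			if cur.Len() > 0 {
				params = append(params, cur.String())
				cur.Reset()
			}
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 {
		params = append(params, cur.String())
	}
	return params
}
```

A naive split on spaces would break the quoted `dm-mod.create=` value into several bogus parameters; tracking the quote state keeps it as one.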
Fixes: #8003
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This is part of a bigger effort to drop gogoprotobuf from our code
base. IIUC, those options are basically used by *pb_test.go, and since
we are dropping gogoprotobuf and those are auto-generated tests, let's
just remove them.
Fixes: #7978.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Users have noticed that this is needed, as CLH does not yet implement a
way to hotplug resources on aarch64.
With this patch, when building for x86_64, I can see that this is the
resulting config:
```
$ ARCH=amd64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=false
```
And when building for aarch64:
```
$ ARCH=arm64 make
...
$ cat config/configuration-clh.toml | grep static_sandbox_resource_mgmt
static_sandbox_resource_mgmt=true
```
Fixes: #7941
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
PR #6146 added the possibility to control QEMU with an extra HMP socket
as an aid for debugging. This is great for development or bug chasing
but this raises some concerns in production.
The HMP monitor allows tampering with the VM state in a variety of ways.
This could be intentionally or mistakenly used to inject subtle bugs in
the VM that would be extremely hard if not even impossible to debug. We
definitely don't want that to be enabled by default.
The feature is currently wired to the `enable_debug` setting in the
`[hypervisor.qemu]` section of the configuration file. This setting has
historically been used to control "debug output" and it is used as such
by some downstream users (e.g. Openshift). Forcing people to have the
extra HMP backdoor at the same time is abusive and dangerous.
A new `extra_monitor_socket` is added to `[hypervisor.qemu]` to give
fine control on whether the HMP socket is wanted or not. This setting
is still gated by `enable_debug = true` to make it clear it is for
debug only. The default is to not have the HMP socket though. This
isn't backward compatible with #6416 but it is for the sake of "better
safe than sorry".
An extra monitor socket makes the QEMU instance untrusted. A warning is
thus logged to the journal when one is requested.
While here, also allow the user to choose between HMP and QMP for the
extra monitor socket. Motivation is that QMP offers way more options to
control or introspect the VM than HMP does. Users can also ask for
pretty json formatting well suited for human reading. This will improve
the debugging experience.
This feature is only made visible in the base and GPU configurations
of QEMU for now.
Fixes: #7952
Signed-off-by: Greg Kurz <groug@kaod.org>
Otherwise `make test` will simply fail with:
```
error[E0583]: file not found for module `config`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This makes it possible to run the tests on the cost-free runners, which
are not KVM capable.
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Otherwise `make test` will simply fail with:
```
error[E0583]: file not found for module `version`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Otherwise `make test` will fail with:
```
error[E0583]: file not found for module `version`
```
Fixes: #7974 -- part 0
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This PR adds support for adding a network device before starting the
cloud-hypervisor VM.
Support for adding and removing network devices is not really added to
the resource manager, so supporting this for cloud-hypervisor is not
scoped in this PR.
This also changes "pending_devices" for clh implementation from an
Option of vector to simply a vector. This simplifies the structure a bit
as we can simply iterate over the pending devices instead of having to
check for a "Some" value as this is not really required.
Fixes: #6333
Signed-off-by: Shuaiyi Zhang <zhang_syi@qq.com>
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Cloud Hypervisor on arm64 only supports the Arm AMBA UART (pl011) as a
tty. So, the console should be set to "ttyAMA0" instead of "ttyS0"
when hypervisor debug mode is enabled.
Fixes: #5080
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
by enabling IOMMU on the default PCI segment. For hotplug to work we need a
virtualized iommu and clh exposes one if there is some device or PCI segment
that requests it. I would have preferred to add a separate PCI segment for
hotplugging vfio devices but unfortunately kata assumes there is only one
segment all over the place. See create_pci_root_bus_path(),
split_vfio_pci_option() and grep for '0000'.
Enabling the IOMMU on the default PCI segment requires enabling IOMMU on
every device that is attached to it, which is why it is sprinkled all over the
place.
CLH does not support IOMMU for VirtioFs, so I've added a non IOMMU segment for
that device.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
There is no way for this branch to be hit, as port is only set when it is
different than config.NoPort.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
These test cases show which options are valid for CLH/Qemu, and test that we
correctly catch unsupported combinations.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The only supported options are hot_plug_vfio=root-port or no-port.
cold_plug_vfio is not supported yet.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
hot_plug_vfio needs to be set to root-port, otherwise attaching vfio devices to
CLH VMs fails. Either cold_plug_vfio or hot_plug_vfio is required, and we have
not implemented support for cold_plug_vfio in CLH yet.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
In RemoveEndpoints(), when the endpoints parameter isn't empty,
using idx may remove the wrong endpoints, because idx refers to the
passed-in list rather than to n.eps. Instead, use each passed-in
endpoint directly to locate the matching element within n.eps.
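A rough sketch of the corrected approach follows. The `endpoint` and `network` types are simplified, hypothetical versions of the real ones, and matching by name is an assumption made to keep the example short.

```go
package main

// endpoint and network are simplified stand-ins for the runtime's types.
type endpoint struct {
	Name string
}

type network struct {
	eps []endpoint
}

// removeEndpoints deletes each requested endpoint by searching for it in
// n.eps, instead of reusing the index it has in the caller-supplied slice.
func (n *network) removeEndpoints(toRemove []endpoint) {
	for _, ep := range toRemove {
		for i, cur := range n.eps {
			if cur.Name == ep.Name {
				// i is an index into n.eps, so this removes the right element.
				n.eps = append(n.eps[:i], n.eps[i+1:]...)
				break
			}
		}
	}
}
```

If the index of the passed-in slice were used instead, asking to remove only the last endpoint (index 0 of the request) would delete the first element of n.eps.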
Fixes: #7732
Signed-off-by: shixuanqing <1356292400@qq.com>
Fixes: #7732
Signed-off-by: shixuanqing <1356292400@qq.com>
Update src/runtime/virtcontainers/network_linux.go
Co-authored-by: Xuewei Niu <justxuewei@apache.org>
- In rust 1.72, clippy warned clippy::non-minimal-cfg
as the cfg has only one condition, so doesn't
need to be wrapped in the any combinator.
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Allow `clippy::redundant-closure-call`
which has issues with the guard function passed into
the `run_if_auto_values` macro
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The bindgen generated code is triggering lots of
ambiguous-glob-reexports warnings in rust 1.70+
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- In rust 1.72, clippy warned clippy::non-minimal-cfg
as the cfg has only one condition, so doesn't
need to be wrapped in the all combinator.
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Allow `clippy::redundant-closure-call` in `from_cmdline`
which has issues with the guard function passed into
the `parse_cmdline_param` macro
Fixes: #7902
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When creating a new endpoint, we check existing endpoint names and automatically adjust the naming of the new endpoint to ensure uniqueness.
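As an illustration, one simple way to pick a unique name is to take the lowest free numeric suffix. The helper below is hypothetical; the actual naming scheme used by the runtime may differ.

```go
package main

import "fmt"

// uniqueEndpointName returns the first name of the form prefix+N (N = 0, 1,
// 2, ...) that does not appear in the list of existing endpoint names.
func uniqueEndpointName(prefix string, existing []string) string {
	used := make(map[string]bool, len(existing))
	for _, name := range existing {
		used[name] = true
	}
	for i := 0; ; i++ {
		candidate := fmt.Sprintf("%s%d", prefix, i)
		if !used[candidate] {
			return candidate
		}
	}
}
```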
Fixes: #7876
Signed-off-by: shixuanqing <1356292400@qq.com>
This commit allows us to specify the huge page backend when enabling huge
page. Currently, we support two backends: thp and hugetlbfs, the default
is hugetlbfs.
To ensure backward compatibility, we introduce another configuration item
"hugepage_type" to select the memory backend, which is available only when
"enable_hugepages" is true. Besides, we add an annotation
"io.katacontainers.config.hypervisor.hugepage_type" to configure huge page
type per pod.
Fixes: #6703
Signed-off-by: Guixiong Wei <weiguixiong@bytedance.com>
Signed-off-by: Yipeng Yin <yinyipeng@bytedance.com>
To support the removal of the `initcall_debug` and `earlyprintk=`
options from the default guest kernel cmdline, add `kernel_params` to the list
of enabled annotations to allow those kernel options (or others) to be
set using `kata-deploy` for either runtime.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Removed the following kernel command line options:
- `earlyprintk=ttyS0`
- `initcall_debug`
Both these options are only useful when debugging a guest kernel failure
which is not a common occurrence.
Further, the `earlyprintk=` option can have a large negative performance
impact (it can increase the VM boot time significantly).
If the user wishes to use either of these options, they can add them to the
`kernel_params=` setting in the Kata configuration file's hypervisor
stanza.
Fixes: #7886.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
1. Directly support CgroupManager::freeze through systemd API.
2. Avoid always passing unit_name by storing it into DBusClient.
3. Implement CgroupManager::destroy more accurately by killing the systemd unit rather than stopping it.
4. Ignore no such unit error when destroying systemd unit.
5. Update zbus version and corresponding interface file.
Acknowledgement: error handling for no such systemd unit error refers to
Fixes: #7080, #7142, #7143, #7166
Signed-off-by: Yuan-Zhuo <yuanzhuo0118@outlook.com>
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
This syntax belongs to the legacy C virtiofsd implementation that
we don't support anymore since kata-containers 3.1.3 because
of other API breaking changes.
People have been warned to switch from "none" to "never" since
kata-containers 2.5.2. Let's officially do that.
The compat code that would convert "none" to "never" isn't
needed anymore. Just drop it.
Fixes: #7864
Signed-off-by: Greg Kurz <groug@kaod.org>
gogo.nullable is the main gogo.protobuf feature used here. Since we are
trying to remove gogo.protobuf, the first reasonable step seems to be to
remove this feature. This is a core update, and it will change how the
structs are defined. I could spot only a few places using those structs,
based on make check/build.
Fixes: #7723.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
There is no reference to PROTO_FILE and this is not working. Also we are
not inside a Makefile, so makes sense to adapt the usage to reflect the
script instead of a make command.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
import_path is used as the default package when no input files specify
go_package. However, all the files we are currently building already
have a go_package definition, making this behavior both redundant and
error-prone.
Additionally, one of our files (types.pb.go) resides outside the grpc
directory, indicating that it's indeed ignored but also inconsistent.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Currently, the script searches for .proto files within $GOPATH/.
Consequently, modifications to a definition file in the current working
directory won't influence the output .pb.go if the directory is outside
of $GOPATH. For developers, it's more intuitive to alter the local
codebase than the version stored in $GOPATH.
With this modification, the generated .pb.go files will be relative to
the current working directory, removing the need to clone this project
under $GOPATH/src/github.com/kata-containers.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
The definitions are already specified in the .proto files using the
go_package option. Centralizing them in one location reduces the
potential for errors and simplifies the script.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
Currently, the dbs-upcall features have two problems that need to be
fixed:
- There are redundant dbs-upcall features that need to be removed.
- Some places should be controlled by dbs-upcall, but this is not implemented.
This commit fixes both problems.
fixes: #6878
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Some use cases may just require passing extra arguments to virtiofsd,
and having this disabled by default makes it impossible to set when
using kata-deploy, as changes in the configuration file would be
overwritten by the daemon-set.
With this in mind, let's allow users to pass whatever they need (and
here I'm specifically looking at `--xattr`) as a virtio_fs_extra_arg.
Fixes: #7853
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Currently, virtio_vsock is still outside of the device
manager. This causes some management issues, such as the
inability to unify PCI address management.
For now, just do the work for hybrid vsock.
Fixes: #7655
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When a storage device is used by more than one container, the second
and subsequent instances leak the storage device's reference count,
and thus leak the storage device itself. The reason is that
add_storages() increases the reference count of an existing storage device
but forgets to add the device to the `mount_list` array, so the reference
count is never released.
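The bookkeeping can be sketched as below. This is a simplified Go model of the logic only, not the agent's code; `storageTracker` and its methods are hypothetical names.

```go
package main

// storageTracker is a hypothetical model of per-sandbox storage refcounts,
// keyed by mount point.
type storageTracker struct {
	refs map[string]int
}

func newStorageTracker() *storageTracker {
	return &storageTracker{refs: map[string]int{}}
}

// addStorages takes a reference on every mount point and returns the mount
// list that the container must later release.
func (s *storageTracker) addStorages(mountPoints []string) []string {
	mountList := make([]string, 0, len(mountPoints))
	for _, mp := range mountPoints {
		s.refs[mp]++
		// Record the mount point even when the device already existed;
		// skipping this for existing devices is what leaks the reference.
		mountList = append(mountList, mp)
	}
	return mountList
}

// removeStorages drops one reference per entry and forgets unused devices.
func (s *storageTracker) removeStorages(mountList []string) {
	for _, mp := range mountList {
		s.refs[mp]--
		if s.refs[mp] <= 0 {
			delete(s.refs, mp)
		}
	}
}
```

Every increment has to be mirrored by an entry in the returned list, otherwise the matching decrement never happens.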
Fixes: #7820
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Use AGENT_POLICY=yes when building the Guest images, and add a
permissive test policy to the k8s tests for:
- CBL-Mariner
- SEV
- SNP
- TDX
Also, add an example of policy rejecting ExecProcessRequest.
Fixes: #7667
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Let's make sure we use the TDX image as part of the QEMU TDX
configuration, which will help us to have the policies tested here.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Update the protection checking code to detect newer versions of Intel
TDX (whose userland interface has now stabilised).
> **Note:** that we don't need to retain the existing behaviour since:
>
> - We haven't yet landed the TDX feature (#6448).
> - Systems wishing to use TDX will need to use the latest available
> system components (such as firmware and host kernel).
Also added an explicit TDX unit test.
Fixes: #7384.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
IoCopy is a tricky function (I don't claim to fully understand its contract),
but here is what I see: The goroutine that runs it spawns 3 goroutines - one
for each stream to handle (stdin/stdout/stderr). The goroutine then waits for
the stream goroutines to exit. The idea is that when the process exits and is
closed, the stdout goroutine will be unblocked and close stdin - this should
unblock the stdin goroutine. The stderr goroutine will exit at the same time as
the stdout goroutine. The iocopy routine then closes all tty.io streams.
The problem is that the stdout goroutine decrements the WaitGroup before
closing the stdin stream, which causes the iocopy goroutine to race to close
the streams. Move the wg.Done() of the stdout routine past the close so that
*this* race becomes impossible. I can't guarantee that this doesn't affect some
unspecified behavior.
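The ordering fix can be illustrated with a reduced, single-stream sketch. The names below (`copyStreams`, `trackedCloser`, `demoCopy`) are hypothetical and the real IoCopy handles three streams; the only point shown is that Done() is signalled after stdin is closed.

```go
package main

import (
	"bytes"
	"io"
	"strings"
	"sync"
)

// trackedCloser records whether Close has been called.
type trackedCloser struct {
	closed bool
}

func (c *trackedCloser) Close() error {
	c.closed = true
	return nil
}

// copyStreams copies stdout to dst in a goroutine and waits for it. The
// goroutine closes stdin before signalling the WaitGroup, so the waiter can
// never observe "done" while stdin is still open.
func copyStreams(stdout io.Reader, dst io.Writer, stdin io.Closer) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		// Deferred, so it runs last: after the copy and after the close.
		defer wg.Done()
		_, _ = io.Copy(dst, stdout)
		_ = stdin.Close()
	}()
	wg.Wait()
	// From here on it is safe to tear down all the streams.
}

// demoCopy runs copyStreams on in-memory streams and reports the result.
func demoCopy() (string, bool) {
	var out bytes.Buffer
	stdin := &trackedCloser{}
	copyStreams(strings.NewReader("hello"), &out, stdin)
	return out.String(), stdin.closed
}
```

With Done() called before the close, the waiting goroutine could start closing streams while the stdout goroutine was still about to close stdin.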
Fixes: #5031
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
If we are running FC hypervisor, it is not started when prestart hooks
are executed. So we should just ignore such error and just go ahead and
run the hooks.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
FC does not support network device hotplug. Let's add a check to fail
early when starting containers created by docker.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Add a new hypervisor capability to tell if it supports device hotplug.
If not, we should run prestart hooks before starting new VMs as nerdctl
is using the prestart hooks to set up netns. To make nerdctl + FC
to work, we need to run the prestart hooks before starting new VMs.
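The resulting ordering decision can be sketched as follows. The capability field and step names are hypothetical, and the assumption that hotplug-capable hypervisors keep running the hooks after the VM has started is inferred from the description above.

```go
package main

// hypervisorCaps is a hypothetical capability set; only the flag relevant
// to this change is modelled.
type hypervisorCaps struct {
	NetworkDeviceHotplug bool
}

const (
	stepPrestartHooks = "run prestart hooks"
	stepStartVM       = "start VM"
)

// startupSteps returns the order in which the two steps are performed.
func startupSteps(caps hypervisorCaps) []string {
	if caps.NetworkDeviceHotplug {
		// Devices created by the hooks can be hotplugged into a running VM.
		return []string{stepStartVM, stepPrestartHooks}
	}
	// Without hotplug, the netns must be set up before the VM is created.
	return []string{stepPrestartHooks, stepStartVM}
}
```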
Fixes: #6384
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
When running on amd machines, those tests will fail because there is no
vmx flag. Following other tests that check for cpuType, let's adapt
them to restrict vmx only on Intel machines.
Fixes: #7788.
Related #5066
Signed-off-by: Beraldo Leal <bleal@redhat.com>
QEMU for TDX 1.5 makes use of private memory map/unmap.
Make changes to govmm to support this. Support for private backing fd
for memory is added as knob to the qemu config.
Userspace's map/unmap operations are done by fallocate() ioctl on the
backing store fd.
Reference:
https://lore.kernel.org/linux-mm/20220519153713.819591-1-chao.p.peng@linux.intel.com/
Fixes: #7770
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The log_forwarder task does not check whether the peer has closed. This causes
a meaningless loop between the moment the peer closes (“kata vm exit”) and the
moment “ShutdownContainer RPC received” aborts the log forwarder.
This patch fixes the problem.
Fixes: #7741
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
There are several processes for container exit:
- Non-detach mode: `Wait` request is sent by containerd, then
`wait_process()` will be called eventually.
- Detach mode: `Wait` request is not sent, the `wait_process()` won’t be
called.
- Killed by ctr: For example, a container runs `tail -f /dev/null`, and
is killed by `sudo ctr t kill -a -s SIGTERM <CID>`. Kill request is
sent, then `kill_process()` will be called. User executes `sudo ctr c
rm <CID>`, `Delete` request is sent, then `delete_process()` will be
called.
- Exited on its own: For example, a container runs `sleep 1s`. The
container’s state goes to `Stopped` after 1 second. User executes
the delete command as below.
Where do we do container cleanup things?
- `wait_process()`: No, because it won’t be called in detach mode.
- `delete_process()`: No, because it depends on when the user executes the
delete command.
- `run_io_wait()`: Yes. A container is considered exited once its IO has ended,
and this is always called once a container is launched.
Fixes: #7713
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Refine storage related code by:
- remove the STORAGE_HANDLER_LIST
- define type alias
- move code near to its caller
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Introduce StorageDevice and StorageHandlerManager, which will be used
to refine storage device management for kata-agent.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Simplify the way to manage storage objects, and introduce
StorageStateCommon structures for coming extensions.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.
Make sure annotations are preferred over config options in image and initrd
path handling.
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
We should make sure annotations are preferred over
config options in image and initrd path handling.
Fixes: #7705
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Right now if we configure an image annotation and have a config file
setting initrd, the initrd config would override the image annotation.
Add a helper function ImageOrInitrdAssetPath to make sure annotations
are preferred over config options in image and initrd path handling.
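The precedence can be sketched as below. This is not the real ImageOrInitrdAssetPath helper: the function name, the `assetConfig` type and the annotation keys are assumptions chosen for illustration.

```go
package main

// Annotation keys assumed for this sketch.
const (
	imageAnnotation  = "io.katacontainers.config.hypervisor.image"
	initrdAnnotation = "io.katacontainers.config.hypervisor.initrd"
)

// assetConfig is a hypothetical view of the config file settings.
type assetConfig struct {
	ImagePath  string
	InitrdPath string
}

// imageOrInitrdPath returns the asset path to boot from and whether it is
// an initrd. Annotations are consulted before the config file, so a
// configured initrd can no longer override an image annotation.
func imageOrInitrdPath(cfg assetConfig, annotations map[string]string) (string, bool) {
	if p := annotations[imageAnnotation]; p != "" {
		return p, false
	}
	if p := annotations[initrdAnnotation]; p != "" {
		return p, true
	}
	if cfg.ImagePath != "" {
		return cfg.ImagePath, false
	}
	return cfg.InitrdPath, cfg.InitrdPath != ""
}
```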
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
When the FileMode field for the device is unset (0), use a default value instead
to allow the use of the device from the container.
This behaviour is seen from cri-o typically.
Note: this is what runc is doing, which is why regular containers don't have an
issue. This change makes sure kata behaves the same as runc.
Fixes: #7717
Signed-off-by: Julien Ropé <jrope@redhat.com>
Introduce structure KataVirtualVolume to encapsulate information
for extra mount options and direct volumes, so we could build a common
infrastructure to handle these cases.
Fixes: #7699
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
When building with AGENT_POLICY=yes and AGENT_INIT=yes:
1. Include OPA and the Policy settings in rootfs.
2. Start OPA from the kata agent.
Before these changes, building with both AGENT_POLICY=yes and
AGENT_INIT=yes was unsupported.
Starting OPA from systemd (when AGENT_INIT=no) was already supported.
Fixes: #7615
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The error message when the kill command is executed with the container's
state == Stopped should be "container not running" because the containerd
tests expect that OCI runtimes return the error message and compare it.
If the error message is different from the expected one, the tests fail.
Fixes: #7650
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
We extend the `Result` and `Option` types with associated types that
allow converting a `Result<T, E>` and `Option<T>` into
`ttrpc::Result<T>`.
This allows the elimination of many `match` statements in favor of
calling the map function plus the `?` operator. This transformation
simplifies the code.
Fixes: #7624
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Fixes: #7573
To enable this feature, build your rootfs using AGENT_POLICY=yes. The
default is AGENT_POLICY=no.
Building rootfs using AGENT_POLICY=yes has the following effects:
1. The kata-opa service gets included in the Guest image.
2. The agent gets built using AGENT_POLICY=yes.
After this patch, the shim calls SetPolicy if and only if a Policy
annotation is attached to the sandbox/pod. When creating a sandbox/pod
that doesn't have an attached Policy annotation:
1. If the agent was built using AGENT_POLICY=yes, the new sandbox uses
the default agent settings, that might include a default Policy too.
2. If the agent was built using AGENT_POLICY=no, the new sandbox is
executed the same way as before this patch.
Any SetPolicy calls from the shim to the agent fail if the agent was
built using AGENT_POLICY=no.
If the agent was built using AGENT_POLICY=yes:
1. The agent reads the contents of a default policy file during sandbox
start-up.
2. The agent then connects to the OPA service on localhost and sends
the default policy to OPA.
3. If the shim calls SetPolicy:
a. The agent checks if SetPolicy is allowed by the current
policy (the current policy is typically the default policy
mentioned above).
b. If SetPolicy is allowed, the agent deletes the current policy
from OPA and replaces it with the new policy it received from
the shim.
A typical new policy from the shim doesn't allow any future SetPolicy
calls.
4. For every agent rpc API call, the agent asks OPA if that call
should be allowed. OPA allows or denies a call based on the current
policy, the name of the agent API, and the API call's inputs. The
agent rejects any calls that are rejected by OPA.
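The flow in steps 3 and 4 can be modelled with a toy in-memory policy. Real policies are evaluated by OPA and also take the call's inputs into account; here a policy is reduced to a set of allowed API names, and all type and function names are hypothetical.

```go
package main

import "errors"

// policy is a toy model: the set of agent API names that are allowed.
type policy map[string]bool

// agent holds the policy currently in force.
type agent struct {
	current policy
}

// call models step 4: every API call is checked against the current policy.
func (a *agent) call(api string) error {
	if !a.current[api] {
		return errors.New(api + " rejected by the current policy")
	}
	return nil
}

// setPolicy models step 3: the request is itself subject to the current
// policy, and on success the old policy is replaced by the new one.
func (a *agent) setPolicy(next policy) error {
	if err := a.call("SetPolicyRequest"); err != nil {
		return err
	}
	a.current = next
	return nil
}
```

A default policy that allows SetPolicyRequest lets the shim install its own policy once; if the new policy omits SetPolicyRequest, any later attempt to replace it is rejected.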
When building using AGENT_POLICY_DEBUG=yes, additional Policy logging
gets enabled in the agent. In particular, information about the inputs
for agent rpc API calls is logged in /tmp/policy.txt, on the Guest VM.
These inputs can be useful for investigating API calls that might have
been rejected by the Policy. Examples:
1. Load a failing policy file test1.rego on a different machine:
opa run --server --addr 127.0.0.1:8181 test1.rego
2. Collect the API inputs from Guest's /tmp/policy.txt and test on the
machine where the failing policy has been loaded:
curl -X POST http://localhost:8181/v1/data/agent_policy/CreateContainerRequest \
--data-binary @test1-inputs.json
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the installation step in the virtcontainers doc
because the virtcontainers install/uninstall targets have
been removed by 86723b51ae
and they are not used anymore.
Fixes: #7637
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Remove the configuration file `shared_fs = none` warnings
now that there is a solution for updating configMaps, secrets etc.
Fixes: #7210
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
This patch allows copying of directories and symlinks when
static file copying is used between host and guest. This change is
necessary to support recursive file copying between shim and agent.
Signed-off-by: Yohei Ueda <yohei@jp.ibm.com>
(cherry picked from commit de232b8030)
For remote hypervisor, the configmap, secrets, downward-api or project-volumes are
copied from host to guest. This patch watches for changes to the host files
and copies the changes to the guest.
Note that configmap updates take significantly longer than updates via downward-api.
This is similar across runc and Kata runtimes.
Fixes: #7210
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: Julien Ropé <jrope@redhat.com>
(cherry picked from commit 3081cd5f8e)
(cherry picked from commit 68ec673bc4d9cd853eee51b21a0e91fcec149aad)
This patch upgrades Firecracker version from v1.1.0 to v1.4.0.
* Generate swagger models for v1.4.0 (from `firecracker.yaml`)
- The version of go-swagger used is v0.30.0
* The firecracker v1.4.0 includes the following changes.
- Added
* Added support for custom CPU templates allowing users to adjust vCPU features
exposed to the guest via CPUID, MSRs and ARM registers.
* Introduced V1N1 static CPU template for ARM to represent Neoverse V1 CPU
as Neoverse N1.
* Added support for the virtio-rng entropy device. The device is optional. A
single device can be enabled per VM using the /entropy endpoint.
* Added a cpu-template-helper tool for assisting with creating and managing
custom CPU templates.
- Changed
* Set FDP_EXCPTN_ONLY bit (CPUID.7h.0:EBX[6]) and ZERO_FCS_FDS bit
(CPUID.7h.0:EBX[13]) in Intel's CPUID normalization process.
- Fixed
* Fixed feature flags in T2S CPU template on Intel Ice Lake.
* Fixed CPUID leaf 0xb to be exposed to guests running on AMD host.
* Fixed a performance regression in the jailer logic for closing open file
descriptors.
* Fixed a race condition between the API thread and the VMM thread caused
by a misconfiguration of the api_event_fd.
* Fixed CPUID leaf 0x1 to disable perfmon and debug feature on x86 host.
* Fixed passing through cache information from host in CPUID leaf 0x80000006.
* Fixed the T2S CPU template to set the RRSBA bit of the IA32_ARCH_CAPABILITIES
MSR to 1 in accordance with an Intel microcode update.
* Fixed the T2CL CPU template to pass through the RSBA and RRSBA bits of the
IA32_ARCH_CAPABILITIES MSR from the host in accordance with an Intel microcode
update.
* Fixed passing through cache information from host in CPUID leaf 0x80000005.
* Fixed the T2A CPU template to disable SVM (nested virtualization).
* Fixed the T2A CPU template to set EferLmsleUnsupported bit
(CPUID.80000008h:EBX[20]), which indicates that EFER[LMSLE] is not supported.
Fixes: #7610
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
The fd passed through the unix socket can be any stream fd, such as a
pipe/fifo fd or another socket fd, so we should handle it as a normal
hybrid stream instead of a unix stream.
Fixes: #7584
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
There are many places where the code currently creates new `Vec`
instances when it's not really needed. The result is a perf hit because
it allocates memory, copies all elements, then frees the memory; in some
cases, copying elements also involves extra allocations (e.g., when
elements are strings, or structs containing strings).
This patch addresses a number of these cases.
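A minimal sketch of the general pattern, not taken from the patch itself: the function names and the "ro" option are made up for illustration. A function that takes a `Vec` by value forces callers that still need their data to clone it, while a slice parameter borrows the existing elements.

```rust
/// Takes ownership of the `Vec`; a caller that still needs its data
/// must clone the vector (and every `String` in it) first.
pub fn count_readonly_owned(options: Vec<String>) -> usize {
    options.iter().filter(|o| o.as_str() == "ro").count()
}

/// Borrows the elements as a slice; no allocation or copying is needed.
pub fn count_readonly_borrowed(options: &[String]) -> usize {
    options.iter().filter(|o| o.as_str() == "ro").count()
}
```

Both functions return the same result; only the second lets the caller keep using its vector without paying for a copy.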
Fixes: #7203
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Refine implementation of mount by:
- log message with `path.display()` instead of `{:?}`
- add prefix "_" to unused variables
- pass by reference instead of by value to avoid creating redundant
array
- exactly matching prefix "fsgid=" instead of "fsgid"
- avoid redundant clone() operations
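To illustrate why the exact prefix matters, here is a small sketch; `parse_fsgid` is a hypothetical helper, not the agent's actual function. Matching on "fsgid" alone would also accept unrelated options that merely start with those letters.

```rust
/// Hypothetical helper: return the gid from an `fsgid=<gid>` mount option.
/// Options that only share the "fsgid" prefix, or that carry a
/// non-numeric value, yield `None`.
pub fn parse_fsgid(option: &str) -> Option<u32> {
    option.strip_prefix("fsgid=")?.parse().ok()
}
```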
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
There's a bug in function update_ephemeral_mounts() which only handles
the first storage object and ignores all other storage objects.
Fixes: #7551
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Simplify function online_cpu_memory() by only calling update_cpuset_path()
for containers with cpuset configured.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Refine style of code related to sandbox by:
- remove unnecessary comments asking the caller to take the lock, since we
already take `&mut self`.
- change "*count < 1 " to "*count == 0", since `count` is of type u32.
- make remove_sandbox_storage() take `&mut self` instead of `&self`.
- group related functions together
- avoid searching the map twice in function find_process()
- avoid unwrap() in function run_oom_event_monitor()
- avoid unwrap() in online_resources()
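The "search the map twice" item can be sketched as follows. The `Process` struct and both functions are hypothetical stand-ins, not the agent's real types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the agent's process type.
pub struct Process {
    pub pid: i32,
}

/// Two hash lookups: `contains_key()` followed by `get_mut()`.
pub fn find_twice<'a>(
    map: &'a mut HashMap<String, Process>,
    id: &str,
) -> Option<&'a mut Process> {
    if map.contains_key(id) {
        return map.get_mut(id);
    }
    None
}

/// One hash lookup: `get_mut()` already returns `None` for a missing key.
pub fn find_once<'a>(
    map: &'a mut HashMap<String, Process>,
    id: &str,
) -> Option<&'a mut Process> {
    map.get_mut(id)
}
```

Returning the `Option` directly also removes the need for an `unwrap()` after the existence check.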
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Avoid unwrap() in function do_remove_container(), and also make
implementation symmetric for both timeout and non-timeout cases.
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
Optimize agent rpc implementation by:
- avoid cloning objects when possible
- avoid unwrap() when possible
- explicitly drop objects to ensure order
Signed-off-by: Jiang Liu <gerry@linux.alibaba.com>
This pull request is mainly for updating vm-memory and vmm-sys-util.
The affected crates include:
- vm-memory: from 0.9.0 to 0.10.0
- vmm-sys-util: from 0.10.0 to 0.11.0
- virtio-queue: from 0.6.0 to 0.7.0
- fuse-backend-rs: from 0.10.4 to 0.10.5
- linux-loader: from 0.6.0 to 0.8.0
- nydus-api: from 0.3.0 to 0.3.1
- nydus-rafs: from 0.3.1 to 0.3.2
- nydus-storage: from 0.6.3 to 0.6.4
Fixes: #0000
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
These calls cause two extra atomic instructions each time they're used,
one to increment and another one to decrement the refcount.
Since we don't need them because the referred value is guaranteed to
outlive the function, remove the calls.
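Assuming the calls in question are reference-count clones such as `Arc::clone()`, the difference can be sketched like this. The function names are illustrative only.

```rust
use std::sync::Arc;

/// Clones the `Arc`: one atomic increment here, and one atomic
/// decrement when `local` is dropped at the end of the function.
pub fn len_with_clone(data: &Arc<String>) -> usize {
    let local = Arc::clone(data);
    local.len()
}

/// Borrows through the existing `Arc`: no refcount traffic at all.
/// This is safe because the caller's `Arc` outlives the call.
pub fn len_borrowed(data: &Arc<String>) -> usize {
    data.len()
}
```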
Fixes: #7190
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
When the mounted block device isn't a layer, we want to mount it into
containers, but since it's already mounted with the correct fs (e.g.,
tar, ext4, etc.) in the pod, we just bind-mount it into the container.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
When at least one `io.katacontainers.fs-opt.layer` option is added to
the rootfs, it gets inserted into the VM as a layer, and the file system
is mounted as an overlay of all layers using the overlayfs driver.
Additionally, if the `io.katacontainers.fs-opt.block_device=file` option
is present in a layer, it is mounted as a block device backed by a file
on the host.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This causes the overlay-fs driver to add the `upperdir` and `workdir`
options to an overlay-fs mount so that the mount becomes writable using
a discardable directory under the container id.
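A rough sketch of how such an option string could be assembled. `overlay_options` is a hypothetical helper, and the `upper`/`work` subdirectory names are assumptions; only the `lowerdir`, `upperdir` and `workdir` option names come from overlayfs itself.

```rust
/// Hypothetical helper: build overlayfs mount options from read-only
/// layers and an optional scratch directory for the writable part.
pub fn overlay_options(layers: &[&str], scratch: Option<&str>) -> String {
    // Read-only layers are passed as a colon-separated `lowerdir` list.
    let mut opts = format!("lowerdir={}", layers.join(":"));
    if let Some(dir) = scratch {
        // Adding `upperdir` and `workdir` makes the overlay writable.
        opts.push_str(&format!(",upperdir={dir}/upper,workdir={dir}/work"));
    }
    opts
}
```

Without a scratch directory the overlay stays read-only, which corresponds to a mount with only `lowerdir` set.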
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
This is so that file systems don't fail when we pass kata-specific
options from the snapshotter to kata.
Fixes: #7536
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Version 0.10.5, which was just released, breaks `nydus-storage`.
This is a workaround to fix the CI which is blocking other PRs.
Fixes: #7541
Signed-off-by: Wedson Almeida Filho <walmeida@microsoft.com>
Allow `clippy::redundant_clone` in the agent's unit tests
because rustc>=1.70 reports false positives.
These `clone()` calls are required because the code that follows
still refers to the variable, but clippy flags them by mistake
because its analysis is conservative and limited.
Ref. https://rust-lang.github.io/rust-clippy/master/index.html#/redundant_clone
Fixes: #7534
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Kata containers, being VM-based, can run in the host netns, because the
network can still be isolated at L2. Network performance benefits from
this architecture, which eliminates as many hops as possible. We call it
a Directly Attachable Network (DAN for short).
The network devices are placed in the host netns by the CNI plugins. The
configs are saved at {dan_conf}/{sandbox_id}.json in JSON format,
including device name, type, and network info. In this initial stage,
DAN only supports host tap devices. More devices, like DPDK, will
be supported in later versions.
The file format looks like this:
```json
{
  "netns": "/path/to/netns",
  "devices": [{
    "name": "eth0",
    "guest_mac": "xx:xx:xx:xx:xx",
    "device": {
      "type": "vhost-user",
      "path": "/tmp/test",
      "queue_num": 1,
      "queue_size": 1
    },
    "network_info": {
      "interface": {
        "ip_addresses": ["192.168.0.1/24"],
        "mtu": 1500,
        "ntype": "tuntap",
        "flags": 0
      },
      "routes": [{
        "dest": "172.18.0.0/16",
        "source": "172.18.0.1",
        "gateway": "172.18.31.1",
        "scope": 0,
        "flags": 0
      }],
      "neighbors": [{
        "ip_address": "192.168.0.3/16",
        "device": "",
        "state": 0,
        "flags": 0,
        "hardware_addr": "xx:xx:xx:xx:xx"
      }]
    }
  }]
}
```
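For illustration, locating the per-sandbox file described above might look like the sketch below. `dan_config_path` is a hypothetical helper and the directory used in any call is a made-up value; only the `{dan_conf}/{sandbox_id}.json` layout comes from this commit message.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: build the path of the DAN config file for a
/// sandbox, following the `{dan_conf}/{sandbox_id}.json` layout.
pub fn dan_config_path(dan_conf: &Path, sandbox_id: &str) -> PathBuf {
    dan_conf.join(format!("{sandbox_id}.json"))
}
```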
Fixes: #1922
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>