- Fix the lint error:
```
error: you seem to use `.enumerate()` and immediately discard the index
--> src/device_manager/mod.rs:427:33
|
427 | for (_index, device) in self.virtio_devices.iter().enumerate() {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
by removing the unnecessary enumerate
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The ci failed with:
```
error: use of `or_insert_with` to construct default value
--> src/address_space_manager.rs:650:14
|
650 | .or_insert_with(NumaNode::new);
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: try: `or_default()`
|
```
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We take advantage of the Inner pattern to enable QemuInner::resize_vcpu()
take `&mut self` which we need to call non-const functions on Qmp.
This runs on Intel architecture but will need to be verified and ported
(if necessary) to other architectures in the future.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The QMP_SOCKET_FILE constant in cmdline_generator.rs is made public to make
it accessible from QemuInner. This is fine for now however if the constant
needs to be accessed from additional places in the future we could consider
moving it to somewhere more visible.
The Debug impl for Qmp is empty since first, we don't actually want it,
it's only forced by Hypervisor trait bounds, and second, it doesn't have
anything to display anyway. If Qmp gets any members in the future that
can be meaningfully displayed they should be handled by Qmp's Debug::fmt().
Signed-off-by: Pavel Mores <pmores@redhat.com>
The constructor handles QMP connection initialisation, too, so there can
be non-functional Qmp instance.
Signed-off-by: Pavel Mores <pmores@redhat.com>
We set the VERSION variable consistently across Makefiles to
'unknown' if the file is empty or not present.
We also use git commands consistently for calculating the COMMIT,
COMMIT_NO variables, not erroring out when building outside of
a git repository.
In create_summary_file we also account for a missing/empty VERSION
file.
This makes e.g. the UVM build process in an environment where we
build outside of git with a minimal/reduced set of files smoother.
Signed-off-by: Manuel Huber <mahuber@microsoft.com>
When the total number of files observed is greater than limit, return -1 directly.
runtime has fixed this bug, it should b ported to runtime-rs.
Fixes:#9829
Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
It shouldn't call the initial_size_manager's setup_config
in the load_config since it had been called in the sandbox's
try_init function.
Fixes: #9778
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
For kata container, the container's pid is meaning less to
containerd/crio since the container's pid is belonged to VM,
and containerd/crio couldn't use it. Thus we just return any
tid of kata shim or hypervisor. But since the hypervisor had
been stopped before deleting the container, and it wouldn't
get the hypervisor's tid for some supported hypervisor, thus
we'd better to return the kata shim's pid instead of hypervisor's
tid.
Fixes: #9777
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
While running with a remote hypervisor, whenever kata-monitor tries to access
metrics from the shim, the shim does a "panic" and no metric can be gathered.
The function GetVirtioFsPid() is called on metrics gathering, and had a call
to "panic()". Since there is no virtiofs process for remote hypervisor, the
right implementation is to return nil. The caller expects that, and will skip
metrics gathering for virtiofs.
Fixes: #9826
Signed-off-by: Julien Ropé <jrope@redhat.com>
This corrects the warning to point to the \`-j\` flag,
which is the correct flag for the JSON settings file.
Previously, the warning was confusing, as it pointed to
the \`-p\` flag, which specifies to the path for the Rego ruleset.
Signed-off-by: Moritz Sanft <58110325+msanft@users.noreply.github.com>
This patch re-generates the client code for Cloud Hypervisor v39.0.
Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by openapi-generator.
Fixes: #8694, #9574
Signed-off-by: Bo Chen <chen.bo@intel.com>
- Update the config parsing logic so that when reading from the
agent-config.toml file any envs are still processed
- Add units tests to formalise that the envs take precedence over values
from the command line and the config file
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When the total number of files observed is greater than limit, return (-1, err).
When the returned err is not nil, the func countFiles should return -1.
Fixes:#9780
Signed-off-by: gaohuatao <gaohuatao@bytedance.com>
fixes#9810
Add an annotation to the enum values in the agent config that will
deserialize them using a kebab-case conversion, aligning the behaviour
to parsing of params specified via kernel cmdline.
drive-by fix: add config override for guest_component_procs variable
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
(1) As mis-use of cap.set causing previous Caps lost which
causing assert! failed, just replacing cap.set with cap.add.
(2) It will return error if there's no such name setting when
do update_config_by_annotation {
...
if config.runtime.name.is_empty() {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
"Runtime name is missing in the configuration",
));
}
...
}
Fixes#9783
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This is required to provide the hashes of kernel, initrd and cmdline
needed during the attestation of the coco.
Fixes: #9150
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
CreateContainerRequest objects can specify devices to be created inside
the guest VM. This change ensures that requested devices have a
corresponding entry in the PodSpec.
Devices that are added to the pod dynamically, for example via the
Device Plugin architecture, can be allowlisted globally by adding their
definition to the settings file.
Fixes: #9651
Signed-off-by: Markus Rudy <mr@edgeless.systems>
This adds structs and fields required to parse PodSpecs with
VolumeDevices and PVCs with non-default VolumeModes.
Signed-off-by: Markus Rudy <mr@edgeless.systems>
Since yq frequently updates, let's upgrade to a version from February to
bypass potential issues with versions 4.41-4.43 for now. We can always
upgrade to the newest version if necessary.
Fixes#9354
Depends-on:github.com/kata-containers/tests#5818
Signed-off-by: Beraldo Leal <bleal@redhat.com>
golang.mk is not ready to deal with non GOPATH installs. This is
breaking test on s390x.
Since previous steps here are installing go and yq our way, we could
skip this aditional check. A full refactor to golang.mk would be needed
to work with different paths.
Signed-off-by: Beraldo Leal <bleal@redhat.com>
fixes#9748
A configuration option `guest_component_procs` has been introduced that
indicates which guest component processes are supposed to be spawned by
the agent. The default behaviour remains that all of those processes are
actively spawned by the agent. At the moment this is based on presence
of binaries in the rootfs and the guest_component_api_rest option.
The new option is incremental:
none -> attestation-agent -> confidential-data-hub -> api-server-rest
e.g. api-server-rest implies attestation-agent and confidential-data-hub
the `none` option has been removed from guest_component_api_rest, since
this is addresses by the introduced option.
To not change expected behaviour for non-coco guests we still will still
only attempt to spawn the processes if the requested attestation binaries
are present on the rootfs, and issue in warning in those cases.
Signed-off-by: Magnus Kulke <magnuskulke@microsoft.com>
Add the CLI flag --runtime-class-names, which is used during
policy generation. For resources that can define a
runtimeClassName (e.g., Pods, Deployments, ReplicaSets,...)
the value must have any of the --runtime-class-names as
prefix, otherwise the resource is ignored.
This allows to run genpolicy on larger yaml
files defining many different resources and only generating
a policy for resources which will be deployed in a
confidential context.
Signed-off-by: Leonard Cohnen <lc@edgeless.systems>
It creates this line, as the Golang runtime does:
-object rng-random,id=rng0,filename=/dev/urandom -device virtio-rng-pci,rng=rng0
Signed-off-by: Emanuel Lima <emlima@redhat.com>
We need to remove the device from the tracking map, a container
restart will increment the bus index and we will get out of root-ports
and crash the machine.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
We need special handling for pod_sandbox, pod_container and
single_container how and when to inject CDI devices
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
In Kubernetes we still do not have proper VM sizing
at sandbox creation level. This KEP tries to mitigates
that: kubernetes/enhancements#4113 but this can take
some time until Kube and containerd or other runtimes
have those changes rolled out.
Before we used a static config of VFIO ports, and we
introduced CDI support which needs a patched contianerd.
We want to eliminate the patched continerd in the GPU case
as well.
Fixes: #8860
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
This option allows to add a debug monitor socket when
`enable_debug = true` to control QEMU within debugging case.
Fixes: #9603
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The `update_env_pci()` function need the PCI address mapping to
translate the host PCI address to guest PCI address in below
environment variables:
- PCIDEVICE_<prefix>_<resource-name>_INFO
- PCIDEVICE_<prefix>_<resource-name>
So collect PCI address mapping for both vfio-pci-gk and
vfio-pci devices.
Fixes#9614
Signed-off-by: Lei Huang <leih@nvidia.com>
Replace tab indentation with spaces for the three lines within the ifneq statements, aligning them with the surrounding code.
Fixes:#9692
Signed-off-by: sidneychang <2190206983@qq.com>
The `dial_timeout` works fine for Runtime-go, but is obsoleted in
Runtime-rs.
When the pod cannot connect to the Agent upon starting, we need to adjust
the `reconnect_timeout_ms` to increase the number of connection attempts to
the Agent.
Fixes: #9688
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
For the TD attestation to work the connection to QGS on the host is needed.
By default QGS runs on vsock port 4050, but can be modified by the host owner.
Format of the qemu object follows the SocketAddress structure, so it needs to be provided in the JSON format, as in the example below:
-object '{"qom-type":"tdx-guest","id":"tdx","quote-generation-socket":{"type":"vsock","cid":"2","port":"4050"}}'
Fixes: #9497
Signed-off-by: Jakub Ledworowski <jakub.ledworowski@intel.com>
The new version of sriov-network-device-plugin adds an env
`PCIDEVICE_<prefix>_<resource-name>_INFO`, which has a json
value; kata-agent can't parse it as env
`PCIDEVICE_<prefix>_<resource-name>` which has value in format
"DDDD:BB:SS.F".
This change updates env `PCIDEVICE_<prefix>_<resource-name>_INFO`.
Signed-off-by: Lei Huang <leih@nvidia.com>
Remove `rand.Seed` call to resolve the following failure:
```
rand.Seed is deprecated: As of Go 1.20 there is no reason to call Seed with a random value.
```
The go rand.Seed docs: https://pkg.go.dev/math/rand@go1.20#Seed
back this up and states:
> If Seed is not called, the generator is seeded randomly at program startup.
so I believe we can just delete the call.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
For now, let's allow the users to set the default_cpu and default_memory
when using TDX, as they may hit issues related to the size of the
container image that must be pulled and unpacked inside the guest,
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
- Make due to us bumping the golang version used in our CI
but `make vendor` fails without the go version in the runtime go.mod
being increased, so update this and run go mod tidy
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Fixes: #9532
Close stdin when write_stdin receives data of length 0.
Stop call notify_term_close() in close_stdin, because it could
discard stdout unexpectedly.
Signed-off-by: Tim Zhang <tim@hyper.sh>
We shouldn't be using 9p, at all, with TEEs, as off right now we have no
way to ensure the channels are encrypted. The way to work this around
for now is using guest pull, either with containerd + nydus snapshotter
or with CRI-O; or even tardev snapshotter for pulling on the host (which
is the approach used by MSFT).
This is only done for TDX for now, leaving the generic, AMD, and IBM
related stuff for the folks working on those to switch and debug
possible issues on their environment.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
In Kubernetes, the following values for namespace are equivalent and all refer to the default namespace:
- ` ` (namespace field missing)
- `namespace: ""` (namespace field is the empty string)
- `namespace: "default"`(namespace field has the explicit value `default`)
Genpolicy currently does not handle the empty string case correctly.
Signed-Off-By: Malte Poll <1780588+malt3@users.noreply.github.com>
Implementation of QemuCmdLine has a fairly uniform and repetitive structure
that's guided by a set of conventions. These conventions have however been
mostly implicit so far, leading to a superfluous and annoying
request/force-push churn during qemu-rs PR reviews.
This commit aims to make things explicit so that contributors can take them
into account before an initial PR submission.
Signed-off-by: Pavel Mores <pmores@redhat.com>
ResizeMemory for Cloud Hypervisor is missing a check for the new
requested memory being greater than the max hotplug size after
alignment. Add the check, and since an earlier check for this
setsrequested memory to the max hotplug size, do the same in the
post-alignment check.
Fixes#9640
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
All these settings are hardcoded as `false` and result in
no extra options on the QEMU command line, like the go
runtime does. There actually not needed :
- we're never going to ask QEMU to survive a guest shutdown
- we're never going to run QEMU daemonized since it prevents
log collection
- we're never going to ask QEMU to start with the guest stopped
No need to keep this code around then.
Signed-off-by: Greg Kurz <groug@kaod.org>
By default, when a container is created with the `--privileged` flag,
all devices in `/dev` from the host are mounted into the guest. If
there is a block device(e.g. `/dev/dm`) followed by a generic
device(e.g. `/dev/null`),two identical block devices(`/dev/dm`)
would be requested to the kata agent causing the agent to exit with error:
> Conflicting device updates for /dev/dm-2
As the generic device type does not hit any cases defined in `switch`,
the variable `kataDevice` which is defined outside of the loop is still
the value of the previous block device rather than `nil`. Defining `kataDevice`
in the loop fixes this bug.
Signed-off-by: cncal <flycalvin@qq.com>
RTC was being built in a wrong fashion on commit #2bc5e3c6e2ab0145fa9e8be95df0d5086c07a517
RTC was being constructed inside the QemuCmdLine struct,
but it should've been built inside the devices vector.
Signed-off-by: Emanuel Lima <emlima@redhat.com>
This was needed when we were using an old (and not maintained anymore)
host stack. Considering what we have as part of the distros, Today,
this can simply be dropped, as I cannot find any reference of this one
being needed in any up-to-date documentation.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit b7cccfa019.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit 582b5b6b19.
The `private=on` bit has never made its way upstream, and was removed
from the latest iteration that we're using. With that in mind, let's
revert its addition, and later on its usage in the code.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the PLACEHOLDER_FOR_DISTRO_{QEMU,OVMF}_WITH_TDX_SUPPORT
variables instead of actually setting a path, so we can easily replace
those as part of our deployment scripts.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Everytime I create contianer on arm64 machine, containerd/kata logs a redundant warning
as follows:
``` shell
time="2024-05-07" level=warning msg="<nil>" arch=arm64 name=containerd-shim-v2
pid=xxx sandbox=fdd1f05 source=virtcontainers/hypervisor
```
I added an error statement so that the error would be logged when it occurs.
Signed-off-by: cncal <flycalvin@qq.com>
We should init and asign the runtime instance to runtime
handler, otherwise, if the pause container failed to start,
which means the runtime instance failed to start, then the
following delete & shutdown request wouldn't be run, thus
the dead shim would be left.
Fixes: #9597
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
dragonball reserves 2048G of mmio space for the pci root bus by default
on physical addresses greater than 4G. However, for some machines with
smaller physical address widths, such as 39-bit wide physical addresses,
dragonball reserves the mmio space when initializing the memory. It is
less than 2048G, so this commit dynamically calculates and allocates the
mmio size of each pci root bus.
Fixes: #9509
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
The CH v39 upgrade in #9575 is currently blocked because of a bug in the
Mariner host kernel. To address this, we temporarily tweak the Mariner
CI to use an Ubuntu host and the Kata guest kernel, while retaining the
Mariner initrd. This is tracked in #9594.
Importantly, this allows us to preserve CI for genpolicy. We had to
tweak the default rules.rego however, as the OCI version is now
different in the Ubuntu host. This is tracked in #9593.
This change has been tested together with CH v39 in #9588.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The build message shows that yq was not found when I tried to build
runtime binaries, but I've actually installed yq by yum install.
Signed-off-by: cncal <flycalvin@qq.com>
If device /dev/kvm does not exist, kata-runtime check would fail with
an ambiguous error messae 'no such file or directory'. I added a little
more details to make it understandable and it will belike:
```
ERRO[0000] cannot open kvm device: no such file or directory arch=arm64 check-type=full device=/dev/kvm name=kata-runtime pid=2849085 source=runtime
ERRO[0000] no such file or directory arch=arm64 name=kata-runtime pid=2849085 source=runtime
no such file or directory
```
Signed-off-by: cncal <flycalvin@qq.com>
- The testVersionString logic use regex to check that the ociVersion is
displayed correctly, but with the new go module that version has a
`+` in, so we need to quote this to escape special characters
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Created a new configuration to configure Kata for CoCo without requiring TEE
hardware so to allow developers implement/test/debug platform agnostic code
on their workstations. It will also ease testing of CoCo features on CI with
non-TEE supported VMs.
This is based off qemu configuration. The following differences applied:
- switched to confidential guest image/initrd
- switched to confidential kernel
- switched to 9p shared_fs
Fixes#9487
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Since OPA binary was replaced by the regorus crate, we can finally stop
building and shipping the binary.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
iommu_platform support was already added on initial DeviceVhostUserFs
introduction, however it incorrectly enabled iommu_platform also on
non-CCW (e.g. PCI) systems.
Signed-off-by: Pavel Mores <pmores@redhat.com>
iommu_platform is only turned on for CCW systems.
PartialEq is added to VirtioBusType to enable the '==' operator.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The adding itself is done by a new function add_iommu() that conforms with
the add_*() convention. Note though that this function is called
internally, by the QemuCmdLine constructor, simply because there's nothing
to trigger its invocation from QemuInner (unlike the other add_*()
functions so far).
Signed-off-by: Pavel Mores <pmores@redhat.com>
Remove reminder to initialize Policy earlier, because currently there
are no plans to initialize earlier.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Implement Agent Policy using the regorus crate instead of the OPA
daemon.
The OPA daemon will be removed from the Guest rootfs in a future PR.
Fixes: #9388
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Lock anyhow version to 1.0.58 because:
- Versions between 1.0.59 - 1.0.76 have not been tested yet using
Kata CI. However, those versions pass "make test" for the
Kata Agent.
- Versions 1.0.77 or newer fail during "make test" - see
https://github.com/kata-containers/kata-containers/issues/9538.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
In case of block devices using virtio-block, we need to pass the
pci-path as the storage source field to the agent.
Current the virt-path is being passed which works just for mmio block
devices.
In the future when support is added for scsi, block-ccw and pmem
devices, the storage source would need to be handled accordingly.
Fixes: #9034
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
In do_create_container and do_exec_process, we should create the proc_io first,
in case there's some error occur below, thus we can make sure
the io stream closed when error occur.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Fixes: #9334
In linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic.
To avoid this problem, we open stdin in write mode and keep the stdin-writer
We need to open the stdout/stderr as the read mode and keep the open endpoint
until the process is delete. otherwise,
the process would exit before the containerd side open and read
the stdout fifo, thus runD would write all of the stdout contents into
the stdout fifo and then closed the write endpoint. Then, containerd
open the stdout fifo and try to read, since the write side had closed,
thus containerd would block on the read forever.
Here we keep the stdout/stderr read endpoint File in the common_process,
which would be destroied when containerd send the delete rpc call,
at this time the containerd had waited the stdout read return, thus it
can make sure the contents in the stdout/stderr fifo wouldn't be lost.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
In scenario passfd-io, we should wait for stdin to close itself
instead of manually intervening in it.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
This patch adds a default implementation for the use_sandbox_pidns
and updates the structs that implement the K8sResource trait to use
the default.
Fixes: #8960
Signed-off-by: Archana Choudhary <archana1@microsoft.com>
- Provide default implementation for get_sandbox_name in K8sResource trait
- Remove default implementation from structs implementing the trait K8sResource
Fixes: #8960
Signed-off-by: Archana Choudhary <archana1@microsoft.com>
concurrently with itself
Based on 3a1461b0a5186a92afedaaea33ff2bd120d1cea0
Previously the tool would use the layers_cache folder for all instances
and hence delete the cache when it was done, interfereing with other
instances. This change makes it so that each instance of the tool will
have its own temp folder to use.
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
updates golang.org/x/net to newer version that closes some reported
vulnerabilities and security issues
Fixes#9486
Signed-off-by: Adil Sadik <sparky.005@gmail.com>
1. EPOLLHUP events also need to be read and will be got len 0.
2. We should kill the connection when EPOLLERR events are received.
Signed-off-by: Tim Zhang <tim@hyper.sh>
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
To make `qemu-runtime-rs` working for CI, we have to rename a configuration
template file and `CONFIG_FILE_QEMU` in Makefile.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Linux kernel generates a panic when the init process exits.
The kernel is booted with panic=1, hence this leads to a
vm reboot.
When used as a service the kata-agent service has an ExecStop
option which does a full sync and shuts down the vm.
This patch mimicks this behavior when kata-agent is used as
the init process.
Fixes: #9429
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
isClhRunning uses signal 0 to test whether the process is
still alive or not. This doesn't work because the process is a
direct child of the shim. Once it is dead the process becomes
zombie.
Since no one waits for it the process lingers until
its parent dies and init reaps it. Hence sending signal 0 in
isClhRunning will always return success whether the process is
dead or not.
This patch calls wait to reap the process, if it succeeds that
means it is our child process, if not we send the signal.
Fixes: #9431
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
genpolicy is a handy tool to use in CI systems, to prepare workloads
before applying them to the Kubernetes API server. However, many modern
build systems like Bazel or Nix restrict network access, and rightfully
so, so any registry interaction must take place on localhost.
Configuring certificates for localhost is tricky at best, and since
there are no privacy concerns for localhost traffic, genpolicy should
allow to contact some registries insecurely. As this is a runtime
environment detail, not a target environment detail, configuring
insecure registries does not belong into the JSON settings, so it's
implemented as command line flags.
Fixes: #9008
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
CRIs don't always use a pause container, but even if they do the
concrete container choice is not specified. Even if the CRI config can
be tweaked, it's not guaranteed that registries in the public internet
can be reached. To be portable across CRI implementations and
configurations, the genpolicy user needs to be able to configure the
container the tool should append to the policy.
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
Decouple initialization of the Settings struct from creating the
AgentPolicy struct, so that the settings are available for evaluating,
extending or overriding command line arguments.
Signed-off-by: Markus Rudy <webmaster@burgerdev.de>
This patch introduces a one-time cpath to mitigate the cgroup residuals. It
might break the device cgroup merging rules when the cgroup has children.
Fixes: #9456
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
- Add v1 image test case
- Install protobuf-compiler in build check
- Reset containerd config to default in kubernetes test if we are testing genpolicy
- Update docker_credential crate
- Add test that uses default pull method
- Use GENPOLICY_PULL_METHOD in test
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Add optional toggle to use existing containerd installation to pull and manage container images.
This adds support to a wider set of images that are currently not supported by standard pull method,
such as those that use v1 manifest.
Fixes: #9144
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
If 'rest_api' is configured, let's start the api-server-rest after
the attestation-agent and the confidential-data-hub have been started.
Fixes: #7555
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Add configuration for 'rest api server'.
Optional configurations are
'agent.rest_api=attestation' will enable attestation api
'agent.rest_api=resource' will enable resource api
'agent.rest_api=all' will enable all (attestation and resource) api
Fixes: #7555
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Let's introduce a new method to start the confidential data hub and the
attestation agent. The former depends on the later, and it needs to be
started before the RPC server.
Starting the attestation components is based on whether the confidential
containers guest components binaries are found in the rootfs.
Fixes: #7544
Signed-off-by: Biao Lu <biao.lu@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Linda Yu <linda.yu@intel.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Jakob Naucke <jakob.naucke@ibm.com>
Co-authored-by: Wang, Arron <arron.wang@intel.com>
Co-authored-by: zhouliang121 <liang.a.zhou@linux.alibaba.com>
Co-authored-by: Alex Carter <alex.carter@ibm.com>
Co-authored-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Co-authored-by: Xynnn007 <xynnn@linux.alibaba.com>
Support different pause images in the guest for guest-pull, such as k8s
pause image (registry.k8s.io/pause) and openshift pause image (quay.io/bpradipt/okd-pause).
Fixes: #9225 -- part III
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Let's rely on the kvm module 'tdx' parameter to do so.
This aligns with both OSVs (Canonical, Red Hat, SUSE) and the TDX
adoption (https://github.com/intel/tdx-linux) stacks.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This commit is a mess, but I'm not exactly sure what's the best way to
make it less messy, as we're getting QEMU TDX to work while partially
reverting 1e34220c41.
With that said, let me cover the content of this commit.
Firstly, we're reverting all the changes related to
"memory-backend-memfd-private", as that's what was used with the
previous host stack, but it seems it
didn't fly upstream.
Secondly, in order to get QEMU to properly work with TDX, we need to
enforce the 'private=on' knob and use the "memory-backend-ram", and
we're doing so, and also making sure to test the `private=on` newly
added knob.
I'm sorry for the confusion, I understand this is not optimal, I just
don't see an easy path to do changes without leaving the code broken
during those changes.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This reverts commit d1b54ede29.
Conflicts:
src/runtime/virtcontainers/qemu.go
This commit was a hack that was needed in order to get QEMU + TDX to
work atop of the stack our CI was running on. As we're moving to "the
officially supported by distros" host OS, we need to get rid of this.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The private=on|off knob is required in order to properly lauunch a TDX
guest VM.
This is a brand new property that is part of the still in-flight patches
adding TDX support on QEMU.
Please, see:
3fdd8072da
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
We used to utilize go runtime's "NumCPUs()", which will give the number
of cores available to the Go runtime, which may be a subset of physical
cores if the shim is started from within a cpuset. From the function's
description:
"NumCPU returns the number of logical CPUs usable by the current
process."
As an example, if containerd is run from within a smaller CPUset, the
maximum size of a pod will be dictated by this CPUset, instead of what
will be available on the rest of the system.
Since the shim will be moved into its own cgroup that may have a
different CPUset, let's stick with checking physical cores. This also
aligns with what we have documented for maxVCPU handling.
In the event we fail to read /proc/cpuinfo, let's use the goruntime.
Fixes: #9327
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Try to reduce duplicated code in decrease_attach_count with public
new function do_decrease_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Try to reduce duplicated code in increase_attach_count with public
new function do_increase_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Introduce a dedicated public function do_decrease_count to
reduce duplicated code in drivers' decrease_attach_count.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Since there are many implementations of reference counting in the
drivers, all of which have the same implementation, we should try
to reduce such duplicated code as much as possible. Therefore, a
new function is introduced to solve the problem of duplicated code.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Configure the system to mount cgroups-v2 by default during system boot
by the systemd system, We must add systemd.unified_cgroup_hierarchy=1
parameter to kernel cmdline, which will be passed by kernel_params in
configuration.toml.
To enable cgroup-v2, just add systemd.unified_cgroup_hierarchy=true[1]
to kernel_params.
Fixes: #9336
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When there's a pod with multiple containers, there may be case that
attach point more than 2, we should not return Err in that case when
we are doing attach ops, but just return Ok.
Fixes: #8738
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
This reverts commit 1c5693be86.
Avoid apparent infinite loop when ReadStreamRequest is blocked by
policy - for some of the pods.
When running the k8s-limit-range.bats test with Policy enabled,
the Shim + VMM never get terminated on my cluster. Not sure why
the sandbox clean-up works better for other tests, but the
k8s-limit-range test pod gets stuck in an infinite loop:
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
stdout io stream copy error happens: error = %wrpc error: code =
PermissionDenied desc = \"ReadStreamRequest is blocked by policy
...
policy check: ReadStreamRequest
...
Fixes: #9380
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since we no longer use the service_offload configuration,
remove the ServiceOffload field from the image struct.
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
These experimental options were added 2 years ago
in anticipation of features that would be added
in CoCo. These do not match the features that were
eventually added and will soon be ported to main.
Fixes: #8047
Signed-off-by: Tobin Feldman-Fitzthum <tobin@ibm.com>
This is the report from `make check`:
```
error: this loop never actually loops
--> src/signal.rs:147:9
|
147 | / loop {
148 | | select! {
149 | | _ = handle => {
150 | | println!("INFO: task completed");
... |
156 | | }
157 | | }
| |_________^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#never_loop
= note: `#[deny(clippy::never_loop)]` on by default
```
There is only one option: you get something or a timeout. You never retry, so
the report is correct.
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
The lint report is the following:
```
error: `flatten()` will run forever if the iterator repeatedly produces an `Err`
--> src/rpc.rs:1754:10
|
1754 | .flatten()
| ^^^^^^^^^ help: replace with: `map_while(Result::ok)`
|
note: this expression returning a `std::io::Lines` may produce an infinite number of `Err` in case of a read error
--> src/rpc.rs:1752:5
|
1752 | / reader
1753 | | .lines()
| |________________^
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#lines_filter_map_ok
= note: `-D clippy::lines-filter-map-ok` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::lines_filter_map_ok)]`
```
This commit simply applies the suggestion.
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
Running `make check` in the `src/agent` directory gives:
```
error: you seem to use `.enumerate()` and immediately discard the index
--> rustjail/src/mount.rs:572:27
|
572 | for (_index, line) in reader.lines().enumerate() {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#unused_enumerate_index
= note: `-D clippy::unused-enumerate-index` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::unused_enumerate_index)]`
help: remove the `.enumerate()` call
|
572 | for line in reader.lines() {
| ~~~~ ~~~~~~~~~~~~~~
Checking tokio-native-tls v0.3.1
Checking hyper-tls v0.5.0
Checking reqwest v0.11.18
error: could not compile `rustjail` (lib) due to 1 previous error
warning: build failed, waiting for other jobs to finish...
make: *** [../../utils.mk:177: standard_rust_check] Error 101
```
Fixes: #9342
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
support to configure CreateContainerRequestTimeout in the
configurations.
e.g.:
[runtime]
...
create_container_timeout = 300
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Most of the content of `docs/Stable-Branch-Strategy.md` got de-facto
deprecated by the re-design of the release process described in #9064.
Remove this file and all its references in the repo.
The `## Versioning` section has some useful information though. It is
moved to `docs/Release-Process.md`. The documentation of the `PATCH`
field is adapted according to new workflow.
Fixes#9064 - part VI
Signed-off-by: Greg Kurz <groug@kaod.org>
Support to configure CreateContainerRequestTimeout in the annotations.
e.g.:
annotations:
"io.katacontainers.config.runtime.create_container_timeout": "300"
Note: The effective timeout is determined by the lesser of two values: runtime-request-timeout from kubelet config
(https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/#:~:text=runtime%2Drequest%2Dtimeout) and create_container_timeout.
In essence, the timeout used for guest pull=runtime-request-timeout<create_container_timeout?runtime-request-timeout:create_container_timeout.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
In the situation to pull images in the guest #8484, it’s important to account for pulling large images.
Presently, the image pull process in the guest hinges on `CreateContainerRequest`, which defaults to a 60-second timeout.
However, this duration may prove insufficient for pulling larger images, such as those containing AI models.
Consequently, we must devise a method to extend the timeout period for large image pull.
Fixes: #8141
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
The remaining code in network.rs was mostly moved to utils.rs which seems
better home for these utility functions anyway (and a closely related
function open_named_tuntap() has already lived there).
ToString implementation for Address was removed after some consideration.
Address should probably ideally implement Display (as per RFC 565) which
would also supply a ToString implementation, however it implements Debug
instead, probably to enable automatic implementation of Debug for anything
that Address is a member of, if for no other reason. Rather than having
two identical functions this commit simply switches to using the Debug
implementation for printing Address on qemu command line.
Fixes#9352
Signed-off-by: Pavel Mores <pmores@redhat.com>
Make file descriptors to be passed to qemu owned by QemuCmdLine. See
commit 52958f17cd for more explanation.
Signed-off-by: Pavel Mores <pmores@redhat.com>
add_network_device() doesn't need to be passed NetworkInfo since it
already has access to the full HypervisorConfig.
Also, one of the goals of QemuCmdLine interface's design is to avoid
coupling between QemuCmdLine and the hypervisor crate's device module,
if at all possible. That's why add_network_device() shouldn't take
device module's NetworkConfig but just parts that are useful in
add_network_device()'s implementation.
Signed-off-by: Pavel Mores <pmores@redhat.com>
is_running_in_vm() is enough to figure out whether to disable_modern but
it's clumsy and verbose to use. should_disable_modern() streamlines the
usage by encapsulating the verbosity.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit replaces the existing NetDevice-based implementation with one
using Netdev and DeviceVirtioNet.
Signed-off-by: Pavel Mores <pmores@redhat.com>
In keeping with architecture of QemuCmdLine implementation we split the
functionality into two objects: Netdev to represent and generate the
-netdev part and DeviceVirtioNet for the -device virtio-net-<transport>
part.
This change is a pure refactor, existing functionality does not change.
However, we do remove some stub generalizations and govmm-isms, notably:
- we remove the NetDev enum since the only network interface types that
kata seems to use with qemu are tuntap and macvtap, both of which are
implemented by the same -netdev tap
- enum DeviceDriver is also left out since it doesn't seem reasonable to
try to represent VFIO NICs (which are completely different from
virtio-net ones) with the same struct as virtio-net
- we also remove VirtioTransport because there's no use for it so far, but
with the expectation that it will be added soon.
We also make struct Netdev the owner of any vhost-net and queue file
descriptors so that their lifetime is tied ultimately to the lifetime of
QemuCmdLine automatically, instead of returning the fds to the caller and
forcing it to achieve the equivalent functionality but manually.
Signed-off-by: Pavel Mores <pmores@redhat.com>
generate_netdev_fds() takes NetworkConfig from which it however only needs
a host-side network device name. This commit makes it take the device name
directly, making the function useful to callers who don't have the whole
NetworkConfig but do have the requisite device name.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The idea of this function is to make sure O_CLOEXEC is not set on file
descriptors that should be inherited by a child (=hypervisor) process.
The approach so far is however rather heavy-handed - clearing *all* flags
is unjustifiably aggresive for a low-level function with no knowledge of
context whatsoever.
This commit refactors the function so that it only does what's expected
and renames it accordingly. It also clarifies some of its call sites.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Kata CI has full debug output enabled for the cbl-mariner k8s tests,
and the test AKS node is relatively slow. So debug prints from policy
are expensive during CI.
Fixes: #9296
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change scripts and source that uses files in the tests repo to use the
corresponding file in the current repo.
Fixes#9165
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Currently, `.lock().await.clone()` results in `Option<ImageService>` being duplicated in memory with each call to `singleton()`.
Consequently, if kata-agent receives numerous image pulling requests simultaneously,
it will lead to the allocation of multiple `Option<ImageService>` instances in memory, thereby consuming additional memory resources.
In image.rs, we introduce two public functions:
`merge_bundle_oci()` and `init_image_service()`. These functions will encapsulate
the operations on `IMAGE_SERVICE`, ensuring that its internal details remain
hidden from external modules such as `rpc.rs`.
Fixes: #9225 -- part II
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
It is observed that virtiofsd exits immediately on s390x
if there is no attached console devices.
This commit resolves the issue by migrating `appendConsole()`
from runtime and being triggered in `start_vm()`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For s390x, it requires an additional option `memory-backend` for `-machine`.
Otherwise, virtiofsd exits with HandleRequest(InvalidParam).
This commit is to add a field `memory_backend` to `struct Machine`
and turn it on for s390x.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Like nvdimm for x86_64, a block device for s390x should be
treated differently with `virtio-blk-ccw`.
This is to generate a QEMU command line parameter for a block
device by using `-blockdev` and `-device` if the `vm_rootfs_driver`
is set to `virtio-blk-ccw`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add in the full details once cloud-hypervisor/cloud-hypervisor#6103
has been implemented, and the feature is available in a Cloud Hypervisor
release.
Fixes: #8799
Signed-off-by: David Esparza <david.esparza.borquez@intel.com>
Currently, `*-pci` is used as an argument for the device config.
It is not true for a case where a different type of bus is used.
s390x uses `ccw`.
This commit is to make it flexible to generate the device argument
based on the bus type. A structure `DeviceVhostUserFsPci` and
`VhostVsockPci` is renamed to `DeviceVhostUserFs` and `VhostVsock`
because the structure name is not bound to a certain bus type any more.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It has been observed that the runtime stops running around
`sysinfo::total_memory()` while adjusting a config on s390x.
This is to update the crate to the latest version which happened
to resolve the issue. (No explicit release note for this)
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
MTRR, or Memory-Type Range Registers are a group of x86 MSRs providing a way to control access
and cache ability of physical memory regions.
During our test in runtime-rs + Dragonball, we found out that this register support is a must
for passthrough GPU running CUDA application, GPU needs that information to properly use GPU memory.
fixes: #9310
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
When the https_proxy/no_proxy settings are configured alongside agent-policy enabled, the process of pulling image in the guest will hang.
This issue could stem from the instantiation of `reqwest`’s HTTP client at the time of agent-policy initialization,
potentially impacting the effectiveness of the proxy settings during image guest pulling.
Given that both functionalities use `reqwest`, it is advisable to set https_proxy/no_proxy prior to the initialization of agent-policy.
Fixes: #9212
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Enable to build kata-agent with PULL_TYPE feature.
We build kata-agent with guest-pull feature by default, with PULL_TYPE set to default.
This doesn't affect how kata shares images by virtio-fs. The snapshotter controls the image pulling in the guest.
Only the nydus snapshotter with proxy mode can activate this feature.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
To support handle image-guest-pull block volume from different CRIs, including cri-o and containerd.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
By default the pause image and runtime config will provided
by host side, this may have potential security risks when the
host config a malicious pause image, then we will use the pause
image packaged in the rootfs.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Julien Ropé <jrope@redhat.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Add "guest-pull" feature option to determine that the related dependencies
would be compiled if the feature is enabled.
By default, agent would be built with default-pull feature, which would
support all pull types, including sharing images by virtio-fs and
pulling images in the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
support to pass image information to guest by KataVirtualVolumeImageGuestPullType
in KataVirtualVolume, which will be used to pull image on the guest.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
As we do not employ a forked containerd in confidential-containers, we utilize the KataVirtualVolume
which storing the image information as an integral part of `CreateContainer`.
Within this process, we store the image information in rootfs.storage and pass this image url through `CreateContainerRequest`.
This approach distinguishes itself from the use of `PullImageRequest`, as rootfs.storage is already set and initialized at this stage.
To maintain clarity and avoid any need for modification to the `OverlayfsHandler`,we introduce the `ImagePullHandler`.
This dedicated handler is responsible for orchestrating the image-pulling logic within the guest environment.
This logic encompasses tasks such as calling the image-rs to download and unpack the image into `/run/kata-containers/{container_id}/images`,
followed by a bind mount to `/run/kata-containers/{container_id}`.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
When being passed an image name through a container annotation,
merge its corresponding bundle OCI specification and process into the passed container creation one.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Co-authored-by: jordan9500 <jordan.jackson@ibm.com>
Containerd can support set a proxy when downloading images with a environment variable.
For CC stack, image download is offload to the kata agent, we need support similar feature.
Current we add https_proxy and no_proxy, http_proxy is not added since it is insecure.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
With image-rs pull_image API, the downloaded container image layers
will store at IMAGE_RS_WORK_DIR, and generated bundle dir with rootfs
and config.json will be saved under CONTAINER_BASE/cid directory.
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Co-authored-by: Arron Wang <arron.wang@intel.com>
Co-authored-by: Jiang Liu <gerry@linux.alibaba.com>
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
Introduce structure ImageService, which will be used to pull images
inside the guest.
Fixes: #8103
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
co-authored-by: wllenyj <wllenyj@linux.alibaba.com>
co-authored-by: stevenhorsman <steven@uk.ibm.com>
File descriptors that are passed to QEMU need some special care.
We want them to be closed when the QEMU process is started. But
at the same time, it is required that the associated rust File
structures, either coming from the` std::fs` or the `tokio::fs`
crates, are still in scope when the QEMU process is forked. This
is currently achieved by keeping File structures in variables
at the outer scope of `start_vm()`. This scheme is currently
duplicated, with similar justifications in the corresponding
comments.
Consolidate all this handling in one place with a more generic
explanation.
Fixes#9281
Signed-off-by: Greg Kurz <groug@kaod.org>
The agent now has a number of optional build-time features that can be
enabled.
Add details of these features to the following areas:
- Version output (`kata-agent --version`)
- Announce message (so that the details are always added to the journal
at agent startup).
- The response message returned by the ttRPC `GetGuestDetails()` API.
Fixes: #9285.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Some previous contribution missed to run cargo clippy.
Fix the dependency now so that it doesn't cause noise
in future contributions.
Signed-off-by: Greg Kurz <groug@kaod.org>
Fixes: #9269
From https://github.com/opencontainers/runtime-spec/blob/main/config.md#mounts
type (string, OPTIONAL) The type of the filesystem to be mounted.
bind may be only specified in the oci spec options -> flags update r#type
The agent will ignore bind mounts if they are only specified in the OCI spec options and not in the flags.
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
s390x supports a different machine type `s390-ccw-virtio` and it is
not required to configure cpu features by default for the platform.
A hypervisor `dragonball` is not supported on s390x so that `DBCMD`
is not necessary. `vm-rootfs_driver` should be set to `virtio-blk-ccw`.
This commit is to set the architecture-specific flags for Makefile.
Fixes: #9158
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The guest_hook_path item in configuration.toml allows OCI hook scripts
to be executed within Kata's guest environment. Traditionally, these
guest hook programs are pre-built and included in Kata's guest rootfs
image at a fixed location.
While setting guest_hook_path = "/usr/share/oci/hooks" in configuration.toml
works, it lacks flexibility. Not all guest hooks reside in the path
/usr/share/oci/hooks, and users might have custom locations.
To address this, a more flexible and configurable approach is to be proposed
that allows users to specify their desired path. This could include using a
sandbox bind mount path for hooks specific to that particular container.
However, The current implementation of guest hooks and bind mounts in kata-agent
has a reversed order of execution compared to the desired behavior.
To achieve the intended functionality, we simply need to swap the order of their
implementation.
Fixes: #9274
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Fixes: #9267
The doc states we have support for all lifecycle hooks. There are still some missing.
This is the first issue regarding the CreateContainer hook which is run before pivot_root but after prestart and createruntime
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
The open_named_tuntap function is designed as a public function to
open a tuntap device with the specified name. However, in order to
reference existing methods in dbs_utils, we still need to keep the
reference "path = "../../../dragonball/src/dbs_utils" in dependencies
and cannot hide it.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add network helpers and impl ToQemuParams trait to build
netdev params which are put into cmdline for Qemu VM running.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need ensure the add_network_device happens in netns and
move qemu process into netns which keeps the qemu process
running in this net namespace.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add network device handler in start_vm, which is sepcially
for Qemu VM running with added net params to command line.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
We need add a new netns field in struct QemuInner, and
initialize it with argument passed down in prepare_vm().
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The enter_netns function is designed as a public method to help
VMMs running as a independent process enter a network namespace,
reducing duplicate code.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
It just move the related code to a public file(utils.rs) and make
it a common method for both vsock and network, or some others.
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Signed-off-by: Pavel Mores <pmores@redhat.com>
In order to better support non-builtin vmm usage of NetnsGuard and
reduce code duplication, we need to move it to a common path that
can be referenced by both hypervisor and resource manager.
In this patch, it just do moving code from network/utils/netns.rs
to kata-sys-utils/src/netns.rs
Fixes: #8865
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The PID needs to be initialized before calling isClhRunning.
waitVMM() uses isClhRunning and is called by launchClh() just
before returning from function.
Fixes: #9230
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
Use containerd's default environment for container images that don't
specify the Env field.
Also, re-enable policy env variable verification, now that these
uncommon images are supported too.
Fixes: #9239
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This change updates the module import to use 'util' instead of the deprecated 'io/util'
Fixes: #9166
Signed-off-by: Chungeun Choi <ce.choi@okestro.com>
copyBuffer returns and the streams will be closed when error occurs.
If the error contains "blocked by policy" it means the log output is
disabled by policy with "ReadStreamRequest" and "WriteStreamRequest" set
to false. But at this moment, we want the real stream still working (not
be seen) because we might want to enable logging for debugging purpose,
so we repeat copybuffer in this case to avoid streams being closed.
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
logging/debugging information might probably be disabled in production
due to security consideration, but we'd better provide an approach for
customer to get logging information during runtime, this PR implement
setpolicy function in kata-runtime tools, although it can set whole policy
other than logging.
setpolicy would evokes remote attestation, which means before setting
policy during runtime, user has to reconfigure new policy hash in KBS/AS.
usage: kata-runtime policy set policy.rego --sandbox-id XXXXXXXX
Fixes: #8797
Signed-off-by: Linda Yu <linda.yu@intel.com>
Disable env variable verification to unblock CI, until container
images that don't specify the Env variables will be handled correctly
(see #9239).
Also, mark the image config Env field as optional, thus allowing
policy generation for these container images.
Fixes: #9240
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
a) There is some unknown syscalls triggered in new github virt machine
that would break the make test process with SIGSYS after applying
SeccompFilter. In order to fix this, we change the allowlist in this
unit test for seccompfileter into a blocklist to avoid meeting the unknown syscalls.
b) lazy static METRICS is not fully initialize in the unit test and may lead to
unstable result for this UT.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
the mmap region start guest addr hard-code a value and later there
would be check whether the mentioned addr is larger than or equal
to mem_end (default to host_phy_mem >> 1) in order to satisfy the
requirement for DaxMemory. Since github virt machine phy_mem is larger
than previous CI machine we use, the hard-code value could no longer be
worked. To fix this, we change the address to mem_end in unit test to
avoid the influence of host machine change.
fixes: #9207
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This fixes a panic on tracing on container exit.
The root cause is that global var needs to be set by "=" instead of
":=".
Fixes: #9102
Signed-off-by: Liu Bo <liub.liubo@gmail.com>
Add all agent configuration options to README so that users can more easily understand
what these options do and how to configure them at runtime.
Fixes: #9109
Signed-off-by: ChengyuZhu6 <chengyu.zhu@intel.com>
Relax the timeout for calling CLH's CreateVM + BootVM APIs. When
hitting the older 1s timeout, killing a half-booted Guest and
retrying the same boot sequence could have been wasteful and resulting
in unstable CI testing on slower Hosts.
Fixes: #9152
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
For consistency with the go runtime.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
agent network metrics (kata_guest_netdev_stat).
Signed-off-by: Julien Ropé <jrope@redhat.com>
As part of the shim network metrics, the shim is reporting network interfaces
from the host with no namespace isolation - this gives insight in interfaces
not tied to the kata containers, and causes an increase in resource usage for
kata metrics.
As the shim itself is not using the network (all its communication with
other processes is done with local unix sockets), there is no reason to
keep gathering and reporting shim-specific network metrics.
Actual network usage of the kata containers can be found from the existing
hypervisor network metrics (kata_hypervisor_netdev) and from the agent
network metrics (kata_guest_netdev_stat).
Fixes: #5738
Signed-off-by: Julien Ropé <jrope@redhat.com>
We need initailize the pci_hotplug_enabled with true before we do GPU
passthrough with runtime-rs/dragonball. Otherwise it fails with error
`InvalidOperation`.
Fixes: #9129
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
When cgroup v2 is in use, a container should only see its part of the
unified hierarchy in `/sys/fs/cgroup`, not the full hierarchy created
at the OS level. Similarly, `/proc/self/cgroup` inside the container
should display `0::/`, rather than a full path such as :
0::/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podde291f58_8f20_4d44_aa89_c9e538613d85.slice/crio-9e1823d09627f3c2d42f30d76f0d2933abdbc033a630aab732339c90334fbc5f.scope
What is needed here is isolation from the OS. Do that by running the
container in its own cgroup namespace. This matches what runc and
other non VM based runtimes do.
Fixes#9124
Signed-off-by: Greg Kurz <groug@kaod.org>
In case an error is encountered while removing a network endpoint during
network cleanup, we cuurently return immediately with the error.
With this change, in case of error we simply log the error and proceed
towards removing the next endpoint. With this, we can cleanup the
network changes made by the shim as much as possible.
This is especially important when multiple interfaces are passed to the
network namespace using a network plugin like multus.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Move the defer for cleaning up network before the call to add network.
This way if any change made by add network is reverted by in case of
failure. This is particulary important for physical network interfaces
as with this step we make sure that driver for the physical interface is
reverted back to the original host driver. Without this the physical
network iterface will remain bound to vfio.
Fixes: #8646
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Although we don't seem to be affected by
https://nvd.nist.gov/vuln/detail/CVE-2024-21626, we vendor and use the
runc package in a few different places of our code, and we better update
the package to its latest release.
Fixes: #9097
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Last but not least, all placeholders for argument replacement
should be configured to generate a configuration file when `QEMUCMD`
is defined. This enriches those variables.
Additionally, this involves creating a symbolic link to `configuration-qemu.toml`
if QEMU is defined as the default hypervisor.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
There are some variables newly introduced to runtime-rs, such as:
- runtime.name
- runtime.hypervisor_name
- runtime.agent_name
- vm_rootfs_driver
Additionally some of the placeholders for argument replacement are
made hypervisor-specific based on the changes made for dragonball.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
For example, Kata CI's k8s-copy-file.bats transfers files between the
Host and the Guest using "kubectl exec", and that results in
CloseStdinRequest being called from the Host.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
1. Remove PullImageRequest because that is not used in the main
branch. It was used in the CCv0 branch.
2. Add default false values for the remaining Kata Agent ttrpc
requests.
These changes don't change the functionality of the auto generated
Policy, but they help with easier understanding the Policy text and
the logging from the Rego rules.
Fixes: #9049
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Since cri-o doesn't seem to use address for event publishing as mentioned
in the previous commit it will not send it. However, the exact way of
not sending it is unfortunately different from what is assumed by
runtime-rs. Due to an implementation detail of cri-o which uses containerd
libraries for some low-level tasks, TTRPC_ADDRESS will not be missing from
environment as assumed, instead it will be present with an empty value.
This commit contains a small adjustment to account for that and use
LogForwarder even if TTRPC_ADDRESS is present, but with an empty value.
Fixes#8985
Signed-off-by: Pavel Mores <pmores@redhat.com>
This is needed to fix the bug which is not allowing to create SEV container
on SNP enabled host anymore. This is a regression that was introduced as
part of the following commit:
de39fb7d38Fixes: #9036
Signed-off-by: Niteesh Dubey <niteesh@us.ibm.com>
It makes sense to reuse a configuration template for runtime-golang
as a base. This is simply to copy it into the config directory.
Fixes: #8441
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It appears that under the shim v2 protocol, a shim has no use of its own
for the -address value, it just passes it back to container runtime's
(mostly containerd or cri-o) event-publishing binary. Since the -address
value only flows through the shim, being passed to the shim by a container
runtime and then essentially passed back by shim to the container runtime,
it seems inappropriate for a shim to validate the value that is fully
owned and only used by the container runtime.
This commit removes such validation from runtime-rs. Doing so, it solves
(part of) an interoperability problem between runtime-rs and cri-o. cri-o
seems to intentionally choose not to implement the event-publishing part
of the shim v2 protocol and thus it has no value it could pass to
runtime-rs for -address. As a result, it sends an empty string which has
been failing the excessive validation performed by runtime-rs so far.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The emergent Kata CI tests for Policy use confidential_guest = false
in genpolicy-settings.json. That value is inconsistent with the
following mount settings:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
We need to keep those settings for confidential_guest = true, and
change confidential_guest = false to use:
"emptyDir": {
"mount_type": "local",
"mount_source": "^$(cpath)/$(sandbox-id)/rootfs/local/",
"mount_point": "^$(cpath)/$(sandbox-id)/local/",
"driver": "local",
"source": "local",
"fstype": "local",
"options": [
"mode=0777"
]
},
The value of the mount_source field is different.
This change unblocks testing using Kata CI's pod-empty-dir.yaml:
genpolicy -u -y pod-empty-dir.yaml
kubectl apply -f pod-empty-dir.yaml
k get pod sharevol-kata
NAME READY STATUS RESTARTS AGE
sharevol-kata 1/1 Running 0 53s
Fixes: #8887
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Implementing Persist API for cloud-hypervisor was done partially with
initial support for cloud-hypervisor. Store and retrieve additional
fields to/from the hypervisor state.
Fixes: #6202
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
The original handling method does not reach user expectations.
When the ClientSocketAddress method stats the corresponding
path of runtime-rs and has not found it yet, we should return
an error message here that includes the reason for the failure
(which should be an error display indicating that both runtime-go
and runtime-rs were not found). Instead of simply displaying the
corresponding path of runtime-rs as the final error message to
users.
It is also necessary to return the error promptly to the caller
for further error handling.
Fixes: #8999
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Now that we have a confidential image / initrd being built, instead of a
specific one for each TEE, let's use it everywhere possible.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
As we're building a single confidential kernel, we should rely on it
rather than keep using the specific ones for TDX / SEV / SNP.
However, for debugability-sake, let's do this change TEE by TEE.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
With this we can properly generate and the the `-confidential` kernel,
which supports SEV / SNP / TDX as part of our configuration files.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Until now, runtime-rs couldn't be compiled on s390x.
We need to lift those restrictions in Makefile first.
Fixes: #8446
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It fails to compile virt_container because Dragonball is only
used in the implementation of the trait method Persist::restore().
As the hypervisor is not compiled on s390x and QEMU implements
the trait method, this commit is to let the method use QEMUi's.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Dragonball and cloud-hypervisor are not supported on s390x.
We need to exclude the plugins for these hypervisors from compilation.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This patch can reduce load on systemd process, and
increase the k8s deployment density when using go runtime.
Fixes: #8758
Signed-off-by: Zhigang Wang <wangzhigang17@huawei.com>
Signed-off-by: Liu Wenyuan <liuwenyuan9@huawei.com>
Kata CI's pod-sandbox-vcpus-allocation.yaml ends with "---", so the
empty YAML document following that line should be ignored.
To test this fix:
genpolicy -u -y pod-sandbox-vcpus-allocation.yaml
Fixes: #8895
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Allow users to specify in genpolicy-settings.json a default cluster
namespace other than "default". For example, Kata CI uses as default
namespace: "kata-containers-k8s-tests".
Fixes: #8976
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
There is a race condition in agent HVSOCK_STREAMS hashmap, where a
stream may be taken before it is inserted into the hashmap. This patch
add simple retry logic to the stream consumer to alleviate this issue.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Linux forbids opening an existing socket through /proc/<pid>/fd/<fd>,
making some images relying on the special file /dev/stdout(stderr),
/proc/self/fd/1(2) fail to boot in passfd io mode, where the
stdout/stderr of a container process is a vsock socket.
For back compatibility, a pipe is introduced between the process
and the socket, and its read end is set as stdout/stderr of the
container process instead of the socket. The agent will do the
forwarding between the pipe and the socket.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
In passfd io mode, when not using a terminal, the stdout/stderr vsock
streams are directly used as the stdout/stderr of the child process.
These streams are non-blocking by default.
The stdout/stderr of the process should be blocking, otherwise
the process may encounter EAGAIN error when writing to stdout/stderr.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
We want the io connection keep connected when the containerd closed
the io pipe, thus it can be attached on the io stream.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Support the hybrid fd passthrough mode with passing pipe fd,
which can specify this connection kept even when the pipe
peer closed, and this connection can be reget wich re-opening
the pipe.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
In linux, when a FIFO is opened and there are no writers, the reader
will continuously receive the HUP event. This can be problematic
when creating containers in detached mode, as the stdin FIFO writer
is closed after the container is created, resulting in this situation.
In passfd io mode, open stdin fifo with O_RDWR|O_NONBLOCK to avoid the
HUP event.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When container exits, the agent should clean up the term master fd,
otherwise the fd will be leaked.
Fixes: kata-containers#6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
When one end of the connection close, the epoll event will be triggered
forever. We should close the connection and kill the connection.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Currently in the kata container, every io read/write operation requires
an RPC request from the runtime to the agent. This process involves
data copying into/from an RPC request/response, which are high overhead.
To solve this issue, this commit utilize the vsock fd passthrough, a
newly introduced feature in the Dragonball hypervisor. This feature
allows other host programs to pass a file descriptor to the Dragonball
process, directly as the backend of an ordinary hybrid vsock connection.
The runtime-rs now utilizes this feature for container process io. It
open the stdin/stdout/stderr fifo from containerd, and pass them to
Dragonball, then don't bother with process io any more, eliminating
the need for an RPC for each io read/write operation.
In passfd io mode, the agent uses the vsock connections as the child
process's stdin/stdout/stderr, eliminating the need for a pipe
to bump data (in non-tty mode).
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Two toml options, `use_passfd_io` and `passfd_listener_port` are introduced
to enable and configure dragonball's vsock fd passthrough io feature.
This commit is a preparation for vsock fd passthrough io feature.
Fixes: #6714
Signed-off-by: Zixuan Tan <tanzixuan.me@gmail.com>
Using custom input paths with -i is counter-intuitive. Simplify path handling with explicit flags for rules.rego and genpolicy-settings.json.
Fixes: #8568
Signed-Off-By: Malte Poll <1780588+malt3@users.noreply.github.com>
When creating a cgroup, add a SingleContainer when obtaining the OCI Spec to apply to ctr, podman, etc.
Fixes: #5240
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Removed the setting of default values for runtime fields. Added explicit checks for missing or empty fields, reporting errors with clear messages.
Fixes: #8838
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
The `noop-method-call` is a rustc lint that has existed since v1.52.0.
This lint has been moved to the warn by default lint level since v1.73.0.
Therefore build is failing with this version and above.
This commit removes the unnecessary call to `<&T as Deref>::deref` on `T: !Deref`.
Fixes: #8586
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
The auto-generated Policy already allows these volumes to be mounted,
regardless if they are:
- Present, or
- Missing and optional
Fixes: #8893
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Qemu stderr monitoring runs in its own asynchronous green thread.
For that, `stderr` is taken out of the Child representing the qemu child
process to avoid partial move and make it possible for the main thread
still to call functions on QemuInner::qemu_process (e.g. kill(), id()).
Fixes#8937
Signed-off-by: Pavel Mores <pmores@redhat.com>
We'll want to capture qemu's stderr in parallel with normal runtime-rs
execution. Tokio's primitives make this much easier than std's. This
also makes child process management more consistent across runtime-rs
(i.e. virtiofsd child process is already launched and managed using tokio).
Some changes were necessary due to tokio functions being slightly different
from their std counterparts. Child::kill() is now async and Child::id()
now returns an Option.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Allow Kata CI's pod-nested-configmap-secret.yaml to work with
genpolicy and current cbl-mariner images:
1. Ignore the optional type field of Secret input YAML files.
It's possible that CoCo will need a more sophisticated Policy
for Secrets, but this change at least unblocks CI testing for
already-existing genpolicy features.
2. Adapt the value of the settings field below to fit current CI
images for testing on cbl-mariner Hosts:
"kata_config": {
"confidential_guest": false
},
Switching this value from true to false instructs genpolicy to
expect ConfigMap volume mounts similar to:
"configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "^$(cpath)/watchable/$(bundle-id)-[a-z0-9]{16}-",
"driver": "watchable-bind",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
},
instead of:
"confidential_configMap": {
"mount_type": "bind",
"mount_source": "$(sfprefix)",
"mount_point": "$(sfprefix)",
"driver": "local",
"fstype": "bind",
"options": [
"rbind",
"rprivate",
"ro"
]
}
},
This settings change unblocks CI testing for ConfigMaps.
Simple sanity testing for these changes:
genpolicy -u -y pod-nested-configmap-secret.yaml
kubectl apply -f pod-nested-configmap-secret.yaml
kubectl get pods | grep config
nested-configmap-secret-pod 1/1 Running 0 26s
Fixes: #8892
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This fixes a flaw pointed out in review of PR #8185. Creation of the
directory semantically fits better into VM preparation than VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Validating the node name is currently outside the scope of the CoCo
policy.
This change unblocks testing using Kata CI's test-pod-file-volume.yaml
and pv-pod.yaml.
Fixes: #8888
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Remove the unused DriverInfo declaration or integrate it into the codebase where applicable.
Fixes: #8927
Signed-off-by: yaoyinnan <35447132+yaoyinnan@users.noreply.github.com>
Add metadata containing the Policy annotation if the user didn't
provide any metadata in the input yaml file.
For a simple sanity test using a Kata CI YAML file:
genpolicy -u -y job.yaml
kubectl apply -f job.yaml
kubectl get pods | grep job
job-pi-test-64dxs 0/1 Completed 0 14s
Fixes: #8891
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
SharedVersion fiel add a versiontable property that isn't supported by upstream QEMU.
This is dead code since virtcontainers isn't setting SharedVersions to true.
Fixes: #7720
Signed-off-by: Kvlil <kalil.pelissier@gmail.com>
Ignore pod DNS settings because policing the network traffic is
currently outside the scope of the Agent Policy.
Example from Kata CI: pod-custom-dns.yaml
Fixes: #8832
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Deploy the framework added by the previous commit to generate qemu
command line and launch the VM.
We now properly store the child process object which allows us to
implement remaining Hypervisor functions necessary for a simple but
successful VM lifecycle, get_vmm_master_tid() and stop_vm().
Fixes#8184
Signed-off-by: Pavel Mores <pmores@redhat.com>
- test_volume_capacity_stats: verify the file block size against the fetched size via statfs()
- test_reseed_rng: Correct the request codes for RNDADDTOENTCNT and RNDRESEEDCRNG when platform is ppc64le
- test list_routes: Add the route only if destination is not empty
- test_new_fs_manager: skip the test if cgroups v2 is used by default
- skip test cases rpc::tests::test_do_write_stream, sandbox::tests::test_find_process, sandbox::t
ests::test_find_container_process and sandbox::tests::add_and_get_container on ppc64le as they are fl
aky
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
kata-ctl currently fails to build on ppc64le. Skip it for running static checks and the issues will be fixed and tracked in a seperate issue.
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
A few CPU related test cases were failing as the version was being verified against Power8 while the CI machine is Power9.
Fixes: #5531
Signed-off-by: Amulyam24 <amulmek1@in.ibm.com>
At the moment, a project `dragonball` and `runtime-rs` does not support
for s390x. During the enablement, some errors due to the misconfiguration
of Makefile for `make check` and `make vendor` were identified.
This is to skip the build for the affected target of the projects.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Some linting errors were identified during the enablement of `make check`.
These have not been found by the Jenkins CI job because `make test` was
only triggered.
The errors for the `agent` occurs under the s390x specific tests while
the other ones for the `kata-ctl` are the architecture-specific code.
This commit is to fix those errors.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The idea of most of these is just to prevent running into todo!()s where
we can at the moment, while implementing the fundamental functionality of
VM launch.
Signed-off-by: Pavel Mores <pmores@redhat.com>
DirectVolume/Rawblock doesn't work well when device's block driver
is virtio-blk-pci and the storage handler is DRIVER_BLK_PCI_TYPE.
Fixes: #8707
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Temporarily disable the allow_storages() rules, because they are based
on the tarfs snapshotter + container image integrity information that
are not available yet in the main branch - see #8833.
Fixes: #8834
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Adjust genpolicy-settings.json to match the container root path from
the main branch + cbl-mariner Guest VMs.
This configuration might have to be adjusted again when other types of
Guest VMs will be tested during CI using genpolicy, in the future.
Also, improve logging from allow_root_path(), to easier debug these
issues in the future.
Fixes: #8835
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Replace the `todo!()` calls with a minimal NOP implementation to return
the CH driver to working order since the `todo!()`'s forcibly crash the
driver at runtime. Full implementations for these APIs will be added on
issues #8800, #8801, and #8802.
Fixes: #8784.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Remove the `todo!()` macro which would cause a runtime crash and replace
with a implementation that returns an error as a stop-gap until #8800 is
implemented.
Fixes: #8785.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
It is a little bit heavy for the runtime-rs to forwards events via
containerd CLI, contrast to the ttrpc way. Plus, for runtimes that haven't
this mechanism, e.g. CRI-O, we can't get those events anywhere.
This patch introduces two types of forwarders:
- `ContainerdForwarder`: Acquire ttrpc address from environment variables
and forward events via ttrpc connection.
- `LogForwarder`: Write event info into logs.
Fixes: #7881
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This is not necessarily meant to work, just to stub out unimplemented
functionality while focusing on more fundamental things.
Signed-off-by: Pavel Mores <pmores@redhat.com>
The agent registers an event fd in `memory.oom_control`. An OOM event is
forwarded to containerd when the event is emitted, regardless of the
content in that file.
I observed content indicating that events should not be forwarded, as shown
below. When `oom_kill` is set to 0, it means no OOM has occurred. Therefore,
it is important to check the content to avoid mistakenly forwarding OOM
events.
```
oom_kill_disable 0
under_oom 0
oom_kill 0
```
Fixes: #8715
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Don't release the lock between is_allowed and set_policy calls,
because the policy might change in between these calls.
Also, move more policy code into policy.rs.
Fixes: #8734
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
- `ttrpc` from `0.7.1` to `0.8`.
- `containerd-shim-protos` from `0.3.0` to `0.6.0`.
Fixes: #8756
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
In order to avoid rust-vmm upstream change breaks Dragonball
compilation, we introduce Cargo.lock to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
In order to avoid rust-vmm upstream change breaks Dragonball
compilation, we introduce Cargo.lock to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
As reported in #8767, we have found that the root cause is that rust-vmm's vmm-sys-utils
introduce a new release 0.12.1 and dbs-pci rely on rust-vmm's vfio-ioctls which uses >=
to declare vmm-sys-utils so it automatically upgrade vmm-sys-utils to 0.12.1.
That's how two different versions of vmm-sys-utils is introduced and this breaks the compilation.
In order to fix this and also avoid future problems, we introduce Cargo.lock file to dbs crates.
fixes: #8770
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Previously, Dragonball did not support PCI device hot-plugging or
VFIO device passthrough. Therefore, the runtime-rs support for
Dragonball was incomplete. it is time to complete it so that users
can use Dragonball's PCI hot-plugging and VFIO passthrough capabilities.
Fixes: #8748
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
vfio commits introduce quite a lot change in runtime-rs, this commit is
for all the changes related to ci, including compilation errors and so on.
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce two new vmm action to implement pci hotplug
and pci hot-unplug: PrepareRemoveHostDevice and RemoveHostDevice.
PrepareRemoveHostDevice is to call upcall to unregister the pci device
in the guest kernel.
RemoveHostDevice should be called after PrepareRemoveHostDevice, it is used
to clean the PCI resource in the Dragonball side.
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Introduce a new vmm action InsertHostDevice to passthrough
host pci devices like NIC or GPU devices into guest so that
users could have high performance usage of those devices.
fixes: #8741
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Helin Guo <helinguo@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Add a pcie_topology field to DeviceManager and initialize
pcie_topology when ResourceManager calls DeviceManager's new()
with TopologyConfigInfo.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Before calling the device driver to attach a device, register
the device to PCIe topology and allocate a PciPath for it.
However, for some hypervisor such as CLH, the allocation is invalid
when plugging devices to VM, they have the ability to return
DeviceInfo containing PciPath. It'll update the PciPath with the
returned pci path in the PCIe topology for them to prevent the
inferred pcipath from being different from the actual value returned.
But the update will not be executed if the pcipath value doesn't change.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce helper macros to simplify PCIe device register/unregister
and update, which provides a convenient way to handle devices in
topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add one more argument with type &mut Option<&mut PCIeTopology>
in attach and detach to inroduce methods within PCIe Topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Implement Trait PCIeDevice register/unregister for pcie/pci
device, such as vfio device which needs set/get device's pci
path for kata agent's device handler.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Introduce Trait PCIeDevice with register/unregister, which are
used to register or unregister pcie device within the PCIe topology.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Due to different ways that different VMMs handle PCI devices,
we expect to provide a general PCIe topology processing framework
that is as compatible as possible with VMMs such as dragonball,
qemu, clh(Though it has its own management method, no conflict).
Currently,it's mainly developed for kinds of PCIe/PCI devices in
dragonball/clh which are attached on the pci/pcie root bus directly.
More will be added when Qemu is ready in runtime-rs.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
A TopologyConfigInfo added to store device config info for PCIe/PCI
devices in the VM from Hypervisor DeviceInfo.
And TopologyConfigInfo::new will be the entry to initialize PCIe
Topology for each VM.
Fixes: #7218
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
vfio mod collects lots of information related to the vfio operations, including VfioMsi and VfioMsix capability & state,
vfio interrupt info, pci region infor and vfio pci device info & state.
fixes: #8722
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Bridge the gap between user requirements for direct block device access
and the DirectVolume capabilities provided by Kata runtimes
(kata-runtime/runtime-rs), and facilitate seamless integration with CSI
to improve user experience.
It aims to integrate DirectVolume CSI support into Kata, enabling users
to benefit from its performance and flexibility advantages.
Fixes: #8602
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
This patch introduces a feature of supporting vhost-user-blk device.
This device needs to be defined before the VM instance is started,
which can be done through the dbs-cli tool with --virblks option:
--virblks '{
"drive_id": "8623",
"device_type": "Spdk",
"path_on_host": "spdk:///var/tmp/vhost.sock",
"is_root_device": false,
"is_read_only": false,
"is_direct": false,
"no_drop": false,
"num_queues": 1,
"queue_size": 256
}'
Fixes: #8631
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: fupan <fupan.lfp@antgroup.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Qinqi Qu <quqinqi@linux.alibaba.com>
The compiler will give a warning if a developer forget to add an arm for
a new variants defined.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
DAN reads vhost-user-net device from JSON config. It only supports VMM
running as server right now.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
The changes involve:
- Expose VhostUserConfig struct to runtime-rs.
- Set a default value while num_queues or queue_size are 0.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
This commit introduces VhostUserEndpoint and supports relative to
vhost-user-net devices for device manager. For now, Dragonball is able to
attach vhost-user-net devices.
Fixes: #8625
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Config space of network device is shared and accord with virtio 1.1 spec.
It is a good way to abstract the common part into one function.
`set_config_space()` implements this.
Plus, this patch removes `vq_pairs` from vhost-net devices, since there is
a possibility of data inconsistency. For example, some places read that
from `self.vq_pairs`, others read from `queue_sizes.len() / 2`.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Also support alternative media type and update samples
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Add application that infers K8s user's intentions based on user's
K8s YAML file, and generates a Rego/OPA based policy for that YAML.
Just Pod YAML files are supported as input using this initial source
code. Support for other types of YAML files will come with upcoming
commits.
Fixes: #7673
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
introduce msi/msix mod to maintain information for PCI Message Signalled
Interrupt Extended Capability. It will be initialized when parsing pci
configuration space and used when getting interrupt capabilities.
fixes: #8661
Signed-off-by: Gerry Liu <gerry@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Shifang Feng <fengshifang@linux.alibaba.com>
Signed-off-by: Yang Su <yang.su@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Xin Lin <jingshan@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
This PR introduces vhost-user-net devices to Dragonball. The devices are
allowed to run as server on the VMM side.
Fixes: #8502
Signed-off-by: Eric Ren <renzhen@linux.alibaba.com>
Signed-off-by: Liu Jiang <gerry@linux.alibaba.com>
Signed-off-by: Zha Bin <zhabin@linux.alibaba.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Signed-off-by: Zizheng Bian <zizheng.bian@linux.alibaba.com>
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Vhost-user-net has a dependency on address space from `MmioV2DeviceState`.
The addition of the address space is introduced in this patch. Plus, it
makes sure all unit tests have the according parameter as well.
Fixes: #8502
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
It's important to ensure that these tasks which setup vfio
devices are completed before add_device.
So Moving vfio device setup code to a dedicated method at device
building time which does not affect the behavior of other code.
And this change makes it easier to understand the difference
between create and attach, and also makes the boundaries
clearer.
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Make VhostUserConfig pci_path's type more specific, change it
from Option<String> to Option<PciPath>.
Fixes: #8665
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Add implementations for the following `Hypervisor` trait methods which
simply return the same details as the `get_vmm_master_tid()` method:
- `get_thread_ids()`
- `get_pids()`
Fixes: #6438.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This is to reintroduce a configuration rule for IBM Z Secure Execution,
where no initrd path should be configured. For the TEE of interest,
only a kernel image should be specified with `confidential_guest=true`.
Fixes: #8692
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`make SUPPORT_VIRTUALIZATION=1 test` iterates through all subcrates and
does test.
Plus, this patch fixes some issues about unit tests:
- Feed too much parameters to `I8042Device::new()`.
- Virtqueue checks have been introduced since `virtio-queue v0.7.0`.
- GHA might have no access to `/var/tmp` dir on runner.
Fixes: #8690
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>