Mount validation for sealed secret requires the base path to start with
`/run/kata-containers/shared/containers`. Previously, it used
`/run/kata-containers/sandbox/passthrough`, which caused test
failures where volume mounts are used.
This commit renames the path to satisfy the validation check.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
- Set guest Storage.options for block rootfs to empty (do not propagate host mount options).
- Align behavior with Go runtime: only add xfs nouuid when needed.
Signed-off-by: Caspian443 <scrisis843@gmail.com>
In #11693 the cc_init_data annotation was changes to be hypervisor
scoped, so each hypervisor needs to explicitly allow it in order to
use it now, so add this to both the go and rust runtime's remote
configurations
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Currently, we change vm_rootfs_driver as the initdata device driver
with block_device_driver.
Fixes#11697
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
we also need support initdat within nonprotection even though the
platform is detected as NonProtection or usually is called nontee
host. Within these cases, there's no need to validate the item of
`confidential_guest=true`, we believe the result of the method
`available_guest_protection()?`.
Fixes#11697
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The default `reconnect_timeout` (3 seconds) was found to be insufficient for
IBM SEL when using VSOCK. This commit updates the timeouts as follows:
- `dial_timeout_ms`: Set to 90ms to match the value used in go-runtime for IBM SEL
- `reconnect_timeout_ms`: Increased to 5000ms based on empirical testing
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Add support for the `InitData` resource config on IBM SEL,
so that a corresponding block device is created and the
initdata is passed to the guest through this device.
Note that we skip passing the initdata hash via QEMU’s
object, since the hypervisor does not yet support this
mechanism for IBM SEL. It will be introduced separately
once QEMU adds the feature.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We need to include `cc_init_data` in the enable_annotations
array to pass the data. Since initdata is a CoCo-specific
feature, this commit introduces a new array,
`DEFENABLEANNOTATIONS_COCO`, which contains the required
string and applies it to the relevant CoCo configuration.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This commit support the seccomp_sandbox option from the configuration.toml file
and add the logic for appending command-line arguments based on this new configuration parameter.
Fixes: #11524
Signed-off-by: wangxinge <wangxinge@bupt.edu.cn>
The seccomp feature for Cloud Hypervisor and Firecracker is enabled by default.
This commit introduces an option to disable seccomp for both and updates the built-in configuration.toml file accordingly.
Fixes: #11535
Signed-off-by: wangxinge <wangxinge@bupt.edu.cn>
Route kata-shim logs directly to systemd-journald under 'kata' identifier.
This refactoring enables `kata-shim` logs to be properly attributed to
'kata' in systemd-journald, instead of inheriting the 'containerd'
identifier.
Previously, `kata-shim` logs were challenging to filter and debug as
they
appeared under the `containerd.service` unit.
This commit resolves this by:
1. Introducing a `LogDestination` enum to explicitly define logging
targets (File or Journal).
2. Modifying logger creation to set `SYSLOG_IDENTIFIER=kata` when
logging
to Journald.
3. Ensuring type safety and correct ownership handling for different
logging backends.
This significantly enhances the observability and debuggability of Kata
Containers, making it easier to monitor and troubleshoot Kata-specific
events.
Fixes: #11590
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Also included (as commented out) is a test that does not pass although
it should. See source code comment for explanation why fixing this seems
beyond the scope of this PR.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit focuses purely on the formal change of type. If any subsequent
changes in semantics are needed they are purposely avoided here so that the
commit can be reviewed as a 100% formal and 0% semantic change.
Signed-off-by: Pavel Mores <pmores@redhat.com>
This commit addresses a part of the same problem as PR #7623 did for the
golang runtime. So far we've been rounding up individual containers'
vCPU requests and then summing them up which can lead to allocation of
excess vCPUs as described in the mentioned PR's cover letter. We address
this by reversing the order of operations, we sum the (possibly fractional)
container requests and only then round up the total.
We also align runtime-rs's behaviour with runtime-go in that we now
include the default vcpu request from the config file ('default_vcpu')
in the total.
We diverge from PR #7623 in that `default_vcpu` is still treated as an
integer (this will be a topic of a separate commit), and that this
implementation avoids relying on 32-bit floating point arithmetic as there
are some potential problems with using f32. For instance, some numbers
commonly used in decimal, notably all of single-decimal-digit numbers
0.1, 0.2 .. 0.9 except 0.5, are periodic in binary and thus fundamentally
not representable exactly. Arithmetics performed on such numbers can lead
to surprising results, e.g. adding 0.1 ten times gives 1.0000001, not 1,
and taking a ceil() results in 2, clearly a wrong answer in vcpu
allocation.
So instead, we take advantage of the fact that container requests happen
to be expressed as a quota/period fraction so we can sum up quotas,
fundamentally integral numbers (possibly fractional only due to the need
to rewrite them with a common denominator) with much less danger of
precision loss.
Signed-off-by: Pavel Mores <pmores@redhat.com>
As we have changed the initdata annotation definition, Accordingly, we also
need correct its const definition with KATA_ANNO_CFG_RUNTIME_INIT_DATA.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Add full cgroups support on host. Cgroups are managed by `FsManager` and
`SystemdManager`. As the names impies, the `FsManager` manages cgroups
through cgroupfs, while the `SystemdManager` manages cgroups through
systemd. The two manages support cgroup v1 and cgroup v2.
Two types of cgroups path are supported:
1. For colon paths, for example "foo.slice:bar:baz", the runtime manages
cgroups by `SystemdManager`;
2. For relative/absolute paths, the runtime manages cgroups by
`FsManager`.
vCPU threads are added into the sandbox cgroups in cgroup v1 + cgroupfs,
others, cgroup v1 + systemd, cgroup v2 + cgroupfs, cgroup v2 + systemd, VMM
process is added into the cgroups.
The systemd doesn't provide a way to add thread to a unit. `add_thread()`
in `SystemdManager` is equivalent to `add_process()`.
Cgroup v2 supports threaded mode. However, we should enable threaded mode
from leaf node to the root node (`/`) iteratively [1]. This means the
runtime needs to modify the cgroups created by container runtime (e.g.
containerd). Considering cgroupfs + cgroup v2 is not a common combination,
its behavior is aligned with systemd + cgroup v2, which is not allowed to
manage process at the thread level.
1: https://www.kernel.org/doc/html/v4.18/admin-guide/cgroup-v2.html#threadsFixes: #11356
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
As some reasons, it first should make it align with runtime-go, this
commit will do this work.
Fixes#11543
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
When enabling systemd cgroup driver and sandbox cgroup only, the shim is
under a systemd unit. When the unit is stopping, systemd sends SIGTERM to
the shim. The shim can't exit immediately, as there are some cleanups to
do. Therefore, ignoring SIGTERM is required here. The shim should complete
the work within a period (Kata sets it to 300s by default). Once a timeout
occurs, systemd will send SIGKILL.
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
It is important that we continue to support VirtIO-SCSI. While
VirtIO-BLK is a common choice, virtio-scsi offers significant
performance advantages in specific scenarios, particularly when
utilizing iothreads and with NVMe Fabrics.
Maintaining Flexibility and Choice by supporting both virtio-blk and
virtio-scsi, we provide greater flexibility for users to choose the
optimal storage(virtio-blk, virtio-scsi) interface based on their
specific workload requirements and hardware configurations.
As virtio-scsi controller has been created when qemu vm starts with
block device driver is set to `virtio-scsi`. This commit is for blockdev_add
the backend block device and device_add frondend virtio-scsi device via qmp.
Fixes#11516
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
As block device index is an very important unique id of a block device
and can indicate a block device which is equivalent to device_id.
In case of index is required in calculating scsi LUN and reduce
useless arguments within reusing `hotplug_block_device`, we'd better
change the device_id with block device index.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
In this commit, block device aio are introduced within hotplug_block_device
within qemu via qmp and the "iouring" is set the default.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
It should be correctly handled within the device manager when do
create_block_device if the driver_option is virtio-scsi.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
It supports handling scsi device when block device driver is `scsi`.
And it will ensure a correct storage source with LUN.
Fixes#11516
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
It's used to help discover scsi devices inside guest and also add a
new const value `KATA_SCSI_DEV_TYPE` to help pass information.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
Although Previous implementation of hotplugging block device via QMP
can successfully hot-plug the regular file based block device, but it
fails when the backend is /dev/xxx(e.g. /dev/loop0). With analysis about
it, we can know that it lacks the ablility to hotplug host block devices.
This commit will fill the gap, and make it work well for host block
devices.
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>
According to the issue [1], Tokio will panic when we are giving a blocking
socket to Tokio's `from_std()` method, the information is as follows:
```
A panic occurred at crates/agent/src/sock/vsock.rs:59: Registering a
blocking socket with the tokio runtime is unsupported. If you wish to do
anyways, please add `--cfg tokio_allow_from_blocking_fd` to your RUSTFLAGS.
```
A workaround is to set the socket to non-blocking.
1: https://github.com/tokio-rs/tokio/issues/7172
Signed-off-by: Xuewei Niu <niuxuewei.nxw@antgroup.com>
Bump these crates across various components to remove the
dependency on unmaintained instant crate and remediate
RUSTSEC-2024-0384
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The init data could not be read properly within kata-agent because the
data length field was omitted, a consequence of a mismatch in the data
write format.
Fixes#11556
Signed-off-by: alex.lyn <alex.lyn@antgroup.com>