Even though ociSpec.Linux.Devices is preserved when vfio_mode is VFIO,
it has not been updated correctly for coldplug scenarios. This happens
because the device info passed to the agent via CreateContainerRequest
is dropped by the Kata runtime.
This commit ensures that the device info is added to the sandbox's
device manager when vfio_mode is VFIO and coldPlugVFIO is true
(e.g., vfio-ap-cold), allowing ociSpec.Linux.Devices to be properly
updated with the device information before the container is created on
the guest.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
As we don't have any CI, nor maintainer to keep ACRN code around, we
better have it removed than give users the expectation that it should or
would work at some point.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
First of all, this is a controversial piece, and I know that.
In this commit we're trying to make a less greedy approach regards the
amount of vCPUs we allocate for the VMM, which will be advantageous
mainly when using the `static_sandbox_resource_mgmt` feature, which is
used by the confidential guests.
The current approach we have basically does:
* Gets the amount of vCPUs set in the config (an integer)
* Gets the amount of vCPUs set as limit (an integer)
* Sum those up
* Starts / Updates the VMM to use that total amount of vCPUs
The fact we're dealing with integers is logical, as we cannot request
500m vCPUs to the VMMs. However, it leads us to, in several cases, be
wasting one vCPU.
Let's take the example that we know the VMM requires 500m vCPUs to be
running, and the workload sets 250m vCPUs as a resource limit.
In that case, we'd do:
* Gets the amount of vCPUs set in the config: 1
* Gets the amount of vCPUs set as limit: ceil(0.25)
* 1 + ceil(0.25) = 1 + 1 = 2 vCPUs
* Starts / Updates the VMM to use 2 vCPUs
With the logic changed here, what we're doing is considering everything
as float till just before we start / update the VMM. So, the flow
describe above would be:
* Gets the amount of vCPUs set in the config: 0.5
* Gets the amount of vCPUs set as limit: 0.25
* ceil(0.5 + 0.25) = 1 vCPUs
* Starts / Updates the VMM to use 1 vCPUs
In the way I've written this patch we introduce zero regressions, as
the default values set are still the same, and those will only be
changed for the TEE use cases (although I can see firecracker, or any
other user of `static_sandbox_resource_mgmt=true` taking advantage of
this).
There's, though, an implicit assumption in this patch that we'd need to
make explicit, and that's that the default_vcpus / default_memory is the
amount of vcpus / memory required by the VMM, and absolutely nothing
else. Also, the amount set there should be reflected in the
podOverhead for the specific runtime class.
One other possible approach, which I am not that much in favour of
taking as I think it's **less clear**, is that we could actually get the
podOverhead amount, subtract it from the default_vcpus (treating the
result as a float), then sum up what the user set as limit (as a float),
and finally ceil the result. It could work, but IMHO this is **less
clear**, and **less explicit** on what we're actually doing, and how the
default_vcpus / default_memory should be used.
Fixes: #6909
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
On hotplug of memory as containers are started, remount all ephemeral mounts with size option set to the total sandbox memory
Fixes: #6417
Signed-off-by: Sidhartha Mani <sidhartha_mani@apple.com>
Fixes: #5993
Several tests utilize linux'isms like Mounts, bindmounts, vsock etc.
Let's ensure that these are still tested on Linux, but that we also skip
these tests when on other operating systems (Darwin). This commit just
moves tests; there shouldn't be any functional test changes. While the
tests still won't be runnable on Darwin/other hosts yet, this is a necessary
step forward.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Danny Canter <danny@dcantah.dev>
Augment the mock hypervisor so that we can validate that ACPI memory hotplug
is carried out as expected.
We'll augment the number of memory slots in the hypervisor config each
time the memory of the hypervisor is changed. In this way we can ensure
that large memory hotplugs are broken up into appropriately sized
pieces in the unit test.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The directory created by `T.TempDir` is automatically removed when the
test and all its subtests complete.
This commit also updates the unit test advice to use `T.TempDir` to
create temporary directory in tests.
Fixes: #3924
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Currently, the block driver option is specifed by hard coding, maybe it
is better to use const string variables instead of hard coded strings.
Another modification is to remove duplicate consts for virtio driver in
manager.go.
Fixes: #3321
Signed-off-by: Jason Zhang <zhanghj.lc@inspur.com>
To resourcecontrol, and make it consistent with the fact that cgroups
are a Linux implementation of the ResourceController interface.
Fixes: #3601
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Support hugepages and port from:
96dbb2e8f0Fixes: #3342
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: bin <bin@hyper.sh>
We are converting the Network structure into an interface, so that
different host OSes can have different networking implementations for
Kata.
One step into that direction is to rename all the Network structure
fields and methods to something that is less Linux networking namespace
specific. This will make the Network interface naming consistent.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
The OCI container spec specifies a limit of -1 signifies
unlimited memory. Update the sandbox memory calculator
to reflect this part of the spec.
Fixes: #3512
Signed-off-by: Braden Rayhorn <bradenrayhorn@fastmail.com>
The new API is based on containerd's cgroups package.
With that conversion we can simpligy the virtcontainers sandbox code and
also uniformize our cgroups external API dependency. We now only depend
on containerd/cgroups for everything cgroups related.
Depends-on: github.com/kata-containers/tests#3805
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This commit add option "enable_guest_swap" to config hypervisor.qemu.
It will enable swap in the guest. Default false.
When enable_guest_swap is enabled, insert a raw file to the guest as the
swap device if the swappiness of a container (set by annotation
"io.katacontainers.container.resource.swappiness") is bigger than 0.
The size of the swap device should be
swap_in_bytes (set by annotation
"io.katacontainers.container.resource.swap_in_bytes") - memory_limit_in_bytes.
If swap_in_bytes is not set, the size should be memory_limit_in_bytes.
If swap_in_bytes and memory_limit_in_bytes is not set, the size should be
default_memory.
Fixes: #2201
Signed-off-by: Hui Zhu <teawater@antfin.com>
To workaround virtiofs' lack of inotify support, we'll special case
particular mounts which are typically watched, and pass on information
to the agent so it can ensure that the mount presented to the container
is indeed watchable (see applicable agent commit).
This commit will:
- identify watchable mounts based on file count and mount source
- create a watchable-bind storage object for these mounts to
communicate intent to the agent
- update the OCI spec to take the updated watchable mount source into account
Unit tests added and updated for the newly introduced
functionality/functions.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
There's no reason to pass the paths; they can be
determined when they are actually used.
Let's make the return values more comparable to the other mount handling
functions (we'll add storage object in future commit), and pass the mount maps as
function parameters.
...No functional changes here...
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The property name make newcomers confused when reading code.
Since in Kata Containers 2.0 there will only be one type of store,
so it's safe to replace it by `store` simply.
Fixes: #1660
Signed-off-by: bin <bin@hyper.sh>
Modify calls in unit tests to use context since many functions were
updated to accept local context to fix trace span ordering.
Fixes#1355
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
A significant number of trace calls did not use a parent context that
would create proper span ordering in trace output. Add local context to
functions for use in trace calls to facilitate proper span ordering.
Additionally, change whether trace function returns context in some
functions in virtcontainers and use existing context rather than
background context in bindMount() so that span exists as a child of a
parent span.
Fixes#1355
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
In PR 1079, CleanupContainer's parameter of sandboxID is changed to VCSandbox, but at cleanup,
there is no VCSandbox is constructed, we should load it from disk by loadSandboxConfig() in
persist.go. This commit reverts parts of #1079Fixes: #1119
Signed-off-by: bin liu <bin@hyper.sh>
Remove global sandbox variable, and save *Sandbox to hypervisor struct.
For some needs, hypervisor may need to use methods from Sandbox.
Signed-off-by: bin liu <bin@hyper.sh>
Make `asset.go` the arbiter of asset annotations by removing all asset
annotations lists from other parts of the codebase.
This makes the code simpler, easier to maintain, and more robust.
Specifically, the previous behaviour was inconsistent as the following
ways:
- `createAssets()` in `sandbox.go` was not handling the following asset
annotations:
- firmware:
- `io.katacontainers.config.hypervisor.firmware`
- `io.katacontainers.config.hypervisor.firmware_hash`
- hypervisor:
- `io.katacontainers.config.hypervisor.path`
- `io.katacontainers.config.hypervisor.hypervisor_hash`
- hypervisor control binary:
- `io.katacontainers.config.hypervisor.ctlpath`
- `io.katacontainers.config.hypervisor.hypervisorctl_hash`
- jailer:
- `io.katacontainers.config.hypervisor.jailer_path`
- `io.katacontainers.config.hypervisor.jailer_hash`
- `addAssetAnnotations()` in the `oci` package was not handling the
following asset annotations:
- hypervisor:
- `io.katacontainers.config.hypervisor.path`
- `io.katacontainers.config.hypervisor.hypervisor_hash`
- hypervisor control binary:
- `io.katacontainers.config.hypervisor.ctlpath`
- `io.katacontainers.config.hypervisor.hypervisorctl_hash`
- jailer:
- `io.katacontainers.config.hypervisor.jailer_path`
- `io.katacontainers.config.hypervisor.jailer_hash`
This change fixes the bug where specifying a custom hypervisor path via an
asset annotation was having no effect.
Fixes: #1085.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
bindmount remount events are not propagated through mount subtrees,
so we have to remount the shared dir mountpoint directly.
E.g.,
```
mkdir -p source dest foo source/foo
mount -o bind --make-shared source dest
mount -o bind foo source/foo
echo bind mount rw
mount | grep foo
echo remount ro
mount -o remount,bind,ro source/foo
mount | grep foo
```
would result in:
```
bind mount rw
/dev/xvda1 on /home/ubuntu/source/foo type ext4 (rw,relatime,discard,data=ordered)
/dev/xvda1 on /home/ubuntu/dest/foo type ext4 (rw,relatime,discard,data=ordered)
remount ro
/dev/xvda1 on /home/ubuntu/source/foo type ext4 (ro,relatime,discard,data=ordered)
/dev/xvda1 on /home/ubuntu/dest/foo type ext4 (rw,relatime,discard,data=ordered)
```
The reason is that bind mount creats new mount structs and attaches them to different mount subtrees.
However, MS_REMOUNT only looks for existing mount structs to modify and does not try to propagate the
change to mount structs in other subtrees.
Fixes: #1061
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
CPUSet cgroup allows for pinning the memory associated with a cpuset to
a given numa node. Similar to cpuset.cpus, we should take cpuset.mems
into account for the sandbox-cgroup that Kata creates.
Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
CPUSet cgroup allows for pinning the memory associated with a cpuset to
a given numa node. Similar to cpuset.cpus, we should take cpuset.mems
into account for the sandbox-cgroup that Kata creates.
Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
Remove tests from virtcontainers/sandbox_test.go which were moved to
virtcontainers/types/sandbox_test.go.
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
To use the kata-containers repo path.
Most of the change is generated by script:
find . -type f -name "*.go" |xargs sed -i -e \
's|github.com/kata-containers/runtime|github.com/kata-containers/kata-containers/src/runtime|g'
Fixes: #201
Signed-off-by: Peng Tao <bergwolf@hyper.sh>