Commit Graph

17197 Commits

Author SHA1 Message Date
Fabiano Fidêncio
e85cf83573 k8s: tests: Fix default for EXPERIMENTAL_FORCE_GUEST_PULL
It takes either a shim name or "", but we were treating this (thankfully
only in this specific file) as a boolean.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Manuel Huber
8b39468b36 tests: nvidia: Logging for NIM
Adjust output to the setup_file and teardown_file behavior.
With this, we will be able to observe relevant logging rather than
adding to the output variable.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-11-10 13:01:30 +01:00
Fabiano Fidêncio
812191c1f3 tests: nvidia: Do not deploy NFD on nvidia-gpu cases
As it'll come from the GPU Operator for now.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Pavel Mores
74f9fdb11f runtime-rs: remove hardcoding of SEV physical address reduction
The previous commit enabled getting the physical address reduction from
the processor but just stored it for later use.  This commit adds handling
of the value to ProtectionDevice and enables the QEMU driver to use it.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-10 13:01:03 +01:00
Pavel Mores
6f9178d290 runtime-rs: get SEV params using CPUID and store them in SevSnpDetails
An implementation of cbitpos acquisition is supplied that was missing
so far.  We also get the physical address reduction value from the same
source (CPUID Fn8000_001f function).  This has been hardcoded at 1 so far,
following the Go runtime example, but it's better to get it from the
processor.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-10 13:01:03 +01:00
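As an illustration of the CPUID-based lookup described in 6f9178d290, here is a minimal Python sketch of how the two values could be unpacked from the EBX register returned by CPUID Fn8000_001F. It assumes the layout EBX[5:0] = C-bit position and EBX[11:6] = physical address reduction; the function name `decode_sev_ebx` is hypothetical and is not taken from runtime-rs, which does this in Rust.

```python
def decode_sev_ebx(ebx: int) -> tuple[int, int]:
    """Split an EBX value from CPUID Fn8000_001F into (cbitpos, phys_addr_reduction).

    Hypothetical helper for illustration only; not the runtime-rs implementation.
    """
    cbitpos = ebx & 0x3F                      # EBX[5:0]: C-bit position
    phys_addr_reduction = (ebx >> 6) & 0x3F   # EBX[11:6]: physical address reduction
    return cbitpos, phys_addr_reduction


# Synthetic register value: C-bit at position 51, reduction of 1 bit.
example_ebx = (1 << 6) | 51
print(decode_sev_ebx(example_ebx))
```

The sketch takes the register value as input rather than executing CPUID, so it can be exercised on any machine.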
Greg Kurz
5810279edf Merge pull request #12008 from microsoft/saulparedes/allow_priv
webhook: allow privileged containers
2025-11-10 11:13:41 +01:00
Zvonko Kaiser
df58972d41 Merge pull request #12051 from microsoft/danmihai1/agent-version
agent: update version.rs when VERSION file changed
2025-11-09 20:34:58 -05:00
Fabiano Fidêncio
37d4eb0b77 ci: nvidia: Ensure K8S_TEST_HOST_TYPE=baremetal
So the proper cleanups are performed in case something goes awry in a
previous run.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-09 10:51:33 +01:00
Dan Mihai
7b10f4c72a agent: update version.rs when VERSION file changed
- version.rs gets generated from version.rs.in
- version.rs.in contains values read from VERSION
- so version.rs (and maybe other Agent files too) must be
  re-generated when the VERSION file changes

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 17:53:09 +00:00
Alex Lyn
83b0a59215 Merge pull request #12046 from Apokleos/disable-guest-emptydir
Disable guest emptydir
2025-11-08 11:54:15 +08:00
Dan Mihai
df7ee2dd38 ci: k8s: AUTO_GENERATE_POLICY for cbl-mariner
Auto-generate policy on cbl-mariner Hosts if the user didn't
specify an AUTO_GENERATE_POLICY value.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Dan Mihai
53acb74f26 genpolicy: adapt to new AKS pause container behavior
The new image reference has changed to mcr.microsoft.com/oss/v2/kubernetes/pause:3.6
from mcr.microsoft.com/oss/kubernetes/pause:3.6.

The new image uses UID=0, GID=0 by default, while the older image had
UID=65535, GID=65535.

There is a new pause_container_id_policy field in genpolicy-settings.json, informing
genpolicy about the way AdditionalGids gets updated - "v1" for the older behavior
and "v2" for the newer AKS version:
- When using v1, the default value of AdditionalGids is {65535}.
- When using v2, the default value of AdditionalGids is {}.

UID=65535 and GID=65535 are still hard-coded by default in genpolicy-settings.json.
We might be able to remove/ignore these fields in the future, if we'll stop relying
on policy::KataSpec::get_process_fields to use these fields.

A new CI function adapt_common_policy_settings_for_aks() changes the pause container
UID, GID, pause_container_id_policy, and image ref settings values when testing on
AKS Hosts - i.e., when testing coco-dev or mariner Hosts.

The genpolicy workarounds for the unexpected behavior with guest pull enabled have
been improved to use the current container's GID instead of hard-coding GID=0 as the
guest pull default. Also, AdditionalGids gets updated when the current container's GID is
changing, instead of always changing the AdditionalGids at the very end of
policy::AgentPolicy::get_container_process(), when the relevant evolution of the GID
value was no longer available.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
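The v1/v2 rule for the default AdditionalGids described in 53acb74f26 can be sketched as follows. This is a Python illustration only: genpolicy itself is written in Rust, and the function name `default_additional_gids` is hypothetical. Only the two policy values and their defaults come from the commit message.

```python
def default_additional_gids(pause_container_id_policy: str) -> set[int]:
    """Return the default AdditionalGids for a pause container.

    Hypothetical helper: "v1" models the older AKS pause image behavior,
    "v2" the newer one, as described in the commit message.
    """
    if pause_container_id_policy == "v1":
        return {65535}
    if pause_container_id_policy == "v2":
        return set()
    # Any other value is rejected rather than silently defaulted.
    raise ValueError(f"unknown pause_container_id_policy: {pause_container_id_policy!r}")


print(default_additional_gids("v1"), default_additional_gids("v2"))
```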
Dan Mihai
1f784bb770 genpolicy: improve policy generation comments
Make it easier to understand the source of the UID/GID/AdditionalGids
values from the container in the auto-generated policy.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Dan Mihai
969b8e0fb8 genpolicy: more detailed UID/GID debug logs
Add more details to code paths handling UID/GID values, for easier
debugging.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Dan Mihai
cacd37ee6e tests: genpolicy: restore test settings for non-Coco configMap
These settings got broken recently because the non-CoCo tests were
disabled for unrelated reasons.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Alex Lyn
23024876b2 runtime-rs: Use the configurable disable_guest_empty_dir
Stop hardcoding disable_guest_empty_dir and use its real value,
which comes from the configuration.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-11-07 19:52:11 +08:00
Alex Lyn
382924bdf3 kata-sys-util: Introduce a sandbox annotation for disable guest emptydir
A sandbox annotation that determines whether Kubernetes emptyDir
mounts should be created on the guest filesystem.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-11-07 19:48:42 +08:00
Alex Lyn
720a229579 kata-types: Introduce disable guest emptydir flag
This flag controls whether Kubernetes emptyDir mounts are created on
the guest filesystem. If enabled, the runtime will not create Kubernetes
emptyDir mounts on the guest filesystem. Instead, emptyDir mounts will
be created on the host and shared via virtio-fs.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-11-07 19:45:55 +08:00
Fabiano Fidêncio
03e06fdf4d tests: nvidia: Deploy Trustee
Let's ensure Trustee is deployed as some of the tests rely on images that
live behind authentication. /o\

The approach taken here to deploy Trustee is exactly the same one taken
on the other CoCo tests, apart from an env var passed to ensure we're
using the NVIDIA remote verifier (which will come in handy very
soon).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-07 12:32:11 +01:00
Pavel Mores
841fee28da runtime-rs: add a helper to run external command and capture its output
This isn't really related to remote hypervisor though it was useful for
its debugging.  It's a small helper I've been using regularly during
development for quite some time that I think might be useful more broadly.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-07 10:49:14 +01:00
Pavel Mores
72c704b287 runtime-rs: make error reporting for CreateVM a bit more explicit
A naked ttrpc error with no context turns out to be rather hard to
understand or even notice in the log.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-07 10:49:14 +01:00
Pavel Mores
45d8141edc runtime-rs: remote hv needs neither image nor initrd specified in config
The remote hypervisor launches no VM, it just instructs the Cloud API
Adaptor to do so, therefore it has no need for an image or initrd to boot
from and should be exempt from the mandate for one or the other to be
specified.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-07 10:49:14 +01:00
Pavel Mores
80ef102a00 runtime-rs: fix scoping of the remote hv Hypervisor service
The go runtime's .proto file - which is also used by the Cloud API
Adaptor - puts the Hypervisor service into the "hypervisor" package.
runtime-rs has to do the same to avoid an "unimplemented" error.

Signed-off-by: Pavel Mores <pmores@redhat.com>
2025-11-07 10:49:14 +01:00
Alex Lyn
d5e2071869 Merge pull request #11921 from Apokleos/enhance-copyfile2
runtime-rs: Add support LocalStorage for emptyDir within nontee cases
2025-11-07 16:58:39 +08:00
Fabiano Fidêncio
a591cda466 gatekeeper: Adjust the nvidia gpu test name
With the change made to the matrix when the CC GPU runner was added,
there was a change in the job name (@sprt saw that coming, but I
didn't).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Manuel Huber
c6dc176a03 tests: nvidia: cc: Enable NIMs tests
Same deal as the previous commit, just enabling the tests here, with the
same list of improvements that we will need to go through in order to
get it working in a perfect way.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Manuel Huber
8ca77f2655 tests: nvidia: cc: Run CUDA vectorAdd tests on CC mode
While the primary goal of this change is to detect regressions to the
NVIDIA SNP GPU scenario, various improvements to reflect a more
realistic CC setting are planned in subsequent changes, such as:

* moving away from the overlayfs snapshotter
* disabling filesystem sharing
* applying a pod security policy
* activating the GPUs only after attestation
* using a refined approach for GPU cold-plugging without requiring
  annotations
* revisiting pod timeout and overhead parameters (the podOverhead value
  was increased due to CUDA vectorAdd requiring about 6Gi of
  podOverhead, as well as the inference and embedqa requiring at least
  12Gi, respectively, 14Gi of podOverhead to run without invoking the
  host's oom-killer. We will revisit this aspect after addressing
  points 1. and 2.)

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Manuel Huber
25ce0afd52 kata-deploy: Allow the CDI annotation for CC GPU cases
For the nvidia-gpu-snp and nvidia-gpu-tdx we must set containerd to
allow the CDI annotation to be passed down.

This solution may become obsolete soon enough, but the cleanest way to
have it properly working is by adding it here (even if we remove it
before the next release).

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Manuel Huber
c91edf884b runtimeclasses: nvidia: Bump TEE podOverhead
It's been noticed that as more RAM is needed to run the CC tests, we
also need to update the podOverhead of the NVIDIA CC runtime classes to
avoid getting OOM Killed.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Fupan Li
aac2a37ff5 runtime-rs: enable pselect6 syscall for dragonball seccomp
Since nerdctl's network hook calls the pselect6 syscall through
xtables-nft-multi, add it to the seccomp whitelist.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-06 11:17:57 +01:00
Hyounggyu Choi
ff429072b6 Merge pull request #11924 from BbolroC/fix-static-checks-actionspz
ci: Fix failing static checks to enable IBM actionspz - Z specific
2025-11-06 09:04:04 +01:00
Zvonko Kaiser
fce6a75899 Merge pull request #12027 from fidencio/topic/kata-deploy-make-ALLOWED_HYPERVISOR_ANNOTATIONS-per-arch
kata-deploy: Add per arch ALLOWED_HYPERVISOR_ANNOTATIONS
2025-11-05 18:20:14 -05:00
Manuel Huber
d8953f67c5 ci: Onboard another NVIDIA machine
Let's add a new NVIDIA machine, which later on will be used for CC
related tests.

For now the current tests are skipped in the CC capable machine.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 23:23:08 +01:00
Fabiano Fidêncio
b2ee64a2d6 kata-deploy: scripts: Ensure we don't add duplicated values
Let's now make sure that we don't add duplicated values to any of our
entries, making the script as sane as possible for sequential runs.

Vibed with Cursor's help!

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 19:48:24 +01:00
Fabiano Fidêncio
78ae79d153 kata-deploy: scripts: Add helper functions to avoid duplicated items
Let's add some helper functions, not yet used, to avoid adding
duplicated items.

This idea is an expansion of Choi's idea to avoid setting duplicated
items, and it'll help on making the whole script idempotent on
sequential runs.

Vibed with Cursor's help!

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 19:48:24 +01:00
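The idea behind the helpers in 78ae79d153, appending a value only when it is not already present so that repeated runs leave the result unchanged, can be sketched like this. The sketch is in Python for illustration; the real helpers live in the kata-deploy shell script, and the name `add_unique` and the comma separator are assumptions.

```python
def add_unique(existing: str, value: str, sep: str = ",") -> str:
    """Append value to a separator-delimited list unless it is already there.

    Hypothetical helper illustrating idempotent updates; not the kata-deploy code.
    """
    items = [item for item in existing.split(sep) if item]  # drop empty fragments
    if value not in items:
        items.append(value)
    return sep.join(items)


# Applying the same update twice gives the same result as applying it once.
once = add_unique("foo,bar", "baz")
twice = add_unique(once, "baz")
print(once, twice)
```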
Fabiano Fidêncio
f773368d93 kata-deploy: Add per arch ALLOWED_HYPERVISOR_ANNOTATIONS
I know, this does not simplify things much for now, but it has a good
intent in the background and will serve as base for making the
kata-deploy helm chart more user friendly.

With that said, let's add ALLOWED_HYPERVISOR_ANNOTATIONS per arch, while
adding support to set something like "qemu:foo,bar clh:bar foobar
barfoo". Why? Because in the future we'll have a better way to set this
per shim (and the shim is per arch ...).

More details of what we'll do in the future are being discussed here:
https://github.com/kata-containers/kata-containers/issues/12024

Anyways, the variables are **DELIBERATELY** not exposed to the chart for
now, as those will be later on when addressing the issue mentioned
above.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 19:45:34 +01:00
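One possible reading of the "qemu:foo,bar clh:bar foobar barfoo" format from f773368d93 is sketched below in Python. The interpretation is an assumption, not confirmed by the commit message: entries are separated by spaces, an entry of the form `shim:a,b` applies only to that shim, and a bare entry applies to every shim. The function name `parse_allowed_annotations` is hypothetical.

```python
def parse_allowed_annotations(spec: str) -> tuple[dict[str, list[str]], list[str]]:
    """Split a spec string into (per-shim annotations, annotations for all shims).

    Assumed format, for illustration only: space-separated entries, where
    "shim:a,b" targets one shim and a bare entry targets every shim.
    """
    per_shim: dict[str, list[str]] = {}
    for_all: list[str] = []
    for entry in spec.split():
        if ":" in entry:
            shim, _, values = entry.partition(":")
            bucket = per_shim.setdefault(shim, [])
            for value in values.split(","):
                if value and value not in bucket:  # skip blanks and duplicates
                    bucket.append(value)
        elif entry not in for_all:
            for_all.append(entry)
    return per_shim, for_all


print(parse_allowed_annotations("qemu:foo,bar clh:bar foobar barfoo"))
```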
Fabiano Fidêncio
66e133e096 kata-deploy: Add missing runtimeClasses
When the runtimeClasses were added, as part of 7cfa826804, the
firecracker runtimeClass ended up missing from the dictionary.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 19:07:28 +01:00
Anton Ippolitov
23c46b8a00 docs: Update devmapper containerd plugin name
The Firecracker installation docs had an outdated containerd configuration for the devmapper plugin.
This commit updates the instructions so that they are compatible with more recent versions of containerd.

Signed-off-by: Anton Ippolitov <anton.ippolitov@datadoghq.com>
2025-11-05 18:42:29 +01:00
Fabiano Fidêncio
ace9cf942d tests: guest-pull: Fix names
When added, I've mistakenly used the wrong test-type name, which is now
fixed and should be enough to trigger the tests correctly.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 18:21:48 +01:00
Hyounggyu Choi
4ee2037974 GHA: Run runtime tests on self-hosted runners for P/Z
On IBM actionspz P/Z runners, the following error was observed during
runtime tests:

```
host system doesn't support vsock: stat /dev/vhost-vsock: no such file or directory
```

Since loading the vsock module on the fly is not permitted, this commit
moves the runtime tests back to self-hosted runners for P/Z.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
Hyounggyu Choi
32da38273a agent/tests: Skip if kernel module is not found
On IBM actionspz Z runners, the following error occurs when running
`modprobe`:

```
modprobe: FATAL: Module bridge not found in directory /lib/modules/6.8.0-85-generic
```

Additionally, there are no files under `/lib/modules`, for example:

```
total 0
drwxr-xr-x 1 root root    0 Aug  5 13:09 .
drwxr-xr-x 1 root root 2.0K Oct  1 22:59 ..
```

This commit skips the `test_load_kernel_module` test if the module is
not found or if running `modprobe` is not permitted.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
Hyounggyu Choi
075de4dc62 agent/tests: Skip test if error is EACCES (permission denied)
On IBM actionspz Z runners, write operations on network interfaces
are not allowed, even for the root user.
This commit skips the `add_update_addresses` test if the operation
fails with EACCES (-13, permission denied).

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
Hyounggyu Choi
3f84b623a3 agent/tests: Skip RNG reseeding test on restricted environments
On IBM actionspz Z runners, the ioctl system call is not allowed even
for the root user. There is likely an additional security mechanism
(such as AppArmor or seccomp) in place on Ubuntu runners.
This commit introduces a new helper, `is_permission_error()`,
which skips the test if ioctl operations in `reseed_rng()` are not
permitted.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
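The skip-on-permission-error pattern used by the agent test commits above (075de4dc62 and 3f84b623a3) can be sketched as follows. This is a Python illustration under stated assumptions: the agent tests are written in Rust, the body of `is_permission_error()` shown here is a guess at its intent (treating both EACCES and EPERM as permission errors), and `run_or_skip` is a hypothetical wrapper.

```python
import errno


def is_permission_error(exc: OSError) -> bool:
    """Return True if the OS error means the operation was not permitted.

    Assumed behavior for illustration; the real helper lives in the Rust agent tests.
    """
    return exc.errno in (errno.EACCES, errno.EPERM)


def run_or_skip(operation) -> str:
    """Run operation, reporting "skipped" when the environment forbids it."""
    try:
        operation()
    except OSError as exc:
        if is_permission_error(exc):
            return "skipped"
        raise  # any other failure is still a real test failure
    return "ran"
```

Only permission errors are turned into skips, so unrelated failures still surface on restricted runners.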
Hyounggyu Choi
c2abc4da34 agent/tests: Use detected filesystem for baremounted points
The IBM actionspz Z runners mount /dev as tmpfs, while other systems
use devtmpfs. This difference causes an assertion failure for
test_already_baremounted.
This commit sets the detected filesystem for bare-mounted points
as the expected value.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
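Detecting the filesystem type of a mount point such as /dev, as c2abc4da34 does for its expected value, can be sketched by scanning text in the /proc/mounts format (device, mount point, filesystem type, options). The Python below is an illustration only; the function name `mount_fstype` is hypothetical and the agent tests do this in Rust.

```python
from typing import Optional


def mount_fstype(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None if not mounted.

    Hypothetical helper: expects text in the /proc/mounts format.
    """
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fstype = fields[2]  # a later entry shadows an earlier one
    return fstype


sample = "tmpfs /dev tmpfs rw,nosuid 0 0\nproc /proc proc rw 0 0\n"
print(mount_fstype(sample, "/dev"))
```

On a live system the input would come from reading /proc/mounts; the sketch takes the text as a parameter so it stays testable anywhere.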
Hyounggyu Choi
faa048893d agent/tests: Handle error messages differently based on root filesystem
The root filesystem for IBM actionspz Z runners is `btrfs` instead of `ext4`.
The error message differs when an unprivileged user tries to perform a bind mount.
This commit adjusts the handling of error messages based on the detected root
filesystem type.

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-11-05 16:35:04 +00:00
Fupan Li
0df6c795d8 runtime-rs: disable the default static resource management
Since qemu and cloud-hypervisor now support CPU and memory
hotplug, disable static resource management for qemu and
cloud-hypervisor by default.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-05 16:59:13 +01:00
Fupan Li
02ecab40e4 tests: disable the cpu hotplug test for coco dev runtime
Since qemu-coco-dev-runtime-rs and qemu-coco-dev disable CPU and
memory hotplug by enabling static_sandbox_resource_mgmt, we
should disable the CPU hotplug test for those two runtimes.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-05 16:59:13 +01:00
Fupan Li
1fc05491a2 tests: enable the cpu hotplug test for dragonball etc
Since qemu, cloud-hypervisor and dragonball now support CPU
hotplug on runtime-rs, enable the CPU hotplug test in CI.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-05 16:59:13 +01:00
Fabiano Fidêncio
0a0de4e6e3 Revert "tests: Do not enable NFD on s390x"
This reverts commit c75a46d17f, as NFD now
publishes an s390x image (and also a ppc64le one).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 16:06:33 +01:00
Alex Lyn
8f0dd4c44b runtime-rs: Introduce disable_guest_empty_dir flag
This commit introduces the configuration flag `disable_guest_empty_dir`
to control the placement of Kubernetes emptyDir volumes.

By default, the value is set to `false`, maintaining the current
behavior of creating emptyDirs within the guest VM.

When set to `true`, emptyDirs will be created on the host filesystem.
This is essential for scenarios where users need to share data between
the host and the guest VM via an emptyDir.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-11-05 15:05:45 +08:00