Commit Graph

16989 Commits

Author SHA1 Message Date
Fabiano Fidêncio
aa7e46b5ed tests: Check the multi-snapshotter situation on containerd
One problem we have had for quite some time is containerd misbehaving
when multiple snapshotters are in use.

Although I'm adding this test with my "CoCo" hat in mind, the issue can
happen easily with any other case that requires a different snapshotter
(such as, for instance, firecracker + devmapper).

With this in mind, let's do some stability tests, checking every hour a
simple case of running a few pre-defined containers with runc, and then
running the same containers with kata.

This should be enough to put us in the situation where containerd gets
confused about which snapshotter owns the image layers and breaks on us
(or does not break, showing us that this has been solved ...).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-15 13:35:43 +02:00
Manuel Huber
8221361915 gpu: Use variable to differentiate rootfs variants
With this change we namespace the stage one rootfs tarball name
and use the same name across all uses. This will help overcome
several subtle local build problems.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-10-15 12:39:44 +02:00
Hyounggyu Choi
88c333f2a6 agent: Fix race in tests calling LinuxContainer::new()
We fix the following error:

```
thread 'sandbox::tests::add_and_get_container' panicked at src/sandbox.rs:901:10:
called `Result::unwrap()` on an `Err` value: Create cgroupfs manager

Caused by:
    0: fs error caused by: Os { code: 17, kind: AlreadyExists, message: "File exists" }
    1: File exists (os error 17)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

by ensuring that the cgroup path is unique for tests run in the same millisecond.
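
A minimal sketch of one way to get such uniqueness, assuming the path was
previously derived from a millisecond timestamp alone. The helper name
`unique_cgroup_path` and the path layout are hypothetical and not taken
from the commit; the idea is to append a process-wide atomic counter so
that two calls in the same millisecond still differ:

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

// Process-wide counter: no two calls observe the same value, even
// when they happen within the same millisecond.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

/// Hypothetical helper (not from the commit): builds a cgroup path
/// that is unique per call within one test process.
fn unique_cgroup_path(prefix: &str) -> PathBuf {
    let millis = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_millis())
        .unwrap_or(0);
    let seq = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    PathBuf::from(format!("{}-{}-{}", prefix, millis, seq))
}
```

The timestamp keeps paths distinct across test runs, while the counter
covers tests running in parallel threads of the same process.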

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-10-15 11:32:22 +02:00
Hyounggyu Choi
8412af919d agent/netlink: Attempt to fix ARP and routes tests
test_add_one_arp_neighbor
=========================

We attempt to fix the following error:

```
thread 'netlink::tests::test_add_one_arp_neighbor' panicked at src/netlink.rs:1163:9:
assertion `left == right` failed
  left: ""
     right: "192.0.2.127 lladdr 6a:92:3a:59:70:aa PERMANENT"
```

by adding a sleep to prepare_env_for_test_add_one_arp_neighbor() to
wait for the kernel interfaces to settle.
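
The commit itself adds a plain sleep. As a different technique, a bounded
polling loop can wait only as long as needed. The sketch below is an
illustration under that assumption; `wait_until` and its parameters are
hypothetical names, and the readiness check is left to the caller:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `ready` until it returns true or
/// `timeout` elapses. Returns whether the condition was met.
fn wait_until<F: FnMut() -> bool>(mut ready: F, timeout: Duration, interval: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if ready() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Back off briefly before checking again.
        sleep(interval);
    }
}
```

A test could pass a closure that checks whether the expected neighbor
entry is visible yet, and fail with a clear message on timeout.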

list_routes
===========

We attempt to fix the following error (notice that the available devices
contain "dummy_for_arp"):

```
thread 'netlink::tests::list_routes' panicked at src/netlink.rs:986:14:
Failed to list routes: available devices: [Interface { device: "", name: "lo", IPAddresses: [IPAddress { family: v6,
address: "127.0.0.1", mask: "8", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None },
cached_size: CachedSize { size: 0 } } }, IPAddress { family: v6, address: "169.254.1.1", mask: "31", special_fields:
SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, IPAddress {
family: v4, address: "2001:db8:85a3::8a2e:370:7334", mask: "128", special_fields: SpecialFields { unknown_fields:
UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, IPAddress { family: v4, address: "::1", mask:
"128", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0
} } }], mtu: 65536, hwAddr: "00:00:00:00:00:00", devicePath: "", type_: "", raw_flags: 0, special_fields: SpecialFields
{ unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, Interface { device: "", name:
"enc0", IPAddresses: [IPAddress { family: v6, address: "10.249.65.4", mask: "24", special_fields: SpecialFields {
unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, IPAddress { family: v4,
address: "fe80::4ff:fe57:b3e4", mask: "64", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None
}, cached_size: CachedSize { size: 0 } } }], mtu: 1500, hwAddr: "02:00:04:57:B3:E4", devicePath: "", type_: "",
raw_flags: 0, special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize {
size: 0 } } }, Interface { device: "", name: "docker0", IPAddresses: [IPAddress { family: v6, address: "172.17.0.1",
mask: "16", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize {
size: 0 } } }, IPAddress { family: v4, address: "fe80::42:56ff:fe5c:d9f9", mask: "64", special_fields: SpecialFields {
unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }], mtu: 1500, hwAddr:
"02:42:56:5C:D9:F9", devicePath: "", type_: "", raw_flags: 0, special_fields: SpecialFields { unknown_fields:
UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, Interface { device: "", name: "dummy_for_arp",
IPAddresses: [IPAddress { family: v6, address: "192.0.2.2", mask: "24", special_fields: SpecialFields { unknown_fields:
UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } } }, IPAddress { family: v4, address:
"fe80::f4f2:64ff:fe46:2b01", mask: "64", special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None },
cached_size: CachedSize { size: 0 } } }], mtu: 1500, hwAddr: "4A:73:DE:A3:07:64", devicePath: "", type_: "", raw_flags:
0, special_fields: SpecialFields { unknown_fields: UnknownFields { fields: None }, cached_size: CachedSize { size: 0 } }
}]

Caused by:
    0: error looking up device 19888
    1: Received a netlink error message No such device (os error 19)
```

by calling clean_env_for_test_add_one_arp_neighbor() at the start of the
test.

However this fix is uncertain: the original assumption for the fix was that
the "dummy_for_arp" interface left over from test_add_one_arp_neighbor was
the cause of the error. But (3) below shows that running list_routes in
isolation while that interface is present is NOT enough to repro the error:

1. Running all tests + no clean_env in list_routes  => list_routes FAILS  (before this PR)
2. Running all tests + clean_env in list_routes     => list_routes PASSES (after this PR)
3. Running only list_routes + dummy_for_arp present => list_routes PASSES (manual test, see below)

```
$ ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
        valid_lft forever preferred_lft forever
    inet 169.254.1.1/31 brd 169.254.1.1 scope global lo
        valid_lft forever preferred_lft forever
    inet6 2001:db8:85a3::8a2e:370:7334/128 scope global
        valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
        valid_lft forever preferred_lft forever
2: enc0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 02:00:01:02:e2:47 brd ff:ff:ff:ff:ff:ff
    inet 10.240.64.4/24 metric 100 brd 10.240.64.255 scope global dynamic enc0
        valid_lft 159sec preferred_lft 159sec
    inet6 fe80::1ff:fe02:e247/64 scope link
        valid_lft forever preferred_lft forever
311: dummy_for_arp: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether ee:79:66:3a:dc:bc brd ff:ff:ff:ff:ff:ff
    inet 192.0.2.2/24 scope global dummy_for_arp
        valid_lft forever preferred_lft forever
    inet6 fe80::4c2e:83ff:fe7d:ef00/64 scope link
        valid_lft forever preferred_lft forever
$ sudo -E PATH=$PATH make test
../../utils.mk:162: "WARNING: s390x-unknown-linux-musl target is unavailable"
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.25s
Running unittests src/main.rs (target/s390x-unknown-linux-gnu/debug/deps/kata_agent-b2b5b200deca712e)

running 1 test
test netlink::tests::list_routes ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 224 filtered out; finished in 0.00s
```

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-10-15 11:32:22 +02:00
Paul Meyer
06ed957a45 virtcontainers: fix nydus cleanup on rootfs unmount
This was discovered by @sprt in https://github.com/kata-containers/kata-containers/pull/10243#discussion_r2373709407.
Checking for state.Fstype makes no sense as we know it is empty.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-10-15 09:22:51 +02:00
Zvonko Kaiser
10f8ec0c20 cdi: Add Crate remove Github Hash
Use CDI exclusively from crates.io and not from a GH repository.
Cargo can easily check if a new version is available and we can
bump it far more easily if needed.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-15 09:22:20 +02:00
Greg Kurz
3507b2038e Merge pull request #11936 from ldoktor/ocp-helm
ci.ocp: Use helm to install kata
2025-10-14 18:22:28 +02:00
Lukáš Doktor
bdb0afc4e0 ci.ocp: Fix incorrectly quoted argument
With the shellcheck fixes we accidentally quoted the "-n NAMESPACE"
argument where we should have used an array instead, which led to oc
treating it as a pod name and returning an error.
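
The actual fix is in a bash script; the sketch below only illustrates the
underlying argv distinction in Rust. Both function names are
hypothetical. A quoted "-n NAMESPACE" reaches the program as a single
argument, while an array expansion yields the flag and its value as two
separate arguments:

```rust
/// Wrong: flag and value fused into one argv entry, which is what a
/// quoted "-n NAMESPACE" string turns into.
fn fused_namespace_args(namespace: &str) -> Vec<String> {
    vec![format!("-n {}", namespace)]
}

/// Right: flag and value as separate argv entries, which is what a
/// bash array expansion such as "${args[@]}" yields.
fn split_namespace_args(namespace: &str) -> Vec<String> {
    vec!["-n".to_string(), namespace.to_string()]
}
```

Either vector could be handed to `std::process::Command::args`; only the
second is parsed by the callee as an option followed by its value.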

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2025-10-14 17:59:33 +02:00
Lukáš Doktor
f891f340bc ci.ocp: Use helm to install kata
which is the currently supported way to deploy kata-containers directly.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2025-10-14 17:59:33 +02:00
Aurélien Bombo
0c6fcde198 Merge pull request #11918 from fidencio/topic/builds-qemu-use-liburing-newer-than-2.2
builds: qemu: Use a liburing newer than 2.2
2025-10-14 10:17:16 -05:00
Steve Horsman
363701d767 Merge pull request #11915 from stevenhorsman/ibm-runner-followups-part-i
ci: Add protobuf-compiler dependencies
2025-10-14 13:28:45 +01:00
Fabiano Fidêncio
2ad81c4797 build: qemu: Fix cache logic
We need to ensure that any change on the Dockerfile (and its dir) leads
to the build being retriggered, rather than using the cached version.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-14 12:17:43 +02:00
Fabiano Fidêncio
2f73e34e33 builds: qemu: Use a liburing newer than 2.2
Due to a potential regression introduced by:
984a32f17e (565f3835aaed6321caab4f7c4f8560a687f6000b_379_386)

Reported-by: Aurélien Bombo <abombo@microsoft.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-14 12:17:28 +02:00
stevenhorsman
8ce714cf97 ci: Add protobuf-compiler dependencies
We are seeing more protoc related failures on the new
runners, so try adding the protobuf-compiler dependency
to these steps to see if it helps.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-10-14 10:58:58 +01:00
Fabiano Fidêncio
b0b0038689 versions: Bump QEMU to 10.1.1
QEMU 10.1.1 was released on October 8th, 2025; let's bump it on our
side.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-13 23:52:01 +02:00
Fabiano Fidêncio
d46474cfc0 tests: Run apt-get update before installing a package
Otherwise it'll just break. :-)

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-13 23:33:46 +02:00
Fabiano Fidêncio
fb43d3419f build: Fix nvidia kernel breakage
In commit 9602ba6ccc, from February this
year, we introduced a check to ensure that the files needed for
signing the kernel build are present. However, we noticed last week
that the workflow made a fair number of wrong assumptions. :-)

Zvonko fixed the majority of those, but this bit was left over, and it
would cause breakages when using a cached kernel ... while passing when
building new kernels.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-13 19:28:40 +02:00
Fupan Li
8b06f3d95d Merge pull request #11905 from Apokleos/coldplug-scsidev
runtime-rs: Support virtio-scsi for initdata within non-TEE
2025-10-11 16:11:39 +08:00
Xuewei Niu
5acb6d8e13 Merge pull request #11863 from lifupan/fupan_blk_remove
runtime-rs: add the block device hot unplug for clh
2025-10-11 10:31:48 +08:00
Aurélien Bombo
ff973a95c8 Merge pull request #11916 from zvonkok/fix-kernel-module-signing
gpu: Fix kernel module signing
2025-10-10 17:17:08 -05:00
Zvonko Kaiser
b00013c717 kernel: Add KBUILD_SIGN_PIN pass through
This is needed so that the kernel setup picks up the correct
config values from our fragments directories.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-10 15:45:34 -04:00
Zvonko Kaiser
37bd5e3c9d gpu: Add kernel CONFIG check
We need to make sure that the kernel we're using has the
correct configs set, otherwise the module signing will not work.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2025-10-10 15:45:34 -04:00
Fabiano Fidêncio
e782d1ad50 ci: k8s: Test experimental_force_guest_pull
Now that we have added the ability to deploy kata-containers with
experimental_force_guest_pull configured, let's make sure we test it to
avoid any kind of regressions.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-10 20:08:10 +02:00
Fabiano Fidêncio
1bc89d09ae tests: Consider SNAPSHOTTER in the cluster name
Otherwise we have no way to differentiate running tests on qemu-coco-dev
with different snapshotters.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-10 20:08:10 +02:00
Fabiano Fidêncio
496e255ea2 build: Fix KBUILD_SIGN_PIN usage
What was done in the past, trying to set the env var in the same step
where it is used, simply does not work.

Instead, we need to properly set it through the `env` set up, as done
now.

We're also bumping the kata_config_version to ensure we retrigger the
kernel builds.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-10 15:25:10 +02:00
Paul Meyer
5ae891ab46 versions: bump opa 1.6.0 -> 1.9.0
Bumping opa to latest release.

Signed-off-by: Paul Meyer <katexochen0@gmail.com>
2025-10-10 10:58:51 +02:00
Steve Horsman
a570fdc0fd Merge pull request #11909 from kata-containers/ibm-runners-test
ci: Enable new ibm runners
2025-10-10 09:42:53 +01:00
stevenhorsman
8dcd91cf5f ci: Enable new ibm runners
We have some scalable s390x and ppc runners, so
start to use them for build and test, to improve
the throughput of our CI

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Co-authored-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
2025-10-10 09:42:06 +01:00
Fabiano Fidêncio
06a3bbdd44 ci: k8s: coco: Add "Report tests" step
For some reason we didn't have the "Report tests" step as part of the
TEE jobs. This step immensely helps to check which tests are failing and
why, so let's add it while touching the workflow.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-10 09:51:59 +02:00
Fabiano Fidêncio
a1f90fe350 tests: k8s: Unify k8s TEE tests
There's no reason to have the code duplication between the SNP / TDX
tests for CoCo, as those are basically using the same configuration
nowadays.

Note that in the TEE case the nydus-snapshotter is deployed once by the
admin rather than on every run, so I'm removing the nydus-snapshotter
steps to make it clear that those steps are not performed by the CI.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-10 09:51:59 +02:00
Alex Lyn
4c386b51d9 runtime-rs: Add support for handling virtio-scsi devices
As virtio-scsi has been set as the default block device driver, the
runtime also needs to handle the virtio-scsi info correctly, especially
the SCSI address required by the kata-agent handling logic.

Getting the scsi_addr and assigning it to the kata agent device id
is enough, and this commit does just that.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-10-10 11:31:04 +08:00
Fupan Li
4002a91452 runtime-rs: add the block device hot unplug for clh
Since runtime-rs supports block device hotplug when
creating new containers, and the device should also be
removed when the container stops, add block
device unplug for clh.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-10-10 10:02:12 +08:00
Zvonko Kaiser
afbec780a9 Merge pull request #11903 from zvonkok/ppcie
gpu: PPCIE support DGX like systems
2025-10-09 21:06:41 -04:00
Aurélien Bombo
a3a45429f6 Merge pull request #11865 from microsoft/danmihai1/nested-configmap-secret
tests: k8s-nested-configmap-secret policy
2025-10-09 11:33:50 -05:00
Alex Lyn
b42ef09ffb Merge pull request #11888 from spuzirev/main
runtime: fix "num-queues expects uint64" error with virtio-blk
2025-10-09 20:21:32 +08:00
Xuewei Niu
2a43bf37ed Merge pull request #11894 from M-Phansa/main
runtime: fix device typo
2025-10-09 16:53:40 +08:00
Alex Lyn
a54d95966b runtime-rs: Support virtio-scsi for initdata within non-TEE
This commit introduces support for selecting `virtio-scsi` as the
block device driver for QEMU during initial setup.

The primary goal is to resolve a conflict in non-TEE environments:
1. The global block device configuration defaults to `virtio-scsi`.
2. The `initdata` device driver was previously designed and hardcoded
to `virtio-blk-pci`.
3. This conflict prevented unified block device usage.

By allowing `virtio-scsi` to be configured at cold boot, the `initdata`
device can now correctly adhere to the global setting, eliminating the
need for a hardcoded driver and ensuring consistent block device
configuration across all supported devices (excluding rootfs).

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-10-09 15:52:33 +08:00
Xuewei Niu
5208ee4ec0 Merge pull request #11674 from was-saw/dragonball_seccomp
runtime-rs: add seccomp support for dragonball
2025-10-09 15:01:15 +08:00
wangxinge
8e1b33cc14 docs: add document for seccomp
This commit adds a document describing how to use
seccomp in runtime-rs.

Signed-off-by: wangxinge <wangxinge@bupt.edu.cn>
2025-10-09 13:25:17 +08:00
wangxinge
2abf6965ff dragonball: add seccomp support for dragonball
This commit modifies the seccomp framework to
support different restrictions for different threads.

Signed-off-by: wangxinge <wangxinge@bupt.edu.cn>
2025-10-09 13:25:17 +08:00
wangxinge
bb6fb8ff39 runtime-rs: add seccomp support for dragonball
The seccomp feature in Dragonball currently has a basic framework,
but the actual restriction rules are empty.

This pull request includes the following changes:
- Modify the relevant configuration files.
- Modify the seccomp framework to support different restrictions for different threads.
- Add new seccomp rules for the modified framework.

This commit primarily implements the first and third changes for runtime-rs.

Fixes: #11673

Signed-off-by: wangxinge <wangxinge@bupt.edu.cn>
2025-10-09 13:25:17 +08:00
Zvonko Kaiser
91739d4425 gpu: PPCIE support DGX like systems
For DGX-like systems we need additional binaries and libraries to
enable both the Kata and CoCo use cases.

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>

Update tools/osbuilder/rootfs-builder/nvidia/nvidia_rootfs.sh

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-09 00:00:12 +00:00
Dan Mihai
364d3cded0 tests: k8s-nested-configmap-secret policy
Add auto-generated agent policy in k8s-nested-configmap-secret.bats.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-10-08 23:37:54 +00:00
Sergei Puzyrev
62b12953c7 runtime: fix "num-queues expects uint64" error with virtio-blk
Unneeded type-conversion was removed.

Fixes #11887

Signed-off-by: Sergei Puzyrev <spuzirev@gmail.com>
2025-10-08 17:09:22 -05:00
Adeet Phanse
4e4f9c44ae runtime: fix device typo
Fix device typo in dragonball / runtime-rs / runtime.

Signed-off-by: Adeet Phanse <adeet.phanse@mongodb.com>
2025-10-08 17:08:27 -05:00
Aurélien Bombo
d954932876 Merge pull request #11883 from kata-containers/sprt/zizmor-fixes3
ci: zizmor: Address all issues
2025-10-08 17:01:48 -05:00
Aurélien Bombo
07645cf58b ci: actionlint: Address issues and set as required
Address the issues just introduced and set actionlint as required by
removing the path filter.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2025-10-08 16:55:27 -05:00
Aurélien Bombo
b3a551d438 ci: zizmor: Reestablish as required test
We can re-require this now that we've addressed all the issues.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2025-10-08 16:55:27 -05:00
Aurélien Bombo
5a4ddb8c71 ci: zizmor: Fix all template-injection alerts
Fix all instances of template injection by using environment variables as
recommended by Zizmor, instead of directly injecting values into the
commands.
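
The commit changes GitHub Actions workflow files; the sketch below only
illustrates the same principle in Rust, assuming a POSIX `sh` is on the
PATH. The function name and the `UNTRUSTED_VALUE` variable are
hypothetical. The script text stays constant and the untrusted value
travels through the environment, so the shell never parses it as code:

```rust
use std::process::Command;

/// Hypothetical sketch: runs a fixed shell script and hands the
/// untrusted value over through the environment.
fn echo_via_env(untrusted: &str) -> std::io::Result<String> {
    let output = Command::new("sh")
        .arg("-c")
        // The script text is constant; only the variable's value varies.
        .arg("printf '%s' \"$UNTRUSTED_VALUE\"")
        .env("UNTRUSTED_VALUE", untrusted)
        .output()?;
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

Had the value been spliced into the script string instead, an input
containing `; some-command` would have been executed by the shell.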

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2025-10-08 16:55:26 -05:00
Aurélien Bombo
7b203d1b43 ci: zizmor: Ignore dangerous-triggers audit for known safe usage
The two ignored cases are strictly necessary for the CI to work today, and we
have various security mitigations in place.

Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
2025-10-08 16:55:08 -05:00