Commit Graph

1736 Commits

Author SHA1 Message Date
Fabiano Fidêncio
75996945aa kata-deploy: try-kata-values.yaml -> values.yaml
This makes the user experience better, as the admin can deploy Kata
Containers without having to download / set up any additional file.

Of course, if the admin wants something more specific, examples are
provided.

Tests and documentation are updated to reflect this change.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-17 12:16:17 +01:00
Fabiano Fidêncio
2e000129a9 kata-deploy: tests: Add example values files for easy Kata deployment
Add three example values files to make it easier for users to try out
different Kata Containers configurations:

- try-kata.values.yaml: Enables all available shims
- try-kata-tee.values.yaml: Enables only TEE/confidential computing shims
- try-kata-nvidia-gpu.values.yaml: Enables only NVIDIA GPU shims

These files use the new structured configuration format and serve as
ready-to-use examples for common deployment scenarios.

Also update the README.md to document these example files and how to use them.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-15 09:36:14 +01:00
Fabiano Fidêncio
8717312599 tests: Migrate helm_helper to use new structured configuration
Update the helm_helper function in gha-run-k8s-common.sh to use the
new structured configuration format instead of the legacy env.* format.

All possible settings have been migrated to the structured format:
- HELM_DEBUG now sets root-level 'debug' boolean
- HELM_SHIMS now enables shims in structured format with automatic
  architecture detection based on shim name
- HELM_DEFAULT_SHIM now sets per-architecture defaultShim mapping
- HELM_EXPERIMENTAL_SETUP_SNAPSHOTTER now sets snapshotter.setup array
- HELM_ALLOWED_HYPERVISOR_ANNOTATIONS now sets per-shim allowedHypervisorAnnotations
- HELM_SNAPSHOTTER_HANDLER_MAPPING now sets per-shim containerd.snapshotter
- HELM_AGENT_HTTPS_PROXY and HELM_AGENT_NO_PROXY now set per-shim agent proxy settings
- HELM_PULL_TYPE_MAPPING now sets per-shim forceGuestPull/guestPull settings
- HELM_EXPERIMENTAL_FORCE_GUEST_PULL now sets per-shim forceGuestPull/guestPull

The test helper automatically determines supported architectures for
each shim (e.g., qemu-se supports s390x, qemu-cca supports arm64,
qemu-snp/qemu-tdx support amd64, etc.) and applies per-shim settings
to the appropriate shims based on HELM_SHIMS.

Only HELM_HOST_OS remains in legacy env.* format as it doesn't have
a structured equivalent yet.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-15 09:36:14 +01:00
Dan Mihai
5cc1024936 ci: k8s: AUTO_GENERATE_POLICY for coco-dev
Re-enable AUTO_GENERATE_POLICY for coco-dev Hosts, unless PULL_TYPE is
"experimental-force-guest-pull", or the caller specified a different
value for AUTO_GENERATE_POLICY.

Auto-generated Policy has been disabled accidentally and recently for
these Hosts, by a GHA workflow change.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-14 15:53:34 +01:00
stevenhorsman
b7abcc4c37 tests: Fix wildcard skip in k8s-cpu-ns
The formatting wasn't quite right, so the `qemu-coco-dev-runtime-rs`
hypervisor wasn't skipping this test

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-11-13 14:21:05 +00:00
stevenhorsman
b51af53bc7 tests/k8s: call teardown_common in some policy tests
The teardown_common will print the description of the running pods, kill
them all and print the system's syslogs afterwards.

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-11-13 14:18:43 +00:00
stevenhorsman
0335012824 tests/k8s: Enable tests for qemu-coco-dev-runtime-rs
Add the runtime class to the non-tee tests and
enable it to run in the test code

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-11-13 14:18:43 +00:00
stevenhorsman
a1ddd2c3dd kata-deploy: Add kata-qemu-coco-dev-runtime-rs runtime class
Add the runtime class and shim references for the new
 non-tee runtime-rs class

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
2025-11-13 14:18:43 +00:00
Hyounggyu Choi
2dec247a54 Merge pull request #12038 from lifupan/fix_smaller-memeory
runtime-rs: fix the issue of hot-unplug memory smaller
2025-11-12 11:22:04 +01:00
Zvonko Kaiser
d783e59b42 Merge pull request #12055 from fidencio/topic/coco-bump-trustee
versions: Bump Trustee
2025-11-12 02:48:16 -05:00
Zvonko Kaiser
76e4e6bc24 Merge pull request #12061 from Apokleos/correct-unexpected-cap
tests: Correct unexpected capability for policy failure test
2025-11-11 12:20:33 -05:00
Fabiano Fidêncio
d82eb8d0f1 ci: Drop docker tests
We have had those tests broken for months. It's time to get rid of
those.

NOTE that we could easily revert this commit and re-add those tests as
soon as we find someone to maintain and be responsible for such
integration.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-11 17:02:02 +01:00
Fabiano Fidêncio
2d2b0de160 tests: kbs: Try to get the pod logs on deployment failure
As this helps immensely to figure out what went wrong with the
deployment.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-11 08:08:24 +01:00
Fabiano Fidêncio
58df06d90e versions: Bump Trustee
This is a bump pre-release, which brings several fixes and some
improvements related to initData, and NVIDIA's remote verifier.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-11 08:08:05 +01:00
Alex Lyn
c225cba0e6 tests: Correct unexpected capability for policy failure test
The test case designed to verify policy failures due to an "unexpected
capability" was misconfigured. It was using "CAP_SYS_CHROOT" as the
unexpected capability to be added.

This configuration was flawed for two main reasons:
1.Incorrect Syntax: Kubernetes Pod specs expect capability names without
the "CAP_" prefix (e.g., "SYS_CHROOT", not "CAP_SYS_CHROOT").
This made the test case's premise incorrect from a K8s API perspective.
2.Part of Default Set: "SYS_CHROOT" is already included in the
`default_caps` list for a standard container. Therefore, adding it would
 not trigger a policy violation, defeating the purpose of the
 "unexpected capability" test.

Furthermore, a related issue was observed where a malformed capability
like "CAP_CAP_SYS_CHROOT" was being generated, causing parsing failures
in the `oci-spec-rs` library. This was a symptom of incorrect string
manipulation when handling capabilities.

This commit corrects the test by selecting "SYS_NICE" as the unexpected
capability. "SYS_NICE" is a more suitable choice because:
- It is a valid Linux capability.
- It is relatively harmless.
- It is **not** part of the default capability set defined in
  `genpolicy-settings.json`.

By using "SYS_NICE", the test now accurately simulates a scenario where
a Pod requests a legitimate but non-default capability, which the policy
(generated from a baseline Pod without this capability) should correctly
reject. This change fixes the test's logic and also resolves the
downstream `oci-spec-rs` parsing error by ensuring only valid capability
names are processed.

Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
2025-11-11 14:06:30 +08:00
Fabiano Fidêncio
92226d0a19 tests: nvidia: Be prepared for TDX
Thankfully there's only one piece that's still SNP specific (for the
supported TEEs).  Let's adjust it so we can have an easy and smooth
execution when adding a TDX CI machine.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Fabiano Fidêncio
4d314e8676 tests: nvidia: nims: Adjust to CC
There are several changes needed in order to get this test working with
CC, and yet we still are skipping it.

Basically, we need to:
* Pull an authenticated image inside the guest, which requires:
 * Using Trustee to release the credential
   * We still depend on a PR to be merged on Trustee side
     * https://github.com/confidential-containers/trustee/pull/1035
   * We still depend on a Trustee bump (including the PR above) on our
     side

Apart from those changes, I ended up "duplicating" the tests by adding a
"-tee" version of those, which already have:
* The proper kbs annotations set up
* Dropped host mounts
* Increases the memory needed

Last but not least, as "bats" probably means "being a terrible script",
I had to re-arrange a few things otherwise the tests would not even run
due to bats-isms that I am sincerely not able to pin-point.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Fabiano Fidêncio
8cedd96d54 tests: nvidia: k8s: Enforce experimental_force_guest_pull
We added the tests using virtio-9p as we knew it'd require incremental
changes to be able to use any kind of guest-pull method.

Now, as in the coming commits we'll be actually ensuring that guest-pull
works and is in use, we can enforce the experimental_force_guest_pull
usage for the nvidia cases.

Note: We're using experimental_force_guest_pull instead of
nydus-snapshotter due to stability concerns with the snapshotter.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Fabiano Fidêncio
e85cf83573 k8s: tests: Fix default for EXPERIMENTAL_FORCE_GUEST_PULL
It takes either a shim name or "", but we were treating this (thankfully
only in this specific file) as a boolean.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Manuel Huber
8b39468b36 tests: nvidia: Logging for NIM
Adjust output to the setup_file and teardown_file behavior.
With this, we will be able to observe relevant logging rather than
adding to the output variable.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-11-10 13:01:30 +01:00
Fabiano Fidêncio
812191c1f3 tests: nvidia: Do not deploy NFD on nvidia-gpu cases
As it'll come from the GPU Operator for now.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-10 13:01:30 +01:00
Dan Mihai
df7ee2dd38 ci: k8s: AUTO_GENERATE_POLICY for cbl-mariner
Auto-generate policy on cbl-mariner Hosts if the user didn't
specify an AUTO_GENERATE_POLICY value.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Dan Mihai
53acb74f26 genpolicy: adapt to new AKS pause container behavior
The new image reference has changed to mcr.microsoft.com/oss/v2/kubernetes/pause:3.6
from mcr.microsoft.com/oss/kubernetes/pause:3.6.

The new image uses by default UID=0, GID=0 while the older. The older image had:
UID=65535, GID=65535.

There is a new pause_container_id_policy field in genpolicy-settings.json, informing
genpolicy about the way AdditionalGids gets updated - "v1" for the older behavior
and "v2" for the newer AKS version:
- When using v1, the default value of AdditionalGids is {65535}.
- When using v2, the default value of AdditionalGids is {}.

UID=65535 and GID=65535 are still hard-coded by default in genpolicy-settings.json.
We might be able to remove/ignore these fields in the future, if we'll stop relying
on policy::KataSpec::get_process_fields to use these fields.

A new CI function adapt_common_policy_settings_for_aks() changes the pause container
UID, GID, pause_container_id_policy, and image ref settings values when testing on
AKS Hosts - i.e., when testing coco-dev or mariner Hosts.

The genpolicy workarounds for the unexpected behavior with guest pull enabled have
been improved to use the current container's GID instead of hard-coding GID=0 as the
guest pull default. Also, AdditionalGids gets updated when the current container's GID is
changing, instead of always changing the AdditionalGids at the very end of
policy::AgentPolicy::get_container_process(), when the relevant evolution of the GID
value was no longer available.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Dan Mihai
cacd37ee6e tests: genpolicy: restore test settings for non-Coco configMap
These settings got broken recently because the non-CoCo tests were
disabled for unrelated reasons.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-08 00:00:09 +01:00
Manuel Huber
c6dc176a03 tests: nvidia: cc: Enable NIMs tests
Same deal as the previous commut, just enabling the tests here, with the
same list of improvements that we will need to go through in order to
get is working in a perfect way.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Manuel Huber
8ca77f2655 tests: nvidia: cc: Run CUDA vectorAdd tests on CC mode
While the primary goal of this change is to detect regressions to the
NVIDIA SNP GPU scenario, various improvements to reflect a more
realistic CC setting are planned in subsequent changes, such as:

* moving away from the overlayfs snapshotter
* disabling filesystem sharing
* applying a pod security policy
* activating the GPUs only after attestation
* using a refined approach for GPU cold-plugging without requiring
  annotations
* revisiting pod timeout and overhead parameters (the podOverhead value
  was increased due to CUDA vectorAdd requiring about 6Gi of
  podOverhead, as well as the inference and embedqa requiring at least
  12Gi, respectively, 14Gi of podOverhead to run without invoking the
  host's oom-killer. We will revisit this aspect after addressing
  points 1. and 2.)

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-06 16:28:33 +01:00
Fupan Li
bfe8da6c8a tests: disable the qemu-runtime-rs cpu hotplug test
Since there's something wrong with the cpu hotplug
on qemu-runtime-rs, thus disable this test temporally.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-06 21:37:01 +08:00
Manuel Huber
d8953f67c5 ci: Onboard another NVIDIA machine
Let's add a new NVIDIA machine, which later on will be used for CC
related tests.

For now the current tests are skipped in the CC capable machine.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 23:23:08 +01:00
Fupan Li
02ecab40e4 tests: disable the cpu hotplug test for coco dev runtime
Since qemu-coco-dev-runtime-rs and qemu-coco-dev had disabled the
cpu&memory hotplug by enable static_sandbox_resource_mgmt, thus
we should disable the cpu hotplug test for those two runtime.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-05 16:59:13 +01:00
Fupan Li
1fc05491a2 tests: enable the cpu hotplug test for dragonball etc
Since the qemu, cloud-hypervisor and dragonball had supported the
cpu hotplug on runtime-rs, thus enable the cpu hotplug test in CI.

Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
2025-11-05 16:59:13 +01:00
Fabiano Fidêncio
0a0de4e6e3 Revert "tests: Do not enable NFD on s390x"
This reverts commit c75a46d17f, as NFD now
publishes an s390x image (and also a ppc64le one).

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-05 16:06:33 +01:00
Fabiano Fidêncio
5b01eaf929 tests: Align kata-deploy helm's uninstall
Let's use the same method both on the kata-deploy and k8s tests.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-04 09:29:35 +01:00
Dan Mihai
6a4c336ca0 Merge pull request #12016 from microsoft/danmihai1/early-wait-abort
tests: k8s: reduce test time for unexpected CreateContainerRequest errors
2025-11-03 12:04:56 -08:00
Fabiano Fidêncio
3107533953 tests: Adjust to runtimeClass creation by the chart
It's just a follow-up on the previous commit where we move away from the
runtimeClass creation inside the script, and instead we do it using the
chart itself.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-03 17:32:18 +01:00
Fabiano Fidêncio
14039c9089 golang: Update to 1.24.9
In order to fix:
```

=== Running govulncheck on containerd-shim-kata-v2 ===
 Vulnerabilities found in containerd-shim-kata-v2:
=== Symbol Results ===

Vulnerability #1: GO-2025-4015
    Excessive CPU consumption in Reader.ReadResponse in net/textproto
  More info: https://pkg.go.dev/vuln/GO-2025-4015
  Standard library
    Found in: net/textproto@go1.24.6
    Fixed in: net/textproto@go1.24.8
    Vulnerable symbols found:
      #1: textproto.Reader.ReadResponse

Vulnerability #2: GO-2025-4014
    Unbounded allocation when parsing GNU sparse map in archive/tar
  More info: https://pkg.go.dev/vuln/GO-2025-4014
  Standard library
    Found in: archive/tar@go1.24.6
    Fixed in: archive/tar@go1.24.8
    Vulnerable symbols found:
      #1: tar.Reader.Next

Vulnerability #3: GO-2025-4013
    Panic when validating certificates with DSA public keys in crypto/x509
  More info: https://pkg.go.dev/vuln/GO-2025-4013
  Standard library
    Found in: crypto/x509@go1.24.6
    Fixed in: crypto/x509@go1.24.8
    Vulnerable symbols found:
      #1: x509.Certificate.Verify
      #2: x509.Certificate.Verify

Vulnerability #4: GO-2025-4012
    Lack of limit when parsing cookies can cause memory exhaustion in net/http
  More info: https://pkg.go.dev/vuln/GO-2025-4012
  Standard library
    Found in: net/http@go1.24.6
    Fixed in: net/http@go1.24.8
    Vulnerable symbols found:
      #1: http.Client.Do
      #2: http.Client.Get
      #3: http.Client.Head
      #4: http.Client.Post
      #5: http.Client.PostForm
      Use '-show traces' to see the other 9 found symbols

Vulnerability #5: GO-2025-4011
    Parsing DER payload can cause memory exhaustion in encoding/asn1
  More info: https://pkg.go.dev/vuln/GO-2025-4011
  Standard library
    Found in: encoding/asn1@go1.24.6
    Fixed in: encoding/asn1@go1.24.8
    Vulnerable symbols found:
      #1: asn1.Unmarshal
      #2: asn1.UnmarshalWithParams

Vulnerability #6: GO-2025-4010
    Insufficient validation of bracketed IPv6 hostnames in net/url
  More info: https://pkg.go.dev/vuln/GO-2025-4010
  Standard library
    Found in: net/url@go1.24.6
    Fixed in: net/url@go1.24.8
    Vulnerable symbols found:
      #1: url.JoinPath
      #2: url.Parse
      #3: url.ParseRequestURI
      #4: url.URL.Parse
      #5: url.URL.UnmarshalBinary

Vulnerability #7: GO-2025-4009
    Quadratic complexity when parsing some invalid inputs in encoding/pem
  More info: https://pkg.go.dev/vuln/GO-2025-4009
  Standard library
    Found in: encoding/pem@go1.24.6
    Fixed in: encoding/pem@go1.24.8
    Vulnerable symbols found:
      #1: pem.Decode

Vulnerability #8: GO-2025-4008
    ALPN negotiation error contains attacker controlled information in
    crypto/tls
  More info: https://pkg.go.dev/vuln/GO-2025-4008
  Standard library
    Found in: crypto/tls@go1.24.6
    Fixed in: crypto/tls@go1.24.8
    Vulnerable symbols found:
      #1: tls.Conn.Handshake
      #2: tls.Conn.HandshakeContext
      #3: tls.Conn.Read
      #4: tls.Conn.Write
      #5: tls.Dial
      Use '-show traces' to see the other 4 found symbols

Vulnerability #9: GO-2025-4007
    Quadratic complexity when checking name constraints in crypto/x509
  More info: https://pkg.go.dev/vuln/GO-2025-4007
  Standard library
    Found in: crypto/x509@go1.24.6
    Fixed in: crypto/x509@go1.24.9
    Vulnerable symbols found:
      #1: x509.CertPool.AppendCertsFromPEM
      #2: x509.Certificate.CheckCRLSignature
      #3: x509.Certificate.CheckSignature
      #4: x509.Certificate.CheckSignatureFrom
      #5: x509.Certificate.CreateCRL
      Use '-show traces' to see the other 27 found symbols

Vulnerability #10: GO-2025-4006
    Excessive CPU consumption in ParseAddress in net/mail
  More info: https://pkg.go.dev/vuln/GO-2025-4006
  Standard library
    Found in: net/mail@go1.24.6
    Fixed in: net/mail@go1.24.8
    Vulnerable symbols found:
      #1: mail.AddressParser.Parse
      #2: mail.AddressParser.ParseList
      #3: mail.Header.AddressList
      #4: mail.ParseAddress
      #5: mail.ParseAddressList
```

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-03 16:57:22 +01:00
Dan Mihai
c563ee99fa tests: policy-rc: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

not ok 1 Successful replication controller with auto-generated policy in 123335ms
ok 2 Policy failure: unexpected container command in 14601ms
ok 3 Policy failure: unexpected volume mountPath in 14443ms
ok 4 Policy failure: unexpected host device mapping in 14515ms
ok 5 Policy failure: unexpected securityContext.allowPrivilegeEscalation in 14485ms
ok 6 Policy failure: unexpected capability in 14382ms
ok 7 Policy failure: unexpected UID = 1000 in 14578ms

After this change:

not ok 1 Successful replication controller with auto-generated policy in 17108ms
ok 2 Policy failure: unexpected container command in 14427ms
ok 3 Policy failure: unexpected volume mountPath in 14636ms
ok 4 Policy failure: unexpected host device mapping in 14493ms
ok 5 Policy failure: unexpected securityContext.allowPrivilegeEscalation in 14554ms
ok 6 Policy failure: unexpected capability in 15087ms
ok 7 Policy failure: unexpected UID = 1000 in 14371ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
319400dc0d tests: policy-pvc: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

not ok 1 Successful pod with auto-generated policy in 94852ms
ok 2 Policy failure: unexpected device mount in 17807ms

After this change:

not ok 1 Successful pod with auto-generated policy in 35194ms
ok 2 Policy failure: unexpected device mount in 21355ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
1914fcb812 tests: policy-log: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

not ok 1 Logs empty when ReadStreamRequest is blocked in 102257ms

After this change:

not ok 1 Logs empty when ReadStreamRequest is blocked in 17339ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
a0bd9e02ca tests: policy-job: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

not ok 1 Successful job with auto-generated policy in 107111ms
ok 2 Policy failure: unexpected environment variable in 7920ms
ok 3 Policy failure: unexpected command line argument in 7874ms
ok 4 Policy failure: unexpected emptyDir volume in 7823ms
ok 5 Policy failure: unexpected projected volume in 7812ms
ok 6 Policy failure: unexpected readOnlyRootFilesystem in 7903ms
ok 7 Policy failure: unexpected UID = 222 in 7720ms

After this change:

not ok 1 Successful job with auto-generated policy in 10271ms
ok 2 Policy failure: unexpected environment variable in 8018ms
ok 3 Policy failure: unexpected command line argument in 7886ms
ok 4 Policy failure: unexpected emptyDir volume in 7621ms
ok 5 Policy failure: unexpected projected volume in 7843ms
ok 6 Policy failure: unexpected readOnlyRootFilesystem in 7632ms
ok 7 Policy failure: unexpected UID = 222 in 7619ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
992c91371c tests: policy-deployment-sc: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

ok 1 Successful sc deployment with auto-generated policy and container image volumes in 14769ms
ok 2 Successful sc with fsGroup/supplementalGroup deployment with auto-generated policy and container image volumes in 8384ms
not ok 3 Successful sc deployment with security context choosing another valid user in 136149ms
ok 4 Successful layered sc deployment with auto-generated policy and container image volumes in 8862ms
ok 5 Policy failure: unexpected GID = 0 for layered securityContext deployment in 7941ms
ok 6 Policy failure: malicious root group added via supplementalGroups deployment in 11612ms

After:

ok 1 Successful sc deployment with auto-generated policy and container image volumes in 15230ms
ok 2 Successful sc with fsGroup/supplementalGroup deployment with auto-generated policy and container image volumes in 9364ms
not ok 3 Successful sc deployment with security context choosing another valid user in 11060ms
ok 4 Successful layered sc deployment with auto-generated policy and container image volumes in 9124ms
ok 5 Policy failure: unexpected GID = 0 for layered securityContext deployment in 7919ms
ok 6 Policy failure: malicious root group added via supplementalGroups deployment in 11666ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
704ee76f1e tests: policy-deployment-sc: reduced redundancy
Call common function instead of copy/paste of three commands.

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Dan Mihai
2cafb10a6a tests: policy-pod: detect create container errors early
During the ${wait_time} for an expected condition, if
CreateContainerRequest was NOT expected to fail: detect possible
CreateContainerRequest failures early and abort the wait.

For example, before this change:

not ok 1 Successful pod with auto-generated policy in 110801ms
not ok 2 Able to read env variables sourced from configmap using envFrom in 94104ms
not ok 3 Successful pod with auto-generated policy and runtimeClassName filter in 95838ms
not ok 4 Successful pod with auto-generated policy and custom layers cache path in 110712ms
ok 5 Policy failure: unexpected container image in 8113ms
ok 6 Policy failure: unexpected privileged security context in 7943ms
ok 7 Policy failure: unexpected terminationMessagePath in 11530ms
ok 8 Policy failure: unexpected hostPath volume mount in 7970ms
ok 9 Policy failure: unexpected config map in 7933ms
not ok 10 Policy failure: unexpected lifecycle.postStart.exec.command in 112677ms
ok 11 RuntimeClassName filter: no policy in 2302ms
not ok 12 ExecProcessRequest tests in 93946ms
not ok 13 Successful pod: runAsUser having the same value as the UID from the container image in 94003ms
ok 14 Policy failure: unexpected UID = 0 in 8016ms
ok 15 Policy failure: unexpected UID = 1234 in 7850ms

After:

not ok 1 Successful pod with auto-generated policy in 12182ms
not ok 2 Able to read env variables sourced from configmap using envFrom in 10121ms
not ok 3 Successful pod with auto-generated policy and runtimeClassName filter in 11738ms
not ok 4 Successful pod with auto-generated policy and custom layers cache path in 26592ms
ok 5 Policy failure: unexpected container image in 7742ms
ok 6 Policy failure: unexpected privileged security context in 7949ms
ok 7 Policy failure: unexpected terminationMessagePath in 7789ms
ok 8 Policy failure: unexpected hostPath volume mount in 7887ms
ok 9 Policy failure: unexpected config map in 7818ms
not ok 10 Policy failure: unexpected lifecycle.postStart.exec.command in 9120ms
ok 11 RuntimeClassName filter: no policy in 2081ms
not ok 12 ExecProcessRequest tests in 9883ms
not ok 13 Successful pod: runAsUser having the same value as the UID from the container image in 9870ms
ok 14 Policy failure: unexpected UID = 0 in 11161ms
ok 15 Policy failure: unexpected UID = 1234 in 7814ms

Signed-off-by: Dan Mihai <dmihai@microsoft.com>
2025-11-03 15:55:55 +00:00
Fabiano Fidêncio
c539a9e90e tests: k8s: parallel: Increase timeout
We've seen a few cases where we fail the test due to timeout and when we
print the pods we just see that they've been created.

With that in mind, let's just increase the timeout a little bit.

Example:
```
not ok 1 Parallel jobs in 6250ms
 (in test file k8s-parallel.bats, line 41)
   `kubectl wait --for=condition=Ready --timeout=$timeout pod -l jobgroup=${job_name}' failed
 No resources found in kata-containers-k8s-tests namespace.
 [bats-exec-test:71] INFO: k8s configured to use runtimeclass
 job.batch/process-item-test1 created
 job.batch/process-item-test2 created
 job.batch/process-item-test3 created
 NAME                 STATUS    COMPLETIONS   DURATION   AGE
 process-item-test1   Running   0/1                      0s
 process-item-test2   Running   0/1                      0s
 process-item-test3   Running   0/1                      0s
 error: no matching resources found
 No resources found in kata-containers-k8s-tests namespace.
 No resources found in kata-containers-k8s-tests namespace.
 DEBUG: system logs of node 'aks-nodepool1-25989463-vmss000000' since test start time (2025-11-01 16:39:03)
 -- No entries --
 job.batch "process-item-test1" deleted
 job.batch "process-item-test2" deleted
 job.batch "process-item-test3" deleted
```

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-01 18:09:37 +01:00
Fabiano Fidêncio
8a5ebd5d16 tests: k8s: run QoS tests on a bigger instance
It's been failing to start quite regularly on the smaller instance.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-11-01 17:54:58 +01:00
Fabiano Fidêncio
c75a46d17f tests: Do not enable NFD on s390x
As we're failing on the uninstall, which seems related to a bug on NFD
itself, but I don't have access to a s390x machine to debug, let's skip
the enablement for now and enable it back once we've experimented it
better on s390x.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
67e38e0f92 tests: Do not enable NFD on cbl-mariner
As we're failing to install NFD on CBL Mariner, let's skip the
enablement there, and enable it once we've experimented it better there.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
1bc873397b tests: Use NFD as part of the tests
As we have the ability to deploy NFD as a sub-chart of our chart, let's
make sure we test it during our CI.

We had to increase the timeout values, where we had timeouts set, to
deploy / undeploy kata, as now NFD is also deployed / undeployed.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-31 16:30:13 +01:00
Fabiano Fidêncio
e30e2b5f45 tests: k8s: Remove tests running on GitHub provided runner
We have 2 tests running on GitHub provided runners:
* devmapper
* CRI-O

- devmapper situation

For devmapper, we're currently testing devmapper with s390x as part of
one of its jobs.

More than that, this test has been failing here due to a lack of space
in the machine for quite some time, and no-action was taken to bring it
back either via GARM or some other way.

With that said, let's rely on the s390x CI to test devmapper and avoid
one extra failure on our CI by removing this one.

- cri-o situation

CRI-O is being tested with a fixed version of kubernetes that's already
reached its EOL, and a CRI-O version that matches that k8s version.

There has been attempts to raise issues, and also to provide a PR that
does at least part of the work ... leaving the debugging part for the
maintainers of the CI. However, there was no action on those from the
maintainers.

Signed-off-by: Fabiano Fidêncio <ffidencio@nvidia.com>
2025-10-30 11:46:59 +01:00
Manuel Huber
8dc78057d6 ci: Refactor NVIDIA NIM test
Change NIM bats file logic to allow skipping test cases which
require multiple GPUs. This can be helpful for test clusters where
there is only one node with a single GPU, or for local test
environments with a single-node cluster with a single GPU.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-10-28 19:12:16 +01:00
Manuel Huber
be32b77baf ci: Add NVIDIA CUDA vectoradd test
This change adds a CUDA vectoradd test case and makes enabling NVRC
tracing optional and idempotent.

Signed-off-by: Manuel Huber <manuelh@nvidia.com>
2025-10-28 19:12:16 +01:00