The block volume test has failed on 10/10 nightlies
and all the PRs I've seen, so skip it until it can be assessed.
See #10873
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
the latest containerd had an issue for its e2e test, thus we should do
the following fix to workaround this issue. For much info about this issue,
please see:
https://github.com/containerd/containerd/pull/11240
Once this pr was merged and release new version, we can remove
this workaround.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
A test case is added based on the intergrated cri-containerd case.
The difference between cri containerd integrated testcase and sandbox
api testcase is the "sandboxer" setting in the sandbox runtime handler.
If the "sandboxer" is set to "" or "podsandbox", then containerd will
use the legacy shimv2 api, and if the "sandboxer" is set to "shim", then
it will use the sandbox api to launch the pod.
In addition, add a containerd v2.0.0 version. Because containerd officially
supports the sandbox api from version 2.0.0.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
k8s-policy-job is modeled after the older k8s-job, and it appears
that both of them fail occasionally on coco-dev.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Preparing to install nydus permanently on the AMD node,
so disabling deploy and delete command for SNP and SEV.
Signed-off-by: Arvind Kumar <arvinkum@amd.com>
As the devices controller works in a different way in cgroupsv2, the
"/sys/fs/cgroup/devices/devices.list" file simply doesn't exist.
For now, let's skip the test till the test maintainer decides to
re-enable it for cgroupsv2.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
The changes done are:
* cpu/cpu.shares was replaced by cpu.weight
* The weight, according to our reference[0], is calculated by:
weight = (1 + ((request - 2) * 9999) / 262142)
* cpu/cpu.cfs_quota_us & cpu/cpu.cfs_period_us were replaced by cpu.max,
where quota and period are written together (in this order)
[0]: https://github.com/containers/crun/blob/main/crun.1.md#cgroup-v2
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This reverts commit 091ad2a1b2, in order
to ensure tests would be running with cgroupsv2 on the guest.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
I've noticed the following error when running the tests with SEV:
```
2025-01-21T17:10:28.7999896Z # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2025-01-21T17:10:28.8000614Z # @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
2025-01-21T17:10:28.8001217Z # @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
2025-01-21T17:10:28.8001857Z # IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
2025-01-21T17:10:28.8003009Z # Someone could be eavesdropping on you right now (man-in-the-middle attack)!
2025-01-21T17:10:28.8003348Z # It is also possible that a host key has just been changed.
2025-01-21T17:10:28.8004422Z # The fingerprint for the ED25519 key sent by the remote host is
2025-01-21T17:10:28.8005019Z # SHA256:x7wF8zI+LLyiwphzmUhqY12lrGY4gs5qNCD81f1Cn1E.
2025-01-21T17:10:28.8005459Z # Please contact your system administrator.
2025-01-21T17:10:28.8006734Z # Add correct host key in /home/kata/.ssh/known_hosts to get rid of this message.
2025-01-21T17:10:28.8007031Z # Offending ED25519 key in /home/kata/.ssh/known_hosts:178
2025-01-21T17:10:28.8007254Z # remove with:
2025-01-21T17:10:28.8008172Z # ssh-keygen -f "/home/kata/.ssh/known_hosts" -R "10.244.0.71"
```
And this was causing a failure to ssh into the confidential pod.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Relying on dmesg is really not ideal, as we may lose important info,
mainly those which happen very early in the boot, depending on the size
of kernel ring buffer.
So, for this specific test, let's increase the kernel ring buffer, by
default, to 4M.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
While working on #10559, I realized that some parts of the codebase use
$GH_PR_NUMBER, while other parts use $PR_NUMBER.
Notably, in that PR, since I used $GH_PR_NUMBER for CoCo non-TEE tests
without realizing that TEE tests use $PR_NUMBER, the tests on that PR
fail on TEEs:
https://github.com/kata-containers/kata-containers/actions/runs/12818127344/job/35744760351?pr=10559#step:10:45
...
44 error: error parsing STDIN: error converting YAML to JSON: yaml: line 90: mapping values are not allowed in this context
...
135 image: ghcr.io/kata-containers/csi-kata-directvolume:
...
So let's unify on $GH_PR_NUMBER so that this issue doesn't repro in the
future: I replaced all instances of PR_NUMBER with GH_PR_NUMBER.
Note that since some test scripts also refer to that variable, the CI
for this PR will fail (would have also happened with the converse
substitution), hence I'm not adding the ok-to-test label and we should
force-merge this after review.
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
On s390x, some tests for trusted storage occasionally failed due to:
```bash
etcdserver: request timed out
```
or
```bash
Internal error occurred: resource quota evaluation timed out
```
These timeouts were not observed previously on k3s but occur
sporadically on kubeadm. Importantly, they appear to be temporary
and transient, which means they can be ignored in most cases.
To address this, we introduced a new wrapper function, `retry_kubectl_apply()`,
for `kubectl create`. This function retries applying a given manifest up to 5
times if it fails due to a timeout. However, it will still catch and handle
any other errors during pod creation.
Fixes: #10651
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Since we bumped to the 6.12.x LTS kernel, we've also adjusted the
aggressivity of the OOM test, which may be enough to allow us to
re-enable it for mariner.
Fixes: #8821
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Since from https://github.com/containerd/containerd/pull/9096
containerd removed cri-containerd-*.tar.gz release bundles,
thus we'd better change the tarball name to "containerd".
BTW, the containerd tarball containerd the follow files:
bin/
bin/containerd-shim
bin/ctr
bin/containerd-shim-runc-v1
bin/containerd-stress
bin/containerd
bin/containerd-shim-runc-v2
thus we should untar containerd into /usr/local directory instead of "/"
to keep align with the cri-containerd.
In addition, there's no containerd.service file,runc binary and cni-plugin
included, thus we should add a specific containerd.service file and
install install the runc binary and cni-pluginspecifically.
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
An issue has been created for this, and we should fix the issue before
the next release. However, for now, let's unblock the kernel bump and
have the test skipped.
Reference: https://github.com/kata-containers/kata-containers/issues/10706
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Fix the logic to make the test skipped on qemu-coco-dev,
rather than the opposite and update the syntax to make it
clearer as it incorrectly got written and reviewed by three
different people in it's prior form.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
- Remove default_namespace from settings
- Ensure container namespaces in a pod match each other in case no namespace is specified in the YAML
Signed-off-by: Saul Paredes <saulparedes@microsoft.com>
The tests is unstable on this platform, so skip it for now to prevent
the regular known failures covering up other issues. See #10616
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We're doing this as, at Intel, we have two different kind of machines we
can plug into our CI. Without going much into details, only one of
those two kinds of machines will work for the attestation tests we
perform with ITA, thus in order to speed up the CI and improve test
coverage (OS wise), we're going to run different tests in different
machines.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Trustee's deployment must set the correct https_proxy as env var on the
container that will talk to the ITA / ITTS server, otherwise the kbs
service won't be able to start, causing then issues in our CI.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Signed-off-by: Krzysztof Sandowicz <krzysztof.sandowicz@intel.com>
Currently the error we are checking for is
`CreateContainerRequest timed out`, but this message
doesn't always seem to be printed to our pod log.
Try using a more general message that should be present
more reliably.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We have an error with service name resolution with this test when using crio.
This error could not be reproduced outside of the CI for now.
Skipping it to keep the CI job running until we find a solution.
See: #10414
Signed-off-by: Julien Ropé <jrope@redhat.com>
By the moment we're testing it also with qemu-coco-dev, it becomes
easier for a developer without access to TEE to also test it locally.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
This reverts commit 973b8a1d8f.
As @danmihai1 points out https://github.com/bats-core/bats-core/issues/364
states that using traps in bats is error prone, so this could be the cause
of the confidential test instability we've been seeing, like it was
in the static checks, so let's try and revert this.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
At @danmihai1's suggestion add a die message in case
the call to setup_common fails, so we can see if in the test
output.
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
We've seen some issues with tests not being run in
some of the Coco CI jobs (Issue #10451) and in the
envrionments that are more stable we noticed that
they had a newer version of bats installed.
Try updating the version to 1.10+ and print out
the version for debug purposes
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
When creating a k8s cluster via kubeadm, the devmapper setup
for containerd requires a different configuration.
This commit introduces a new `kubeadm` option for the KUBERNETES
variable and adjusts the path to the containerd config file for
devmapper setup.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
As we're now building everything needed to test TDX with measured rootfs
support, let's bring this test back in (for TDX only, at least for now).
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
inotify-configmap-pod.yaml is using: "inotifywait --timeout 120",
so wait for up to 180 seconds for the pod termination to be
reported.
Hopefully, some of the sporadic errors from #10413 will be avoided
this way:
not ok 1 configmap update works, and preserves symlinks
waitForProcess "${wait_time}" "$sleep_time" "${command}" failed
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
While Kubernetes defines `binaryData` as `[]byte`,
when defined in a YAML file the raw bytes are
base64 encoded. Therefore, we need to read the YAML
value as `String` and not as `Vec<u8>`.
Fixes: #10410
Signed-off-by: Leonard Cohnen <lc@edgeless.systems>
When dealing with a specific release, it was easier to just do some
adjustments on the image that has to be used for ITA without actually
adding a new entry in the versions.yaml.
However, it's been proven to be more complicated than that when it comes
to dealing with staged images, and we better explicitly add (and
update) those versions altogether to avoid CI issues.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
The reason we're doing this is because mariner image uses, by default,
cgroups default-hierarchy as `unified` (aka, cgroupsv2).
In order to keep the same initrd behaviour for mariner, let's enforce
that `SYSTEMD_CGROUP_ENABLE_LEGACY_FORCE=1
systemd.legacy_systemd_cgroup_controller=yes
systemd.unified_cgroup_hierarchy=0` is passed to the kernel cmdline, at
least for now.
Other tests that are setting `kernel_params` are not running on mariner,
then we're safe taking this path as it's done as part of this PR.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
As an image has been added for mariner as part of the commit 63c1f81c2,
let's start using it in the CI, instead of using the initrd.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
In https://github.com/confidential-containers/trustee/pull/521
the overlays logic was modified to add non-SE
s390x support and simplify non-ibm-se platforms.
We need to update the logic in `kbs_k8s_deploy`
to match and can remove the dummying of `IBM_SE_CREDS_DIR`
for non-SE now
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
The behavior of Kata CI doesn't change.
For local testing using kubernetes/gha-run.sh and AUTO_GENERATE_POLICY=yes:
1. Before these changes users were forced to use:
- SEV, SNP, or TDX guests, or
- KATA_HOST_OS=cbl-mariner
2. After these changes users can also use other platforms that are
configured with "shared_fs = virtio-fs" - e.g.,
- KATA_HOST_OS=ubuntu + KATA_HYPERVISOR=qemu
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
The behavior of Kata CI doesn't change.
For local testing using kubernetes/gha-run.sh:
1. Before these changes:
- AUTO_GENERATE_POLICY=yes was always used by the users of SEV, SNP,
TDX, or KATA_HOST_OS=cbl-mariner.
2. After these changes:
- Users of SEV, SNP, TDX, or KATA_HOST_OS=cbl-mariner must specify
AUTO_GENERATE_POLICY=yes if they want to auto-generate policy.
- These users have the option to test just using hard-coded policies
(e.g., using the default policy built into the Guest rootfs) by
using AUTO_GENERATE_POLICY=no. AUTO_GENERATE_POLICY=no is the default
value of this env variable.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This PR adds the trap statement in the confidential kbs script
to clean up temporary files and ensure we are leaving them.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
The tests is disabled for qemu-coco-dev / qemu-tdx, but it doesn't seen
to actually be failing on those. Plus, it's passing on SEV / SNP, which
means that we most likely missed re-enabling this one in the past.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Currently, `qemu-runtime-rs` does not support `virtio-scsi`,
which causes the `k8s-block-volume.bats` test to fail.
We should skip this test until `virtio-scsi` is supported by the runtime.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The nginx container seems to error out when using UID=123.
Depending on the timing between container initialization and "kubectl
wait", the test might have gotten lucky and found the pod briefly in
Ready state before nginx errored out. But on some of the nodes, the pod
never got reported as Ready.
Also, don't block in "kubectl wait --for=condition=Ready" when wrapping
that command in a waitForProcess call, because waitForProcess is
designed for short-lived commands.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This imports the k8s-block-volume test from the tests repo and modifies
it slightly to set up the host volume on the AKS host.
This is a follow-up to #7132.
Fixes: #7164
Signed-off-by: Salvador Fuentes <salvador.fuentes@intel.com>
Signed-off-by: Aurélien Bombo <abombo@microsoft.com>
The ita kustomization for Trustee, as well as previously used one
(DCAP), doesn't have a $(uname -m) directory after the deployment
directory name.
Let's follow the same logic used for the deploy-kbs script and clean
those up accordingly.
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
Intel Tiber Trust Services (formerly known as Intel Trust Authority) is
Intel's own attestation service, and we want to take advantage of the
TDX CI in order to ensure ITTS works as expected.
In order to do so, let's replace the former method used (DCAP) to use
ITTS instead.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano@fidencio.org>
There are many similar or duplicated code patterns in `teardown()`.
This commit consolidates them into a new function, `teardown_common()`,
which is now called within `teardown()`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
The current `exec_host()` accepts a given node name and
creates a node debugger pod, even if the name is invalid.
This could result in the creation of an unnecessary pending
pod (since we are using nodeAffinity; if the given name
does not match any actual node names, the pod won’t be scheduled),
which wastes resources.
This commit introduces validation for the node name to
prevent this situation.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
It was observed that the custom node debugger pod is not
cleaned up when a test times out.
This commit ensures the pod is cleaned up by triggering
the cleanup on EXIT, preventing any debugger pods from
being left behind.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
With #10232 merged, we now have a persistent node debugger pod throughout the test.
As a result, there’s no need to spawn another debugger pod using `kubectl debug`,
which could lead to false negatives due to premature pod termination, as reported
in #10081.
This commit removes the `print_node_journal()` call that uses `kubectl debug` and
instead uses `exec_host()` to capture the host journal. The `exec_host()` function
is relocated to `tests/integration/kubernetes/lib.sh` to prevent cyclical dependencies
between `tests_common.sh` and `lib.sh`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
`assert_pod_fail()` currently calls `k8s_create_pod()` to ensure that a pod
does not become ready within the default 120s. However, this delays the test's
completion even if an error message is detected earlier in the journal.
This commit removes the use of `k8s_create_pod()` and modifies `assert_pod_fail()`
to fail as soon as the pod enters a failed state.
All failing pods end up in one of the following states:
- CrashLoopBackOff
- ImagePullBackOff
The function now polls the pod's state every 5 seconds to check for these conditions.
If the pod enters a failed state, the function immediately returns 0. If the pod
does not reach a failed state within 120 seconds, it returns 1.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Now that the issue with handling loop devices has been resolved,
this commit re-enables the guest-pull-image tests for `qemu-coco-dev`.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Timeouts occur (e.g. `create_container_timeout` and `wait_time`)
when using qemu-coco-dev.
This commit increases these timeouts for the trusted image storage
test cases
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
If the host running the tests is different from the host where the cluster is running,
the *_loop_device() functions do not work as expected because the device is created
on the test host, while the cluster expects the device to be local.
This commit ensures that all commands for the relevant functions are executed via exec_host()
so that a device should be handled on a cluster node.
Additionally, it modifies exec_host() to return the exit code of the last executed command
because the existing logic with `kubectl debug` sometimes includes unexpected characters
that are difficult to handle. `kubectl exec` appears to properly return the exit code for
a given command to it.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
Creating and deleting a node debugger pod for every `exec_host()`
call is inefficient.
This commit changes the test suite to create and delete the pod
only once, globally.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
This commit addresses an issue with handling loop devices
via a node debugger due to restricted privileges.
It runs a pod with full privileges, allowing it to mount
the host root to `/host`, similar to the node debugger.
This change enables us to run tests for trusted image storage
using the `qemu-coco-dev` runtime class.
Fixes: #10133
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
We've added s390x test container image, so add support
to use them based on the arch the test is running on
Fixes: #10302
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
fixuop
This commit brings some public parts of image pulling test series like
encrypted image pulling, pulling images from authenticated registry and
image verification. This would help to reduce the cost of maintainance.
Co-authored-by: stevenhorsman <steven@uk.ibm.com>
Signed-off-by: Xynnn007 <xynnn@linux.alibaba.com>
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Close#8120
**Case 1**
Create a pod from an unsigned image, on an insecureAcceptAnything
registry works.
Image: quay.io/prometheus/busybox:latest
Policy rule:
```
"default": [
{
"type": "insecureAcceptAnything"
}
]
```
**Case 2**
Create a pod from an unsigned image, on a 'restricted registry' is
rejected.
Image: ghcr.io/confidential-containers/test-container-image-rs:unsigned
Policy rule:
```
"quay.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 3**
Create a pod from a signed image, on a 'restricted registry' is
successful.
Image: ghcr.io/confidential-containers/test-container-image-rs:cosign-signed
Policy rule:
```
"ghcr.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 4**
Create a pod from a signed image, on a 'restricted registry', but with
the wrong key is rejected
Image:
ghcr.io/confidential-containers/test-container-image-rs:cosign-signed-key2
Policy:
```
"ghcr.io/confidential-containers/test-container-image-rs": [
{
"type": "sigstoreSigned",
"keyPath": "kbs:///default/cosign-public-key/test"
}
]
```
**Case 5**
Create a pod from an unsigned image, on a 'restricted registry' works
if enable_signature_verfication is false
Image: ghcr.io/kata-containers/confidential-containers:unsigned
image security enable: false
Signed-off-by: stevenhorsman <steven@uk.ibm.com>
Change pod runAsUser value of a Replication Controller after generating
the RC's policy, and verify that the RC pods get rejected due to this
change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change pod runAsUser value of a Job after generating the Job's policy,
and verify that the Job gets rejected due to this change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Change pod runAsUser value of a Deployment after generating the
Deployment's policy, and verify that the Deployment fails due to
this change.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Poll/wait for pod termination instead of sleeping 2 minutes. This
change typically saves ~90 seconds in my test cluster.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Kata-deploy often fails due to a transiently unreachable k8s cluster
for the qemu-coco-dev test on s390x.
(e.g. https://github.com/kata-containers/kata-containers/actions/runs/10831142906/job/30058527098?pr=10009)
This commit introduces a retry mechanism to mitigate these failures by
retrying the command two more times with a 10-second interval as a workaround.
Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>
If the CI platform being tested doesn't support yet the prometheus
container image:
- Use busybox instead of prometheus.
- Skip the test cases that depend on the prometheus image.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
This PR fixes the indentation in the cri containerd tests as we
have in several places a misalignment in the script.
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Remove the workaround for #9928, now that genpolicy is able to
convert user names from container images into the corresponding
UIDs from these images.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>
Disabling the UID Policy rule was a workaround for #9928. Re-enable
that rule here and add a new test/CI temporary workaround for this
issue. This new test workaround will be removed after fixing #9928.
Signed-off-by: Dan Mihai <dmihai@microsoft.com>