ci.ocp: Add steps to reproduce/bisect CI runs

In case the upstream CI fails, it's useful to pinpoint the PR that
caused the regression. Currently openshift-ci does not allow doing that
from their setup, but we can mimic the setup on our infrastructure and
use the available kata-deploy-ci images to find the first failing one.
To help with that, add a few helper scripts and a howto.

Fixes: #9228

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
Lukáš Doktor 2024-04-04 09:55:42 +02:00
parent a556ad7e01
commit f994f79078
7 changed files with 224 additions and 1 deletion


@@ -8,3 +8,142 @@ There are 2 pipelines, history and logs can be accessed here:
* [main - currently supported OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-e2e-tests)
* [next - currently under development OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-next-e2e-tests)
Running openshift-tests on OCP with kata-containers manually
============================================================
To run openshift-tests (or other suites) with kata-containers, one can use
the kata-webhook. To deploy everything, you can mimic the CI pipeline by running:
```bash
#!/bin/bash -e
# Set up your kubectl and check the cluster is accessible by
kubectl get nodes
# Deploy kata (set KATA_DEPLOY_IMAGE to override the default kata-deploy-ci:latest image)
./test.sh
# Deploy the webhook
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
```
This should ensure kata-containers as well as the kata-webhook are installed and
working. Before running the openshift-tests it's (currently) recommended to
relax some security restrictions by running:
```bash
#!/bin/bash -e
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
```
Now you should be ready to run the openshift-tests. Our CI only uses a subset
of tests; to get the current ``TEST_SKIPS`` see
[the pipeline config](https://github.com/openshift/release/tree/master/ci-operator/config/kata-containers/kata-containers).
The following steps require the [openshift-tests](https://github.com/openshift/origin)
repository to be cloned and built in the current directory:
```bash
#!/bin/bash -e
# Define tests to be skipped (see the pipeline config for the current version)
TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-node\] Variable Expansion should allow substituting values in a volume subpath\|\[k8s.io\] Probing container should be restarted with a docker exec liveness probe with timeout\|\[sig-node\] Pods Extended Pod Container lifecycle evicted pods should be terminal\|\[sig-node\] PodOSRejection \[NodeConformance\] Kubelet should reject pod when the node OS doesn't match pod's OS\|\[sig-network\].*for evicted pods\|\[sig-network\].*HAProxy router should override the route\|\[sig-network\].*HAProxy router should serve a route\|\[sig-network\].*HAProxy router should serve the correct\|\[sig-network\].*HAProxy router should run\|\[sig-network\].*when FIPS.*the HAProxy router\|\[sig-network\].*bond\|\[sig-network\].*all sysctl on whitelist\|\[sig-network\].*sysctls should not affect\|\[sig-network\] pods should successfully create sandboxes by adding pod to network"
# Get the list of tests to be executed; TEST_PROVIDER and TEST_SUITE need to be
# set beforehand (e.g. "azure" and "openshift/conformance/parallel")
TESTS="$(./openshift-tests run --dry-run --provider "${TEST_PROVIDER}" "${TEST_SUITE}")"
# Store the list of tests in the /tmp/tsts file
echo "${TESTS}" | grep -v "$TEST_SKIPS" > /tmp/tsts
# Remove previously-existing temporary files as well as previous results
OUT=RESULTS/tmp
rm -Rf /tmp/*test* /tmp/e2e-* "$OUT"
mkdir -p "$OUT"
# Run the tests, ignoring the monitor health checks
./openshift-tests run --provider "${TEST_PROVIDER}" -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive --run '^\[sig-node\].*|^\[sig-network\]'
```
> [!NOTE]
> We are ignoring the cluster stability checks because our public cloud is not
> that stable and running with VMs instead of containers results in minor
> stability issues. Some of the old monitor stability tests do not reflect the
> ``--cluster-stability`` setting; simply ignore these. If you get a message
> like ``invariant was violated`` or ``error: failed due to a MonitorTest
> failure``, it's usually an indication that only those kinds of tests failed
> while the real tests passed. See
> [wrapped-openshift-tests.sh](https://github.com/openshift/release/blob/master/ci-operator/config/kata-containers/kata-containers/wrapped-openshift-tests.sh)
> for details on how our pipeline deals with that.
> [!TIP]
> To compare multiple results locally one can use the
> [junit2html](https://github.com/inorton/junit2html) tool.
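For example, a single junit XML file from the results directory can be turned
into a stand-alone HTML report. The sketch below assumes junit2html is
installable via pip and that you substitute the actual XML file name produced
by your run:
```bash
# junit2html is a Python tool; pip installation is assumed here
pip install junit2html
# Convert one junit XML result (replace the file name with one from RESULTS/tmp)
junit2html RESULTS/tmp/junit_e2e_example.xml report.html
```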
Best-effort kata-containers cleanup
===================================
If you need to clean up the cluster after testing, you can use the
``cleanup.sh`` script from the current directory. It tries to delete all
resources created by ``test.sh`` as well as ``cluster/deploy_webhook.sh``,
ignoring all failures. The primary purpose of this script is to allow a
soft cleanup after deployment so that different versions can be tested without
re-provisioning everything.
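A typical soft-cleanup round-trip might look like this (the image tag is only a
placeholder):
```bash
# Best-effort removal of kata-containers and the kata-webhook
./cleanup.sh
# Deploy a different kata-deploy-ci image and re-install the webhook
KATA_DEPLOY_IMAGE=quay.io/kata-containers/kata-deploy-ci:kata-containers-<tag>-amd64 ./test.sh
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
```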
> [!WARNING]
> **Do not rely on this script in production, return codes are not checked!**
Bisecting e2e tests failures
============================
Let's say the OCP pipeline passed running with
``quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
but failed running with
``quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``
and you'd like to know which PR caused the regression. You can either test
all of the ~60 tags in between, or use the [bisecter](https://github.com/ldoktor/bisecter)
tool to minimize the number of steps needed.
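One way of getting bisecter (assuming the project is pip-installable; consult
its README for the recommended installation method) is:
```bash
# Install bisecter directly from the upstream git repository (assumption, see above)
pip install git+https://github.com/ldoktor/bisecter
```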
Before running the bisection you need a reproducer script. A sample one called
``sample-test-reproducer.sh`` is provided in this directory, but you might
want to copy and modify it; pay special attention to:
* ``OCP_DIR`` - directory where your openshift/release is located (can be exported)
* ``E2E_TEST`` - openshift-test(s) to be executed (can be exported)
* the behaviour of ``SETUP`` (returning 125 skips the current image tag, returning
  >=128 interrupts the execution, everything else reports the tag as a failure)
* what should be executed (perhaps running the setup is enough for you, or
  you might want to look for specific failures...)
* using ``timeout`` to interrupt the execution in case you know things should be
  faster (see the example below)
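For instance, a customized invocation with an explicit openshift/release
checkout and a time limit could look like this (the path and the two-hour limit
are purely illustrative):
```bash
# Point the reproducer at your openshift/release checkout (placeholder path)
export OCP_DIR=/path/to/your/openshift/release/
# Bound a single attempt to two hours; on timeout the tag is reported as a failure
timeout 7200 ./sample-test-reproducer.sh \
    quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64
```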
Executing that script with the GOOD commit should pass
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
and fail when executed with the BAD commit
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``.
To get the list of all tags in between those two PRs you can use the
``bisect-range.sh`` script:
```bash
./bisect-range.sh d7afd31fd40e37a675b25c53618904ab57e74ccd 9f512c016e75599a4a921bd84ea47559fe610057
```
> [!NOTE]
> The tagged images are only built per PR, not for individual commits. See
> [kata-deploy-ci](https://quay.io/kata-containers/kata-deploy-ci) for the
> available images.
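To see what is currently available, you can run the same query
``bisect-range.sh`` uses by hand:
```bash
# List all amd64 kata-deploy-ci tags currently published on quay.io
skopeo list-tags docker://quay.io/kata-containers/kata-deploy-ci | jq -r '.Tags[]' | grep -- '-amd64$'
```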
To find out which PR caused the regression, you can either manually try the
individual images or simply execute:
```bash
bisecter start "$(./bisect-range.sh d7afd31fd40 9f512c016)"
OCP_DIR=/path/to/openshift/release bisecter run ./sample-test-reproducer.sh
```
> [!NOTE]
> If you use ``KATA_WITH_SYSTEM_QEMU=yes`` you might want to deploy once with
> it and skip it for the cleanup. That way you can (in most cases) test all
> images with a single MCP update instead of one MCP update per image.
> [!TIP]
> You can check the bisection progress during/after execution by running
> ``bisecter log`` from the current directory. Before starting a new
> bisection you need to execute ``bisecter reset``.

ci/openshift-ci/bisect-range.sh Executable file

@@ -0,0 +1,24 @@
#!/bin/bash
# Copyright (c) 2024 Red Hat, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
if [ "$#" -gt 2 ] || [ "$#" -lt 1 ] ; then
echo "Usage: $0 GOOD [BAD]"
echo "Prints list of available kata-deploy-ci tags between GOOD and BAD commits (by default BAD is the latest available tag)"
exit 255
fi
GOOD="$1"
[ -n "$2" ] && BAD="$2"
ARCH=amd64
REPO="quay.io/kata-containers/kata-deploy-ci"
TAGS=$(skopeo list-tags "docker://$REPO")
# Only amd64
TAGS=$(echo "$TAGS" | jq '.Tags' | jq "map(select(endswith(\"$ARCH\")))" | jq -r '.[]')
# Tags since $GOOD
TAGS=$(echo "$TAGS" | sed -n -e "/$GOOD/,$$p")
# Tags up to $BAD
[ -n "$BAD" ] && TAGS=$(echo "$TAGS" | sed "/$BAD/q")
# Comma separated tags with repo
echo "$TAGS" | sed -e "s@^@$REPO:@" | paste -s -d, -

ci/openshift-ci/sample-test-reproducer.sh Executable file

@@ -0,0 +1,50 @@
#!/bin/bash
# Copyright (c) 2024 Red Hat, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
# A sample script to deploy and configure an OCP cluster with kata-containers,
# run E2E_TEST and soft-cleanup afterwards; primarily created for use
# with https://github.com/ldoktor/bisecter
[ "$#" -ne 1 ] && echo "Provide image as the first and only argument" && exit 255
export KATA_DEPLOY_IMAGE="$1"
OCP_DIR="${OCP_DIR:-/path/to/your/openshift/release/}"
E2E_TEST="${E2E_TEST:-'"[sig-node] Container Runtime blackbox test on terminated container should report termination message as empty when pod succeeds and TerminationMessagePolicy FallbackToLogsOnError is set [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"'}"
KATA_CI_DIR="${KATA_CI_DIR:-$(pwd)}"
export KATA_RUNTIME="${KATA_RUNTIME:-kata-qemu}"
## SETUP
# Deploy kata
SETUP=0
pushd "$KATA_CI_DIR" || { echo "Failed to cd to '$KATA_CI_DIR'"; exit 255; }
./test.sh || SETUP=125
cluster/deploy_webhook.sh || SETUP=125
if [ "$SETUP" != 0 ]; then
    ./cleanup.sh
    exit "$SETUP"
fi
popd || true
# Disable security
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
## TEST EXECUTION
# Run the testing
pushd "$OCP_DIR" || { echo "Failed to cd to '$OCP_DIR'"; exit 255; }
echo "$E2E_TEST" > /tmp/tsts
# Remove previously-existing temporary files as well as previous results
OUT=RESULTS/tmp
rm -Rf /tmp/*test* /tmp/e2e-* "$OUT"
mkdir -p "$OUT"
# Run the tests ignoring the monitor health checks
./openshift-tests run --provider azure -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive
RET=$?
popd || true
## CLEANUP
./cleanup.sh
exit "$RET"


@@ -11,6 +11,7 @@ AUFS # Another Union FS
AWS/AB
BDF/AB
CFS/AB
ci/AB
CLI/AB
CNI/AB
CNM/AB
@@ -33,6 +34,7 @@ gRPC/AB
GSC/AB
GVT/AB
IaaS/B # Infrastructure as a Service
io/B
IOMMU/AB
IoT/AB # Internet of Things
IOV/AB


@@ -67,6 +67,7 @@ metadata
microcontroller/AB
miniOS
mmap/AB
MonitorTest/A
nack/AB
namespace/ABCD
netlink


@@ -6,6 +6,7 @@
Ansible/B
AppArmor/B
bisecter/B
blogbench/B
BusyBox/B
Cassandra/B
@@ -62,6 +63,7 @@ Netlify/B
Nginx/B
OpenCensus/B
OpenPGP/B
openshift/B # lower-case used for some sub-projects
OpenShift/B
OpenSSL/B
OpenStack/B


@@ -1,4 +1,4 @@
386
387
ACPI/AB
ACS/AB
API/AB
@@ -90,6 +90,7 @@ MITRE/B
MacOS/B
Mellanox/B
Minikube/B
MonitorTest/A
NEMU/AB
NIC/AB
NVDIMM/AB
@@ -197,6 +198,7 @@ backend
backport/ACD
backtick/AB
backtrace
bisecter/B
blogbench/B
bootloader/AB
ccloudvm/B
@@ -204,6 +206,7 @@ centric/B
cgroup/AB
checkbox/A
chipset/AB
ci/AB
cnn/B
codebase
codecov/B
@@ -255,6 +258,7 @@ init/AB
initramfs/AB
initrd/AB
intel
io/B
ioctl/A
iodepth/A
ioengine/A
@@ -295,6 +299,7 @@ netns/AB
nvidia/A
onwards
openSUSE/B
openshift/B
osbuilder/B
packagecloud/B
parallelize/AC