ci.ocp: Add steps to reproduce/bisect CI runs

In case the upstream CI fails it's useful to pinpoint the PR that caused the regression. Currently openshift-ci does not allow doing that from their setup, but we can mimic the setup on our infrastructure and use the available kata-deploy-ci images to find the first failing one. To help with that, add a few helper scripts and a howto.

Fixes: #9228

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
2025-07-12 22:58:58 +00:00 · 2024-04-04 09:55:42 +02:00 · 2024-04-04 09:55:42 +02:00 · f994f79078
commit f994f79078
parent a556ad7e01
7 changed files with 224 additions and 1 deletions
--- a/ci/openshift-ci/README.md
+++ b/ci/openshift-ci/README.md
@ -8,3 +8,142 @@ There are 2 pipelines, history and logs can be accessed here:
 * [main - currently supported OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-e2e-tests)
 * [next - currently under development OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-next-e2e-tests)
 Running openshift-tests on OCP with kata-containers manually
 ============================================================
 To run openshift-tests (or other suites) with kata-containers one can use
 the kata-webhook. To deploy everything you can mimic the CI pipeline by:
 ```bash
 #!/bin/bash -e
 # Setup your kubectl and check it's accessible by
 kubectl get nodes
 # Deploy kata (set KATA_DEPLOY_IMAGE to override the default kata-deploy-ci:latest image)
 ./test.sh
 # Deploy the webhook
 KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
 ```
 This should ensure kata-containers as well as kata-webhook are installed and
 working. Before running the openshift-tests it's (currently) recommended to
 ignore some security features by:
 ```bash
 #!/bin/bash -e
 oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
 oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
 oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
 ```
 Now you should be ready to run the openshift-tests. Our CI only uses a subset
 of tests; to get the current ``TEST_SKIPS`` see
 [the pipeline config](https://github.com/openshift/release/tree/master/ci-operator/config/kata-containers/kata-containers).
 The following steps require the [openshift tests](https://github.com/openshift/origin)
 to be cloned and built in the current directory:
 ```bash
 #!/bin/bash -e
 # Define tests to be skipped (see the pipeline config for the current version)
 TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-node\] Variable Expansion should allow substituting values in a volume subpath\|\[k8s.io\] Probing container should be restarted with a docker exec liveness probe with timeout\|\[sig-node\] Pods Extended Pod Container lifecycle evicted pods should be terminal\|\[sig-node\] PodOSRejection \[NodeConformance\] Kubelet should reject pod when the node OS doesn't match pod's OS\|\[sig-network\].*for evicted pods\|\[sig-network\].*HAProxy router should override the route\|\[sig-network\].*HAProxy router should serve a route\|\[sig-network\].*HAProxy router should serve the correct\|\[sig-network\].*HAProxy router should run\|\[sig-network\].*when FIPS.*the HAProxy router\|\[sig-network\].*bond\|\[sig-network\].*all sysctl on whitelist\|\[sig-network\].*sysctls should not affect\|\[sig-network\] pods should successfully create sandboxes by adding pod to network"
 # Get the list of tests to be executed (TEST_PROVIDER and TEST_SUITE must be set beforehand)
 TESTS="$(./openshift-tests run --dry-run --provider "${TEST_PROVIDER}" "${TEST_SUITE}")"
 # Store the list of tests in /tmp/tsts file
 echo "${TESTS}" | grep -v "$TEST_SKIPS" > /tmp/tsts
 # Remove leftover temporary files as well as previous results
 OUT=RESULTS/tmp
 rm -Rf /tmp/*test* /tmp/e2e-*
 rm -Rf "$OUT"
 mkdir -p "$OUT"
 # Run the tests ignoring the monitor health checks
 ./openshift-tests run --provider azure -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive --run '^\[sig-node\].*|^\[sig-network\]'
 ```
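As a minimal sketch of how the skip list is applied, the snippet below filters a tiny hand-written test list with the same ``grep -v`` approach. The three test names and the shortened ``TEST_SKIPS`` value are illustrative only, and ``filter_tests`` is a hypothetical helper, not part of the pipeline. The ``\|`` alternation is assumed to be supported by your ``grep`` (it is in GNU grep).

```bash
#!/bin/bash
# Illustrative stand-in for the output of `openshift-tests run --dry-run`
TESTS='"[sig-node] Pods should be submitted and removed"
"[sig-network] Feature:Router HAProxy router should run"
"[sig-node] Security Context should support seccomp runtime/default"'

# Shortened skip list using the same "\|"-separated format as above
TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-network\].*HAProxy router should run"

# Hypothetical helper: print the lines of $1 that do not match the pattern $2
filter_tests() {
    grep -v "$2" <<< "$1"
}

filter_tests "$TESTS" "$TEST_SKIPS"
```

Only the first test survives the filter here; the other two match one of the skip patterns.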
 [!NOTE]
 We ignore the cluster stability checks because our public cloud is not
 that stable and running with VMs instead of containers results in minor
 stability issues. Some of the old monitor stability tests do not respect
 the ``--cluster-stability`` setting; simply ignore these. If you
 get a message like ``invariant was violated`` or ``error: failed due to a
 MonitorTest failure``, it usually indicates that only those kinds of
 tests failed while the real tests passed. See
 [wrapped-openshift-tests.sh](https://github.com/openshift/release/blob/master/ci-operator/config/kata-containers/kata-containers/wrapped-openshift-tests.sh)
 for details on how our pipeline deals with that.
 [!TIP]
 To compare multiple results locally you can use the
 [junit2html](https://github.com/inorton/junit2html) tool.
 Best-effort kata-containers cleanup
 ===================================
 If you need to clean up the cluster after testing, you can use the
 ``cleanup.sh`` script from the current directory. It tries to delete all
 resources created by ``test.sh`` as well as ``cluster/deploy_webhook.sh``,
 ignoring all failures. The primary purpose of this script is to allow a
 soft cleanup after deployment so that different versions can be tested
 without re-provisioning everything.
 [!WARNING]
 **Do not rely on this script in production, return codes are not checked!**
 Bisecting e2e tests failures
 ============================
 Let's say the OCP pipeline passed running with
 ``quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
 but failed running with
 ``quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``
 and you'd like to know which PR caused the regression. You can either run with
 all 60 tags in between or you can use the [bisecter](https://github.com/ldoktor/bisecter)
 to minimize the number of steps needed.
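To give a rough idea of the savings, the sketch below estimates how many test runs a plain binary search needs for a given number of candidate tags. It assumes every step halves the remaining range and that no tag has to be skipped; ``steps_needed`` is a hypothetical helper and not part of bisecter.

```bash
#!/bin/bash
# Hypothetical helper: number of halving steps needed to narrow $1 candidates down to one
steps_needed() {
    local n="$1" steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # keep the larger half (worst case)
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

steps_needed 60   # prints 6
```

Under these assumptions about 6 runs are enough for 60 tags, instead of up to 60 when trying the tags one by one.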
 Before running the bisection you need a reproducer script. A sample called
 ``sample-test-reproducer.sh`` is provided in this directory but you might
 want to copy and modify it, especially:
 * ``OCP_DIR`` - directory where your openshift/release is located (can be exported)
 * ``E2E_TEST`` - openshift-test(s) to be executed (can be exported)
 * behaviour of SETUP (returning 125 skips the current image tag, returning
  >=128 interrupts the execution, everything else reports the tag as failure)
 * what should be executed (perhaps running the setup is enough for you or
  you might want to be looking for specific failures...)
 * use ``timeout`` to interrupt execution in case you know things should be faster
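The return-code convention from the list above can be sketched as a small lookup. ``classify_exit`` is a hypothetical helper that only restates the rules described here (with 0 taken as a passing run); it is not how bisecter itself is implemented.

```bash
#!/bin/bash
# Hypothetical helper: map a reproducer exit code to the meaning described above
classify_exit() {
    local rc="$1"
    if [ "$rc" -eq 0 ]; then
        echo good      # the tag passed
    elif [ "$rc" -eq 125 ]; then
        echo skip      # the tag can not be evaluated, skip it
    elif [ "$rc" -ge 128 ]; then
        echo abort     # interrupt the whole bisection
    else
        echo bad       # the tag is reported as a failure
    fi
}

for rc in 0 1 125 130 255; do
    printf '%s -> %s\n' "$rc" "$(classify_exit "$rc")"
done
```

Note that the sample reproducer exits with 255 on usage errors, which falls into the "interrupt" range, so a misconfigured run stops the bisection instead of marking tags as bad.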
 Executing that script with the GOOD commit should pass
 ``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
 and fail when executed with the BAD commit
 ``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``.
 To get the list of all tags in between those two PRs you can use the
 ``bisect-range.sh`` script
 ```bash
 ./bisect-range.sh d7afd31fd40e37a675b25c53618904ab57e74ccd 9f512c016e75599a4a921bd84ea47559fe610057
 ```
 [!NOTE]
 The tagged images are only built per PR, not for individual commits. See
 [kata-deploy-ci](https://quay.io/kata-containers/kata-deploy-ci) to see the
 available images.
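To illustrate the range selection without querying the registry, the sketch below applies the same ``sed`` and ``paste`` steps to a hard-coded tag list. The four tags are made up for the example and ``tag_range`` is a hypothetical wrapper; the real script obtains the tags via ``skopeo list-tags``.

```bash
#!/bin/bash
REPO="quay.io/kata-containers/kata-deploy-ci"

# Made-up tags standing in for the (already arch-filtered) registry listing
TAGS='kata-containers-aaa111-amd64
kata-containers-bbb222-amd64
kata-containers-ccc333-amd64
kata-containers-ddd444-amd64'

# Hypothetical wrapper: print "repo:tag" entries from GOOD ($2) up to BAD ($3), comma separated
tag_range() {
    local tags="$1" good="$2" bad="$3"
    # Keep everything from the first line matching GOOD to the last line
    tags=$(sed -n -e "/$good/,\$p" <<< "$tags")
    # When BAD is given, stop right after the first line matching it
    [ -n "$bad" ] && tags=$(sed "/$bad/q" <<< "$tags")
    # Prefix each tag with the repository and join the lines with commas
    sed -e "s@^@$REPO:@" <<< "$tags" | paste -s -d, -
}

tag_range "$TAGS" bbb222 ccc333
```

The result is a single comma-separated line containing the ``bbb222`` and ``ccc333`` images, which is the form passed to ``bisecter start``.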
 To find out which PR caused this regression, you can either manually try the
 individual commits or you can simply execute:
 ```bash
 bisecter start "$(./bisect-range.sh d7afd31fd40 9f512c016)"
 OCP_DIR=/path/to/openshift/release bisecter run ./sample-test-reproducer.sh
 ```
 [!NOTE]
 If you use ``KATA_WITH_SYSTEM_QEMU=yes`` you might want to deploy once with
 it and skip it for the cleanup. That way you might (in most cases) test
 all images with a single MCP update instead of per-image MCP update.
 [!TIP]
 You can check the bisection progress during/after execution by running
 ``bisecter log`` from the current directory. Before starting a new
 bisection you need to execute ``bisecter reset``.
--- a/ci/openshift-ci/bisect-range.sh
+++ b/ci/openshift-ci/bisect-range.sh
@ -0,0 +1,24 @@
 #!/bin/bash
 # Copyright (c) 2024 Red Hat, Inc.
 #
 # SPDX-License-Identifier: Apache-2.0
 #
 if [ "$#" -gt 2 ] || [ "$#" -lt 1 ] ; then
 	echo "Usage: $0 GOOD [BAD]"
 	echo "Prints list of available kata-deploy-ci tags between GOOD and BAD commits (by default BAD is the latest available tag)"
 	exit 255
 fi
 GOOD="$1"
 [ -n "$2" ] && BAD="$2"
 ARCH=amd64
 REPO="quay.io/kata-containers/kata-deploy-ci"
 TAGS=$(skopeo list-tags "docker://$REPO")
 # Only amd64
 TAGS=$(echo "$TAGS" | jq '.Tags' | jq "map(select(endswith(\"$ARCH\")))" | jq -r '.[]')
 # Tags since $GOOD (the escaped \$ is sed's last-line address, a bare $$ would expand to the shell PID)
 TAGS=$(echo "$TAGS" | sed -n -e "/$GOOD/,\$p")
 # Tags up to $BAD
 [ -n "$BAD" ] && TAGS=$(echo "$TAGS" | sed "/$BAD/q")
 # Comma separated tags with repo
 echo "$TAGS" | sed -e "s@^@$REPO:@" | paste -s -d, -
--- a/ci/openshift-ci/sample-test-reproducer.sh
+++ b/ci/openshift-ci/sample-test-reproducer.sh
@ -0,0 +1,50 @@
 #!/bin/bash
 # Copyright (c) 2024 Red Hat, Inc.
 #
 # SPDX-License-Identifier: Apache-2.0
 #
 # A sample script to deploy, configure, run E2E_TEST and soft-cleanup
 # afterwards OCP cluster using kata-containers primarily created for use
 # with https://github.com/ldoktor/bisecter
 [ "$#" -ne 1 ] && echo "Provide image as the first and only argument" && exit 255
 export KATA_DEPLOY_IMAGE="$1"
 OCP_DIR="${OCP_DIR:-/path/to/your/openshift/release/}"
 E2E_TEST="${E2E_TEST:-'"[sig-node] Container Runtime blackbox test on terminated container should report termination message as empty when pod succeeds and TerminationMessagePolicy FallbackToLogsOnError is set [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"'}"
 KATA_CI_DIR="${KATA_CI_DIR:-$(pwd)}"
 export KATA_RUNTIME="${KATA_RUNTIME:-kata-qemu}"
 ## SETUP
 # Deploy kata
 SETUP=0
 pushd "$KATA_CI_DIR" || { echo "Failed to cd to '$KATA_CI_DIR'"; exit 255; }
 ./test.sh || SETUP=125
 cluster/deploy_webhook.sh || SETUP=125
 if [ $SETUP != 0 ]; then
    ./cleanup.sh
    exit "$SETUP"
 fi
 popd || true
 # Disable security
 oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
 oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
 oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
 ## TEST EXECUTION
 # Run the testing
 pushd "$OCP_DIR" || { echo "Failed to cd to '$OCP_DIR'"; exit 255; }
 echo "$E2E_TEST" > /tmp/tsts
 # Remove leftover temporary files as well as previous results
 OUT=RESULTS/tmp
 rm -Rf /tmp/*test* /tmp/e2e-*
 rm -Rf "$OUT"
 mkdir -p "$OUT"
 # Run the tests ignoring the monitor health checks
 ./openshift-tests run --provider azure -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive
 RET=$?
 popd || true
 ## CLEANUP
 (cd "$KATA_CI_DIR" && ./cleanup.sh)
 exit "$RET"
--- a/tests/cmd/check-spelling/data/acronyms.txt
+++ b/tests/cmd/check-spelling/data/acronyms.txt
@ -11,6 +11,7 @@ AUFS # Another Union FS
 AWS/AB
 BDF/AB
 CFS/AB
 ci/AB
 CLI/AB
 CNI/AB
 CNM/AB
@ -33,6 +34,7 @@ gRPC/AB
 GSC/AB
 GVT/AB
 IaaS/B  # Infrastructure as a Service
 io/B
 IOMMU/AB
 IoT/AB	# Internet of Things
 IOV/AB
--- a/tests/cmd/check-spelling/data/main.txt
+++ b/tests/cmd/check-spelling/data/main.txt
@ -67,6 +67,7 @@ metadata
 microcontroller/AB
 miniOS
 mmap/AB
 MonitorTest/A
 nack/AB
 namespace/ABCD
 netlink
--- a/tests/cmd/check-spelling/data/projects.txt
+++ b/tests/cmd/check-spelling/data/projects.txt
@ -6,6 +6,7 @@
 Ansible/B
 AppArmor/B
 bisecter/B
 blogbench/B
 BusyBox/B
 Cassandra/B
@ -62,6 +63,7 @@ Netlify/B
 Nginx/B
 OpenCensus/B
 OpenPGP/B
 openshift/B  # lower-case used for some sub-projects
 OpenShift/B
 OpenSSL/B
 OpenStack/B
--- a/tests/cmd/check-spelling/kata-dictionary.dic
+++ b/tests/cmd/check-spelling/kata-dictionary.dic
@ -1,4 +1,4 @@
-386
+387
 ACPI/AB
 ACS/AB
 API/AB
@ -90,6 +90,7 @@ MITRE/B
 MacOS/B
 Mellanox/B
 Minikube/B
 MonitorTest/A
 NEMU/AB
 NIC/AB
 NVDIMM/AB
@ -197,6 +198,7 @@ backend
 backport/ACD
 backtick/AB
 backtrace
 bisecter/B
 blogbench/B
 bootloader/AB
 ccloudvm/B
@ -204,6 +206,7 @@ centric/B
 cgroup/AB
 checkbox/A
 chipset/AB
 ci/AB
 cnn/B
 codebase
 codecov/B
@ -255,6 +258,7 @@ init/AB
 initramfs/AB
 initrd/AB
 intel
 io/B
 ioctl/A
 iodepth/A
 ioengine/A
@ -295,6 +299,7 @@ netns/AB
 nvidia/A
 onwards
 openSUSE/B
 openshift/B
 osbuilder/B
 packagecloud/B
 parallelize/AC