mirror of
https://github.com/kata-containers/kata-containers.git
synced 2025-04-28 03:42:09 +00:00
ci.ocp: Add steps to reproduce/bisect CI runs
In case the upstream CI fails, it's useful to pinpoint the PR that caused the regression. Currently openshift-ci does not allow doing that from their setup, but we can mimic the setup on our infrastructure and use the available kata-deploy-ci images to find the first failing one. To help with that, add a few helper scripts and a howto.

Fixes: #9228
Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
parent
a556ad7e01
commit
f994f79078
@ -8,3 +8,142 @@ There are 2 pipelines, history and logs can be accessed here:
|
||||
|
||||
* [main - currently supported OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-e2e-tests)
|
||||
* [next - currently under development OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-next-e2e-tests)
|
||||
|
||||
|
||||
Running openshift-tests on OCP with kata-containers manually
|
||||
============================================================
|
||||
|
||||
To run openshift-tests (or other suites) with kata-containers one can use
|
||||
the kata-webhook. To deploy everything you can mimic the CI pipeline by:
|
||||
|
||||
```bash
|
||||
#!/bin/bash -e
|
||||
# Set up kubectl and check that the cluster is reachable
kubectl get nodes
|
||||
# Deploy kata (set KATA_DEPLOY_IMAGE to override the default kata-deploy-ci:latest image)
|
||||
./test.sh
|
||||
# Deploy the webhook
|
||||
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
|
||||
```
|
||||
|
||||
This should ensure kata-containers as well as kata-webhook are installed and
|
||||
working. Before running the openshift-tests it's (currently) recommended to
|
||||
ignore some security features by:
|
||||
|
||||
```bash
|
||||
#!/bin/bash -e
|
||||
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
|
||||
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
|
||||
oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
|
||||
```
|
||||
|
||||
Now you should be ready to run the openshift-tests. Our CI only uses a subset
|
||||
of tests, to get the current ``TEST_SKIPS`` see
|
||||
[the pipeline config](https://github.com/openshift/release/tree/master/ci-operator/config/kata-containers/kata-containers).
|
||||
The following steps require the [openshift tests](https://github.com/openshift/origin)
to be cloned and built in the current directory:
|
||||
|
||||
```bash
|
||||
#!/bin/bash -e
|
||||
# Define tests to be skipped (see the pipeline config for the current version)
|
||||
TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-node\] Variable Expansion should allow substituting values in a volume subpath\|\[k8s.io\] Probing container should be restarted with a docker exec liveness probe with timeout\|\[sig-node\] Pods Extended Pod Container lifecycle evicted pods should be terminal\|\[sig-node\] PodOSRejection \[NodeConformance\] Kubelet should reject pod when the node OS doesn't match pod's OS\|\[sig-network\].*for evicted pods\|\[sig-network\].*HAProxy router should override the route\|\[sig-network\].*HAProxy router should serve a route\|\[sig-network\].*HAProxy router should serve the correct\|\[sig-network\].*HAProxy router should run\|\[sig-network\].*when FIPS.*the HAProxy router\|\[sig-network\].*bond\|\[sig-network\].*all sysctl on whitelist\|\[sig-network\].*sysctls should not affect\|\[sig-network\] pods should successfully create sandboxes by adding pod to network"
|
||||
# Get the list of tests to be executed
|
||||
TESTS="$(./openshift-tests run --dry-run --provider "${TEST_PROVIDER}" "${TEST_SUITE}")"
|
||||
# Store the list of tests in /tmp/tsts file
|
||||
echo "${TESTS}" | grep -v "$TEST_SKIPS" > /tmp/tsts
|
||||
# Remove previously existing temporary files as well as previous results
|
||||
OUT=RESULTS/tmp
|
||||
rm -Rf /tmp/*test* /tmp/e2e-*
|
||||
rm -Rf "$OUT"
|
||||
mkdir -p $OUT
|
||||
# Run the tests ignoring the monitor health checks
|
||||
./openshift-tests run --provider azure -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive --run '^\[sig-node\].*|^\[sig-network\]'
|
||||
```
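
The skip filtering above relies on ``grep -v`` with ``\|`` alternation (a GNU grep extension in basic regular expressions). The following is a minimal sketch of that mechanism only; the test names in ``sample_tests`` are hypothetical stand-ins for the real ``--dry-run`` output, and the pattern is a shortened excerpt of ``TEST_SKIPS``:

```bash
#!/bin/bash
# Hypothetical sample of a dry-run listing; real names come from openshift-tests.
sample_tests() {
    cat <<'EOF'
"[sig-node] Pods should be submitted and removed"
"[sig-network] bond interface should carry traffic"
"[sig-node] Security Context should support seccomp runtime/default"
EOF
}

# Shortened skip pattern in the same style as TEST_SKIPS.
TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-network\].*bond"

# Keep only the tests that match none of the skip alternatives.
filter_tests() {
    grep -v "$TEST_SKIPS"
}

sample_tests | filter_tests
```

Only the first sample test survives the filter; the other two match one of the alternatives and are dropped.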
|
||||
|
||||
[!NOTE]
|
||||
Note we are ignoring the cluster stability checks because our public cloud is
|
||||
not that stable and running with VMs instead of containers results in minor
|
||||
stability issues. Some of the old monitor stability tests do not reflect
|
||||
the ``--cluster-stability`` setting, one should simply ignore these. If you
|
||||
get a message like ``invariant was violated`` or ``error: failed due to a
|
||||
MonitorTest failure``, it's usually an indication that only those kinds of
|
||||
tests failed but the real tests passed. See
|
||||
[wrapped-openshift-tests.sh](https://github.com/openshift/release/blob/master/ci-operator/config/kata-containers/kata-containers/wrapped-openshift-tests.sh)
|
||||
for details on how our pipeline deals with that.
|
||||
|
||||
[!TIP]
|
||||
To compare multiple results locally one can use
|
||||
[junit2html](https://github.com/inorton/junit2html) tool.
|
||||
|
||||
|
||||
Best-effort kata-containers cleanup
|
||||
===================================
|
||||
|
||||
If you need to clean up the cluster after testing, you can use the
|
||||
``cleanup.sh`` script from the current directory. It tries to delete all
|
||||
resources created by ``test.sh`` as well as ``cluster/deploy_webhook.sh``
|
||||
ignoring all failures. The primary purpose of this script is to allow
|
||||
soft-cleanup after deployment to test different versions without
|
||||
re-provisioning everything.
|
||||
|
||||
[!WARNING]
|
||||
Do not rely on this script in production; return codes are not checked!
|
||||
|
||||
|
||||
Bisecting e2e tests failures
|
||||
============================
|
||||
|
||||
Let's say the OCP pipeline passed running with
|
||||
``quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
|
||||
but failed running with
|
||||
``quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``
|
||||
and you'd like to know which PR caused the regression. You can either run with
|
||||
all the 60 tags between or you can utilize the [bisecter](https://github.com/ldoktor/bisecter)
|
||||
to optimize the number of steps in between.
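
To get a feel for the saving, here is a rough sketch that counts the worst-case number of halvings needed to narrow a range of candidates down to one. It is plain binary-search arithmetic, not bisecter's actual algorithm, and ``bisect_steps`` is a hypothetical helper name:

```bash
#!/bin/bash
# Worst-case number of halvings needed to narrow N candidates down to one.
bisect_steps() {
    local n="$1" steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # keep the larger half to stay pessimistic
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 60
```

For 60 tags this prints ``6``, so a bisection should need on the order of six reproducer runs instead of up to 60 (skipped tags can add a few more).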
|
||||
|
||||
Before running the bisection you need a reproducer script. A sample one called
|
||||
``sample-test-reproducer.sh`` is provided in this directory but you might
|
||||
want to copy and modify it, especially:
|
||||
|
||||
* ``OCP_DIR`` - directory where your openshift/release is located (can be exported)
|
||||
* ``E2E_TEST`` - openshift-test(s) to be executed (can be exported)
|
||||
* behaviour of SETUP (returning 125 skips the current image tag, returning
  >=128 interrupts the execution, everything else reports the tag as failure)
|
||||
* what should be executed (perhaps running the setup is enough for you or
|
||||
you might want to be looking for specific failures...)
|
||||
* use ``timeout`` to interrupt execution in case you know things should be faster
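
The return-code convention described in the list can be summarised with a small sketch. ``classify_exit`` is a hypothetical helper used only for illustration; it is not part of bisecter or of the sample script:

```bash
#!/bin/bash
# Map a reproducer exit code to the verdict described above.
classify_exit() {
    local rc="$1"
    if [ "$rc" -eq 0 ]; then
        echo "good"     # the tag passed
    elif [ "$rc" -eq 125 ]; then
        echo "skip"     # the tag could not be tested, e.g. setup failed
    elif [ "$rc" -ge 128 ]; then
        echo "abort"    # interrupt the whole bisection
    else
        echo "bad"      # any other code marks the tag as failing
    fi
}

for rc in 0 1 124 125 255; do
    echo "$rc -> $(classify_exit "$rc")"
done
```

Note that GNU ``timeout`` exits with 124 when the command times out, so under this convention a timed-out run counts as a failing tag.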
|
||||
|
||||
Executing that script with the GOOD commit should pass
|
||||
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
|
||||
and fail when executed with the BAD commit
|
||||
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``.
|
||||
|
||||
To get the list of all tags in between those two PRs you can use the
|
||||
``bisect-range.sh`` script:
|
||||
|
||||
```bash
|
||||
./bisect-range.sh d7afd31fd40e37a675b25c53618904ab57e74ccd 9f512c016e75599a4a921bd84ea47559fe610057
|
||||
```
|
||||
|
||||
[!NOTE]
|
||||
The tagged images are only built per PR, not for individual commits. See
|
||||
[kata-deploy-ci](https://quay.io/kata-containers/kata-deploy-ci) to see the
|
||||
available images.
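
As an illustration of how such a range can be cut out of a tag listing, the sketch below applies the same ``sed``/``paste`` approach to a hypothetical, shortened tag list. The tag names are made up, and it assumes the registry returns tags in the order the images were built:

```bash
#!/bin/bash
REPO="quay.io/kata-containers/kata-deploy-ci"

# Hypothetical tag list standing in for the skopeo output.
sample_tags() {
    printf '%s\n' \
        kata-containers-aaa111-amd64 \
        kata-containers-bbb222-amd64 \
        kata-containers-ccc333-amd64 \
        kata-containers-ddd444-amd64
}

# Print tags from the first match of GOOD up to the first match of BAD,
# prefixed with the repository and joined by commas.
tag_range() {
    local good="$1" bad="$2"
    sed -n -e "/$good/,\$p" | sed -e "/$bad/q" | sed -e "s@^@$REPO:@" | paste -s -d, -
}

sample_tags | tag_range bbb222 ccc333
```

The result is a single comma-separated line containing the ``bbb222`` and ``ccc333`` images, which is the form passed to ``bisecter start``.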
|
||||
|
||||
To find out which PR caused this regression, you can either manually try the
|
||||
individual commits or you can simply execute:
|
||||
|
||||
```bash
|
||||
bisecter start "$(./bisect-range.sh d7afd31fd40 9f512c016)"
|
||||
OCP_DIR=/path/to/openshift/release bisecter run ./sample-test-reproducer.sh
|
||||
```
|
||||
|
||||
[!NOTE]
|
||||
If you use ``KATA_WITH_SYSTEM_QEMU=yes`` you might want to deploy once with
|
||||
it and skip it for the cleanup. That way you might (in most cases) test
|
||||
all images with a single MCP update instead of per-image MCP update.
|
||||
|
||||
[!TIP]
|
||||
You can check the bisection progress during/after execution by running
|
||||
``bisecter log`` from the current directory. Before starting a new
|
||||
bisection you need to execute ``bisecter reset``.
|
||||
|
24
ci/openshift-ci/bisect-range.sh
Executable file
@ -0,0 +1,24 @@
|
||||
#!/bin/bash
|
||||
# Copyright (c) 2024 Red Hat, Inc.
|
||||
#
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
if [ "$#" -gt 2 ] || [ "$#" -lt 1 ] ; then
|
||||
echo "Usage: $0 GOOD [BAD]"
|
||||
echo "Prints list of available kata-deploy-ci tags between GOOD and BAD commits (by default BAD is the latest available tag)"
|
||||
exit 255
|
||||
fi
|
||||
GOOD="$1"
|
||||
[ -n "$2" ] && BAD="$2"
|
||||
ARCH=amd64
|
||||
REPO="quay.io/kata-containers/kata-deploy-ci"
|
||||
|
||||
TAGS=$(skopeo list-tags "docker://$REPO")
|
||||
# Only amd64
|
||||
TAGS=$(echo "$TAGS" | jq '.Tags' | jq "map(select(endswith(\"$ARCH\")))" | jq -r '.[]')
|
||||
# Tags since $GOOD
|
||||
TAGS=$(echo "$TAGS" | sed -n -e "/$GOOD/,\$p")
|
||||
# Tags up to $BAD
|
||||
[ -n "$BAD" ] && TAGS=$(echo "$TAGS" | sed "/$BAD/q")
|
||||
# Comma separated tags with repo
|
||||
echo "$TAGS" | sed -e "s@^@$REPO:@" | paste -s -d, -
|
50
ci/openshift-ci/sample-test-reproducer.sh
Executable file
@ -0,0 +1,50 @@
|
||||
#!/bin/bash
|
||||
# Copyright (c) 2024 Red Hat, Inc.
|
||||
#
|
||||
# SPDX-License-Identifier: Apache-2.0
|
||||
#
|
||||
# A sample script to deploy, configure, run E2E_TEST and soft-cleanup
|
||||
# afterwards OCP cluster using kata-containers primarily created for use
|
||||
# with https://github.com/ldoktor/bisecter
|
||||
|
||||
[ "$#" -ne 1 ] && echo "Provide image as the first and only argument" && exit 255
|
||||
export KATA_DEPLOY_IMAGE="$1"
|
||||
OCP_DIR="${OCP_DIR:-/path/to/your/openshift/release/}"
|
||||
E2E_TEST="${E2E_TEST:-'"[sig-node] Container Runtime blackbox test on terminated container should report termination message as empty when pod succeeds and TerminationMessagePolicy FallbackToLogsOnError is set [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"'}"
|
||||
KATA_CI_DIR="${KATA_CI_DIR:-$(pwd)}"
|
||||
export KATA_RUNTIME="${KATA_RUNTIME:-kata-qemu}"
|
||||
|
||||
## SETUP
|
||||
# Deploy kata
|
||||
SETUP=0
|
||||
pushd "$KATA_CI_DIR" || { echo "Failed to cd to '$KATA_CI_DIR'"; exit 255; }
|
||||
./test.sh || SETUP=125
|
||||
cluster/deploy_webhook.sh || SETUP=125
|
||||
if [ $SETUP != 0 ]; then
|
||||
./cleanup.sh
|
||||
exit "$SETUP"
|
||||
fi
|
||||
popd || true
|
||||
# Disable security
|
||||
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
|
||||
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
|
||||
oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
|
||||
|
||||
## TEST EXECUTION
|
||||
# Run the testing
|
||||
pushd "$OCP_DIR" || { echo "Failed to cd to '$OCP_DIR'"; exit 255; }
|
||||
echo "$E2E_TEST" > /tmp/tsts
|
||||
# Remove previously existing temporary files as well as previous results
|
||||
OUT=RESULTS/tmp
|
||||
rm -Rf /tmp/*test* /tmp/e2e-*
|
||||
rm -Rf "$OUT"
|
||||
mkdir -p $OUT
|
||||
# Run the tests ignoring the monitor health checks
|
||||
./openshift-tests run --provider azure -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive
|
||||
RET=$?
|
||||
popd || true
|
||||
|
||||
## CLEANUP
|
||||
( cd "$KATA_CI_DIR" && ./cleanup.sh )
|
||||
exit "$RET"
|
||||
|
@ -11,6 +11,7 @@ AUFS # Another Union FS
|
||||
AWS/AB
|
||||
BDF/AB
|
||||
CFS/AB
|
||||
ci/AB
|
||||
CLI/AB
|
||||
CNI/AB
|
||||
CNM/AB
|
||||
@ -33,6 +34,7 @@ gRPC/AB
|
||||
GSC/AB
|
||||
GVT/AB
|
||||
IaaS/B # Infrastructure as a Service
|
||||
io/B
|
||||
IOMMU/AB
|
||||
IoT/AB # Internet of Things
|
||||
IOV/AB
|
||||
|
@ -67,6 +67,7 @@ metadata
|
||||
microcontroller/AB
|
||||
miniOS
|
||||
mmap/AB
|
||||
MonitorTest/A
|
||||
nack/AB
|
||||
namespace/ABCD
|
||||
netlink
|
||||
|
@ -6,6 +6,7 @@
|
||||
|
||||
Ansible/B
|
||||
AppArmor/B
|
||||
bisecter/B
|
||||
blogbench/B
|
||||
BusyBox/B
|
||||
Cassandra/B
|
||||
@ -62,6 +63,7 @@ Netlify/B
|
||||
Nginx/B
|
||||
OpenCensus/B
|
||||
OpenPGP/B
|
||||
openshift/B # lower-case used for some sub-projects
|
||||
OpenShift/B
|
||||
OpenSSL/B
|
||||
OpenStack/B
|
||||
|
@ -1,4 +1,4 @@
|
||||
386
|
||||
387
|
||||
ACPI/AB
|
||||
ACS/AB
|
||||
API/AB
|
||||
@ -90,6 +90,7 @@ MITRE/B
|
||||
MacOS/B
|
||||
Mellanox/B
|
||||
Minikube/B
|
||||
MonitorTest/A
|
||||
NEMU/AB
|
||||
NIC/AB
|
||||
NVDIMM/AB
|
||||
@ -197,6 +198,7 @@ backend
|
||||
backport/ACD
|
||||
backtick/AB
|
||||
backtrace
|
||||
bisecter/B
|
||||
blogbench/B
|
||||
bootloader/AB
|
||||
ccloudvm/B
|
||||
@ -204,6 +206,7 @@ centric/B
|
||||
cgroup/AB
|
||||
checkbox/A
|
||||
chipset/AB
|
||||
ci/AB
|
||||
cnn/B
|
||||
codebase
|
||||
codecov/B
|
||||
@ -255,6 +258,7 @@ init/AB
|
||||
initramfs/AB
|
||||
initrd/AB
|
||||
intel
|
||||
io/B
|
||||
ioctl/A
|
||||
iodepth/A
|
||||
ioengine/A
|
||||
@ -295,6 +299,7 @@ netns/AB
|
||||
nvidia/A
|
||||
onwards
|
||||
openSUSE/B
|
||||
openshift/B
|
||||
osbuilder/B
|
||||
packagecloud/B
|
||||
parallelize/AC
|
||||
|