OpenShift CI
============

This directory contains scripts used by
[the OpenShift CI](https://github.com/openshift/release/tree/master/ci-operator/config/kata-containers/kata-containers)
pipelines to monitor selected functional tests on OpenShift.
There are 2 pipelines; their history and logs can be accessed here:

* [main - currently supported OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-e2e-tests)
* [next - currently under development OCP](https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-kata-containers-kata-containers-main-next-e2e-tests)


Running openshift-tests on OCP with kata-containers manually
============================================================

To run openshift-tests (or other suites) with kata-containers one can use
the kata-webhook. To deploy everything you can mimic the CI pipeline by:

```bash
#!/bin/bash -e
# Set up kubectl and check the cluster is accessible by
kubectl get nodes
# Deploy kata (set KATA_DEPLOY_IMAGE to override the default kata-deploy-ci:latest image)
./test.sh
# Deploy the webhook
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
```
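
A quick sanity check that both the runtime and the webhook actually work might
look like the sketch below (it assumes the ``kata-qemu`` runtime class deployed
above and that the webhook mutates pods in the current namespace; the pod name
and image are arbitrary):

```bash
#!/bin/bash -e
# The RuntimeClass created by kata-deploy should exist
kubectl get runtimeclass kata-qemu
# Start a throw-away pod and check the webhook switched it to the kata runtime
kubectl run webhook-check --image=quay.io/prometheus/busybox --restart=Never --command -- sleep 600
kubectl wait --for=condition=Ready pod/webhook-check --timeout=5m
kubectl get pod webhook-check -o jsonpath='{.spec.runtimeClassName}{"\n"}'   # expect: kata-qemu
kubectl delete pod webhook-check
```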

This should ensure kata-containers as well as the kata-webhook are installed
and working. Before running the openshift-tests it's (currently) recommended
to relax some security restrictions by:

```bash
#!/bin/bash -e
oc adm policy add-scc-to-group privileged system:authenticated system:serviceaccounts
oc adm policy add-scc-to-group anyuid system:authenticated system:serviceaccounts
oc label --overwrite ns default pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline
```

Now you should be ready to run the openshift-tests. Our CI only uses a subset
of the tests; to get the current ``TEST_SKIPS`` see
[the pipeline config](https://github.com/openshift/release/tree/master/ci-operator/config/kata-containers/kata-containers).
The following steps require the [openshift tests](https://github.com/openshift/origin)
repository to be cloned and built in the current directory:

```bash
#!/bin/bash -e
# TEST_PROVIDER (e.g. azure) and TEST_SUITE (e.g. openshift/conformance/parallel)
# are expected to be set in your environment
# Define tests to be skipped (see the pipeline config for the current version)
TEST_SKIPS="\[sig-node\] Security Context should support seccomp runtime/default\|\[sig-node\] Variable Expansion should allow substituting values in a volume subpath\|\[k8s.io\] Probing container should be restarted with a docker exec liveness probe with timeout\|\[sig-node\] Pods Extended Pod Container lifecycle evicted pods should be terminal\|\[sig-node\] PodOSRejection \[NodeConformance\] Kubelet should reject pod when the node OS doesn't match pod's OS\|\[sig-network\].*for evicted pods\|\[sig-network\].*HAProxy router should override the route\|\[sig-network\].*HAProxy router should serve a route\|\[sig-network\].*HAProxy router should serve the correct\|\[sig-network\].*HAProxy router should run\|\[sig-network\].*when FIPS.*the HAProxy router\|\[sig-network\].*bond\|\[sig-network\].*all sysctl on whitelist\|\[sig-network\].*sysctls should not affect\|\[sig-network\] pods should successfully create sandboxes by adding pod to network"
# Get the list of tests to be executed
TESTS="$(./openshift-tests run --dry-run --provider "${TEST_PROVIDER}" "${TEST_SUITE}")"
# Store the list of tests in the /tmp/tsts file
echo "${TESTS}" | grep -v "$TEST_SKIPS" > /tmp/tsts
# Remove previously-existing temporary files as well as previous results
OUT=RESULTS/tmp
rm -Rf /tmp/*test* /tmp/e2e-*
rm -Rf "$OUT"
mkdir -p "$OUT"
# Run the tests, ignoring the monitor health checks
./openshift-tests run --provider "${TEST_PROVIDER}" -o "$OUT/job.log" --junit-dir "$OUT" --file /tmp/tsts --max-parallel-tests 5 --cluster-stability Disruptive --run '^\[sig-node\].*|^\[sig-network\]'
```
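
Once the run finishes, the human-readable log is in ``$OUT/job.log`` and the
junit XML files are in ``$OUT``. As a quick, tooling-free way to spot failing
tests one can, for example, grep the junit files (the exact file names depend
on what openshift-tests generates):

```bash
# List the junit files that recorded at least one failure
grep -l '<failure' "$OUT"/*.xml || echo "no failures recorded"
```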

> [!NOTE]
> We are ignoring the cluster stability checks because our public cloud is not
> that stable and running with VMs instead of containers results in minor
> stability issues. Some of the old monitor stability tests do not reflect the
> ``--cluster-stability`` setting; simply ignore these. If you get a message
> like ``invariant was violated`` or ``error: failed due to a MonitorTest
> failure``, it's usually an indication that only those kinds of tests failed
> while the real tests passed. See
> [wrapped-openshift-tests.sh](https://github.com/openshift/release/blob/master/ci-operator/config/kata-containers/kata-containers/wrapped-openshift-tests.sh)
> for details on how our pipeline deals with that.

> [!TIP]
> To compare multiple results locally one can use the
> [junit2html](https://github.com/inorton/junit2html) tool.
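
For instance, to turn the junit files from two runs into browsable reports and
a side-by-side matrix (the file names below are just placeholders, and the
``--report-matrix`` option is assumed from the junit2html documentation, with
the tool installed e.g. via ``pip install junit2html``):

```bash
# Render a single run as a standalone HTML report
junit2html RESULTS/run1/junit_e2e.xml run1.html
# Merge several runs into one comparison matrix
junit2html --report-matrix comparison.html RESULTS/run1/*.xml RESULTS/run2/*.xml
```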

Best-effort kata-containers cleanup
===================================

If you need to clean up the cluster after testing, you can use the
``cleanup.sh`` script from the current directory. It tries to delete all
resources created by ``test.sh`` as well as ``cluster/deploy_webhook.sh``,
ignoring all failures. The primary purpose of this script is to allow a soft
cleanup after deployment so that different versions can be tested without
re-provisioning everything.
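
For example, to try another kata-deploy image on the same cluster (the image
tag below is only a placeholder):

```bash
# Best-effort removal of kata-containers and the webhook (failures are ignored)
./cleanup.sh
# Re-deploy using a different kata-deploy-ci image and re-install the webhook
KATA_DEPLOY_IMAGE=quay.io/kata-containers/kata-deploy-ci:kata-containers-<tag>-amd64 ./test.sh
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh
```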

> [!WARNING]
> **Do not rely on this script in production; return codes are not checked!**

Bisecting e2e tests failures
============================

Let's say the OCP pipeline passed running with
``quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
but failed running with
``quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``
and you'd like to know which PR caused the regression. You can either run with
all of the 60 tags in between, or you can use [bisecter](https://github.com/ldoktor/bisecter)
to minimize the number of steps.

Before running the bisection you need a reproducer script. A sample one called
``sample-test-reproducer.sh`` is provided in this directory, but you might want
to copy and modify it, especially:

* ``OCP_DIR`` - directory where your openshift/release is located (can be exported)
* ``E2E_TEST`` - openshift-test(s) to be executed (can be exported)
* the behaviour of SETUP (returning 125 skips the current image tag, returning
  >=128 interrupts the execution, everything else reports the tag as a failure;
  see the sketch after this list)
* what should be executed (perhaps running the setup is enough for you, or you
  might want to look for specific failures...)
* use ``timeout`` to interrupt the execution in case you know things should be faster
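
A minimal reproducer honouring that exit-code contract could look like the
sketch below. It is only an illustration: ``run-my-e2e-test.sh`` is a
placeholder for whatever check you are actually bisecting, and the timeouts
are arbitrary.

```bash
#!/bin/bash
# Usage: my-reproducer.sh <kata-deploy-image>
# Exit status convention (as described above):
#   125    - this image cannot be tested, skip it
#   >=128  - fatal problem, interrupt the whole bisection
#   other  - 0 means the image is good, anything else marks it as bad
IMAGE="$1"

# SETUP: best-effort cleanup, then deploy the image under test
./cleanup.sh || true
KATA_DEPLOY_IMAGE="$IMAGE" timeout 30m ./test.sh || exit 125
KATA_RUNTIME=kata-qemu cluster/deploy_webhook.sh || exit 128

# TEST: the exit status of the last command becomes the verdict for this image
timeout 1h ./run-my-e2e-test.sh "$E2E_TEST"
```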

Executing that script with the GOOD commit should pass
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-d7afd31fd40e37a675b25c53618904ab57e74ccd-amd64``
and fail when executed with the BAD commit
``./sample-test-reproducer.sh quay.io/kata-containers/kata-deploy-ci:kata-containers-9f512c016e75599a4a921bd84ea47559fe610057-amd64``.

To get the list of all tags between those two PRs you can use the
``bisect-range.sh`` script:

```bash
./bisect-range.sh d7afd31fd40e37a675b25c53618904ab57e74ccd 9f512c016e75599a4a921bd84ea47559fe610057
```

> [!NOTE]
> The tagged images are only built per PR, not for individual commits. See
> [kata-deploy-ci](https://quay.io/kata-containers/kata-deploy-ci) for the
> list of available images.

To find out which PR caused the regression, you can either manually try the
individual tags or simply execute:

```bash
bisecter start "$(./bisect-range.sh d7afd31fd40 9f512c016)"
OCP_DIR=/path/to/openshift/release bisecter run ./sample-test-reproducer.sh
```

> [!NOTE]
> If you use ``KATA_WITH_SYSTEM_QEMU=yes`` you might want to deploy once with
> it and skip it for the cleanup. That way you might (in most cases) test
> all images with a single MCP update instead of a per-image MCP update.

> [!TIP]
> You can check the bisection progress during/after the execution by running
> ``bisecter log`` from the current directory. Before starting a new
> bisection you need to execute ``bisecter reset``.