hack/ginkgo-e2e.sh: forward TERM/INT to Ginkgo

When a job such as pull-kubernetes-e2e-kind times out, ginkgo-e2e.sh currently gets killed with SIGTERM. The signal is not propagated to the E2E test suite processes, so, depending on timing during process shutdown, there may be no "Interrupted by User" report and no JUnit file.

Running the Ginkgo CLI with job control enabled creates a new process group, which can then be used to signal both the Ginkgo CLI and the E2E test suite processes.

With these changes, more information is produced. Some of it seems a bit redundant, but it's better than none:

    *** hack/ginkgo-e2e.sh: received termination signal -> asking Ginkgo to stop.
    ***
    *** Beware that a timeout may have been caused by some earlier test,
    *** not necessarily the one which gets interrupted now.
    *** See the "Spec runtime" for information about how long the
    *** interrupted test was running.

    ------------------------------
    Interrupted by User
    First interrupt received; Ginkgo will run any cleanup and reporting nodes but will skip all remaining specs. Interrupt again to skip cleanup.
    Here's a current progress report:
    [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9.065s)
    k8s.io/kubernetes/test/e2e/dra/dra.go:812
    In [It] (Node Runtime: 9.044s)
    k8s.io/kubernetes/test/e2e/dra/dra.go:812
    At [By Step] Creating slices (Step Runtime: 8.884s)
    k8s.io/kubernetes/test/e2e/dra/dra.go:847
    ...
    Begin Additional Progress Reports >>
    There is no failure as the matcher passed to Consistently has not yet failed
    << End Additional Progress Reports
    ------------------------------
    • [INTERRUPTED] [11.955 seconds]
    [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
    k8s.io/kubernetes/test/e2e/dra/dra.go:812

    Timeline >>
    STEP: Creating a kubernetes client @ 01/09/25 17:18:59.769
    ...
    [FAILED] in [It] - k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
    I0109 17:19:11.703212 302727 helper.go:125] Waiting up to 7m0s for all (but 0) nodes to be ready
    STEP: dump namespace information after failure @ 01/09/25 17:19:11.706
    STEP: Collecting events from namespace "dra-7998". @ 01/09/25 17:19:11.706
    STEP: Found 0 events. @ 01/09/25 17:19:11.708
    ...
    STEP: Destroying namespace "dra-7998" for this suite. @ 01/09/25 17:19:11.72
    << Timeline

    [INTERRUPTED] Interrupted by User
    In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:812 @ 01/09/25 17:19:08.833

    This is the Progress Report generated when the interrupt was received:
    [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller creates slices (Spec Runtime: 9.065s)
    ...

    [FAILED] An interrupt occurred and then the following failure was recorded in the interrupted node before it exited:
    Context was cancelled (cause: Interrupted by User) after 0.329s.
    There is no failure as the matcher passed to Consistently has not yet failed
    In [It] at: k8s.io/kubernetes/test/e2e/dra/dra.go:881 @ 01/09/25 17:19:08.835
    ------------------------------
    Checking for custom logdump instances, if any
    ----------------------------------------------------------------------------------------------------
    k/k version of the log-dump.sh script is deprecated!
    Please migrate your test job to use test-infra's repo version of log-dump.sh!
    Migration steps can be found in the readme file.
    ----------------------------------------------------------------------------------------------------
    Sourcing kube-util.sh
    Detecting project
    Skeleton Provider: detect-project not implemented
    Dumping logs from master locally to '/tmp/test'
    Master SSH not supported for local
    Dumping logs from nodes locally to '/tmp/test'
    Node SSH not supported for local

    Summarizing 1 Failure:
    [INTERRUPTED] [sig-node] DRA [Feature:DynamicResourceAllocation] [FeatureGate:DynamicResourceAllocation] [Beta] ResourceSlice Controller [It] creates slices [sig-node, Feature:DynamicResourceAllocation, FeatureGate:DynamicResourceAllocation, Feature:Beta]
    k8s.io/kubernetes/test/e2e/dra/dra.go:812

    Ran 1 of 6644 Specs in 12.208 seconds
    FAIL! - Interrupted by User -- 0 Passed | 1 Failed | 0 Pending | 6643 Skipped
    --- FAIL: TestE2E (12.74s)
    FAIL

    Ginkgo ran 1 suite in 13.379078611s
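
The process-group mechanism described in this message can be tried in isolation. The following is only a minimal sketch, not part of the change: `fake_cli` and `worker` are hypothetical stand-ins for the Ginkgo CLI and an e2e.test process, and the temporary `report` and `ready` files exist only for this illustration. It assumes bash and shows that, with `set -m`, a SIGTERM sent to the negated PID of the background job also reaches the process forked by that job.

```shell
#!/usr/bin/env bash
# Job control: each background job of this script gets its own process
# group, whose ID equals the PID of the job's first process.
set -m

report=$(mktemp)          # hypothetical file standing in for a final report
ready="${report}.ready"   # hypothetical marker: the worker has set up its trap
export report ready

# Hypothetical stand-in for an e2e.test process: writes a one-line "report"
# when it receives SIGTERM, then exits.
worker() {
  trap 'echo "interrupted" >"${report}"; exit 0' TERM
  touch "${ready}"
  while :; do sleep 1; done
}

# Hypothetical stand-in for the Ginkgo CLI: forks a worker and waits for it.
fake_cli() {
  trap ':' TERM      # survive SIGTERM so that the worker can be awaited
  bash -c worker &   # no job control here: the worker stays in our group
  wait               # returns early if SIGTERM arrives while waiting
  wait               # then wait until the worker has finished
}
export -f worker fake_cli

bash -c fake_cli &
cli_pid=$!

# Do not signal before the worker has installed its trap.
until [ -e "${ready}" ]; do sleep 1; done

# A negative PID addresses the whole process group, including the worker.
kill -TERM "-${cli_pid}"
wait "${cli_pid}" || true
echo "worker report: $(cat "${report}")"
rm -f "${ready}"
```

Without `set -m`, the background job would share the script's own process group, so the same `kill` would not be a safe way to address only the CLI and its workers.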
2025-09-21 09:57:52 +00:00 · 2025-01-09 14:35:39 +01:00
parent 2d0a4f7556
commit ce9e398641
1 changed files with 48 additions and 1 deletions
--- a/hack/ginkgo-e2e.sh
+++ b/hack/ginkgo-e2e.sh
@@ -204,6 +204,49 @@ fi
 # is not used.
 suite_args+=(--report-complete-ginkgo --report-complete-junit)

+# When SIGTERM doesn't reach the E2E test suite binaries, ginkgo will exit
+# without collecting information about the currently running and
+# potentially stuck tests. This seems to happen when Prow shuts down a test
+# job because of a timeout.
+#
+# It's useful to print one final progress report in that case,
+# so GINKGO_PROGRESS_REPORT_ON_SIGTERM (enabled by default when CI=true)
+# catches SIGTERM and forwards it to all processes spawned by ginkgo.
+#
+# Manual invocations can trigger a similar report with `killall -USR1 e2e.test`
+# without having to kill the test run.
+GINKGO_CLI_PID=
+signal_handler() {
+  if [ -n "${GINKGO_CLI_PID}" ]; then
+    cat <<EOF
+
+*** $0: received $1 signal -> asking Ginkgo to stop.
+***
+*** Beware that a timeout may have been caused by some earlier test,
+*** not necessarily the one which gets interrupted now.
+*** See the "Spec runtime" for information about how long the
+*** interrupted test was running.
+
+EOF
+    # This goes to the process group, which is important because we
+    # need to reach the e2e.test processes forked by the Ginkgo CLI.
+    kill -TERM "-${GINKGO_CLI_PID}" || true
+
+    echo "Waiting for Ginkgo with pid ${GINKGO_CLI_PID}..."
+    wait "${GINKGO_CLI_PID}"
+    echo "Ginkgo terminated."
+  fi
+}
+case "${GINKGO_PROGRESS_REPORT_ON_SIGTERM:-${CI:-no}}" in
+  y|yes|true)
+    kube::util::trap_add "signal_handler INT" INT
+    kube::util::trap_add "signal_handler TERM" TERM
+    # Job control is needed to make the Ginkgo CLI and all workers run
+    # in their own process group.
+    set -m
+    ;;
+esac
+
 # The following invocation is fairly complex. Let's dump it to simplify
 # determining what the final options are. Enabled by default in CI
 # environments like Prow.
@@ -236,4 +279,8 @@ case "${GINKGO_SHOW_COMMAND:-${CI:-no}}" in y|yes|true) set -x ;; esac
  ${E2E_REPORT_DIR:+"--report-dir=${E2E_REPORT_DIR}"} \
  ${E2E_REPORT_PREFIX:+"--report-prefix=${E2E_REPORT_PREFIX}"} \
  "${suite_args[@]:+${suite_args[@]}}" \
-  "${@}"
+  "${@}" &
+
+set +x
+GINKGO_CLI_PID=$!
+wait "${GINKGO_CLI_PID}"
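
The `case` statement added above picks its value through nested defaults: an explicitly set GINKGO_PROGRESS_REPORT_ON_SIGTERM wins, otherwise CI is used, otherwise "no". The sketch below isolates that lookup so it can be tried by hand. The helper names `progress_report_setting` and `progress_report_enabled` are hypothetical and do not exist in hack/ginkgo-e2e.sh.

```shell
#!/usr/bin/env bash

# Resolve the effective setting: explicit variable first, then CI, then "no".
# Because of ":-", an empty value is treated the same as an unset one.
progress_report_setting() {
  echo "${GINKGO_PROGRESS_REPORT_ON_SIGTERM:-${CI:-no}}"
}

# Accept the same spellings as the case statement in the diff.
progress_report_enabled() {
  case "$(progress_report_setting)" in
    y|yes|true) return 0 ;;
    *) return 1 ;;
  esac
}

if progress_report_enabled; then
  echo "signal forwarding enabled"
else
  echo "signal forwarding disabled"
fi
```

A manual run outside CI can therefore opt in with `GINKGO_PROGRESS_REPORT_ON_SIGTERM=yes`, and a CI job can opt out with `GINKGO_PROGRESS_REPORT_ON_SIGTERM=no`.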