Commit Graph

57 Commits

Author SHA1 Message Date
Clayton Coleman
6b9a381185
kubelet: Force deleted pods can fail to move out of terminating
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or cleanup will occur
until the kubelet is restarted.

The pod worker now maintains a store of the pods state that it is
attempting to reconcile and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.

The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.

As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.

In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod to prevent
accidental invocations of runtime only pod data.

Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:

- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker

Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those not already
terminal or those which have been previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.

Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.

Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To better report, add the following metrics:

  kubelet_desired_pods: Pods the pod manager sees
  kubelet_active_pods: "Admitted" pods that gate new pods
  kubelet_mirror_pods: Mirror pods the kubelet is tracking
  kubelet_working_pods: Breakdown of pods from the last sync in
    each phase, orphaned state, and static or not
  kubelet_restarted_pods_total: A counter for pods that saw a
    CREATE before the previous pod with the same UID was finished
  kubelet_orphaned_runtime_pods_total: A counter for pods detected
    at runtime that were not known to the kubelet. Will be
    populated at Kubelet startup and should never be incremented
    after.

Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.

Adds 23 series to the kubelet /metrics endpoint.
2023-03-08 22:03:51 -06:00
David Porter
c5a1f0188b
test: Add node e2e test to verify static pod termination
Add node e2e test to verify that static pods can be started after a
previous static pod with the same config temporarily failed termination.

The scenario is:

1. Static pod is started
2. Static pod is deleted
3. Static pod termination fails (internally `syncTerminatedPod` fails)
4. At later time, pod termination should succeed
5. New static pod with the same config is (re)-added
6. New static pod is expected to start successfully

To repro this scenario, setup a pod using a NFS mount. The NFS server is
stopped which will result in volumes failing to unmount and
`syncTerminatedPod` to fail. The NFS server is later started, allowing
the volume to unmount successfully.

xref:

1. https://github.com/kubernetes/kubernetes/pull/113145#issuecomment-1289587988
2. https://github.com/kubernetes/kubernetes/pull/113065
3. https://github.com/kubernetes/kubernetes/pull/113093

Signed-off-by: David Porter <david@porter.me>
2023-03-03 10:00:48 -06:00
Gunju Kim
f690a0ce41
Fix createStaticPod to not use container.RestartPolicy 2023-02-23 21:18:24 +09:00
Patrick Ohly
136f89dfc5 e2e: use error wrapping with %w
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).

Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with

    sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)

This may be unnecessary in some cases, but it's not wrong.
2023-02-06 15:39:13 +01:00
Antonio Ojea
7f5ae1c0c1
Revert "e2e: wait for pods with gomega" 2023-02-06 12:08:22 +01:00
Patrick Ohly
222f655062 e2e: use error wrapping with %w
The recently introduced failure handling in ExpectNoError depends on error
wrapping: if an error prefix gets added with `fmt.Errorf("foo: %v", err)`, then
ExpectNoError cannot detect that the root cause is an assertion failure and
then will add another useless "unexpected error" prefix and will not dump the
additional failure information (currently the backtrace inside the E2E
framework).

Instead of manually deciding on a case-by-case basis where %w is needed, all
error wrapping was updated automatically with

    sed -i "s/fmt.Errorf\(.*\): '*\(%s\|%v\)'*\",\(.* err)\)/fmt.Errorf\1: %w\",\3/" $(git grep -l 'fmt.Errorf' test/e2e*)

This may be unnecessary in some cases, but it's not wrong.
2023-01-31 13:01:39 +01:00
Patrick Ohly
2f6c4f5eab e2e: use Ginkgo context
All code must use the context from Ginkgo when doing API calls or polling for a
change, otherwise the code would not return immediately when the test gets
aborted.
2022-12-16 20:14:04 +01:00
Patrick Ohly
df5d84ae81 e2e: accept context from Ginkgo
Every ginkgo callback should return immediately when a timeout occurs or the
test run manually gets aborted with CTRL-C. To do that, they must take a ctx
parameter and pass it through to all code which might block.

This is a first automated step towards that: the additional parameter got added
with

    sed -i 's/\(framework.ConformanceIt\|ginkgo.It\)\(.*\)func() {$/\1\2func(ctx context.Context) {/' \
        $(git grep -l -e framework.ConformanceIt -e ginkgo.It )
    $GOPATH/bin/goimports -w $(git status | grep modified: | sed -e 's/.* //')

log_test.go was left unchanged.
2022-12-10 19:50:18 +01:00
Dave Chen
857458cfa5 update ginkgo from v1 to v2 and gomega to 1.19.0
- update all the import statements
- run hack/pin-dependency.sh to change pinned dependency versions
- run hack/update-vendor.sh to update go.mod files and the vendor directory
- update the method signatures for custom reporters

Signed-off-by: Dave Chen <dave.chen@arm.com>
2022-07-08 10:44:46 +08:00
Sergiusz Urbaniak
373c08e0c7
test/e2e/framework: configure pod security admission level for e2e tests 2022-03-28 15:42:10 +02:00
Elana Hashman
47086a6623
Add test for recreating a static pod 2021-09-15 14:01:48 -04:00
wojtekt
a74737eb03 Mark remaining e2e_node tests with [sig-*] label 2021-02-23 20:11:09 +01:00
Derek Carr
d2c78b6589 Verify running mirror pod has running containers 2020-08-25 12:23:24 -04:00
Jefftree
ace97738e2 Update formatting of conformance comment 2020-07-29 20:50:44 -07:00
Mike Danese
aaf855c1e6 deref all calls to metav1.NewDeleteOptions that are passed to clients.
This is gross but because NewDeleteOptions is used by various parts of
storage that still pass around pointers, the return type can't be
changed without significant refactoring within the apiserver. I think
this would be good to cleanup, but I want to minimize apiserver side
changes as much as possible in the client signature refactor.
2020-03-05 14:59:46 -08:00
Mike Danese
3aa59f7f30 generated: run refactor 2020-02-07 18:16:47 -08:00
danielqsj
6596a14d39 add missing alias of api errors under test 2019-12-26 17:29:38 +08:00
SataQiu
d2bdf89a8b fix golint issues in test/e2e_node 2019-11-26 16:26:55 +08:00
Tim Allclair
62e7d197e3 Add mirror pod e2e test 2019-10-29 16:14:06 -07:00
draveness
aeadd793cb feat: update multiple files in e2e node with framework helpers 2019-08-02 14:39:05 +08:00
SataQiu
641d330f89 e2e_node: clean up non-recommended import 2019-07-28 12:49:36 +08:00
Vu Cong Tuan
c747b7f38d Fix many typos in both code and comments
Signed-off-by: Vu Cong Tuan <tuanvc@vn.fujitsu.com>
2019-02-27 14:41:02 +07:00
Aaron Crickenberger
d724e979cd Remove [Conformance] from tests in e2e_node
None of these tests actually run as part of e2e testing, which is
the only way conformance tests are kicked off. They should not be
included as part of the conformance suite unless they live in
test/e2e/common
2018-08-07 10:43:59 -07:00
Manjunath A Kumatagi
1f7f33aaa4 Update the nginx image from hub.docker.com 2018-08-04 05:19:53 +05:30
Srini Brahmaroutu
dcb7bc313f Adding details to Conformance Tests using RFC 2119 standards. 2018-07-31 17:21:18 -07:00
Yu-Ju Hong
4ad9aedb04 test/e2e_node: Add [NodeConformance] to tests tagged [Conformance]
This has no effect yet until test configurations are updated.
2018-05-21 17:51:49 -07:00
Manjunath A Kumatagi
1bb810e749 Use pause manifest image 2018-04-06 11:00:50 +05:30
Michael Taufen
b4bddcc998 expunge the word 'manifest' from Kubelet's config API
The word 'manifest' technically refers to a container-group specification
that predated the Pod abstraction. We should avoid using this legacy
terminology where possible. Fortunately, the Kubelet's config API will
be beta in 1.10 for the first time, so we still had the chance to make
this change.

I left the flags alone, since they're deprecated anyway.

I changed a few var names in files I touched too, but this PR is the
just the first shot, not the whole campaign
(`git grep -i manifest | wc -l -> 1248`).
2018-02-23 11:44:06 -08:00
jianglingxia
76e90061a2 reopen #58913 Fix TODO move GetPauseImageNameForHostArch func 2018-01-31 15:06:32 +08:00
xiangpengzhao
7fdea2b0cf Use framework.ConformanceIt for node e2e conformance tests 2017-11-17 17:28:20 +08:00
Kevin
4c8539cece use core client with explicit version globally 2017-10-27 15:48:32 +08:00
Manjunath A Kumatagi
ee4d54c70c Port e2e tests for multi architecture 2017-09-01 05:40:52 +05:30
Jacob Simpson
29c1b81d4c Scripted migration from clientset_generated to client-go. 2017-07-17 15:05:37 -07:00
Chao Xu
60604f8818 run hack/update-all 2017-06-22 11:31:03 -07:00
Chao Xu
f4989a45a5 run root-rewrite-v1-..., compile 2017-06-22 10:25:57 -07:00
Dr. Stefan Schimanski
d7eb3b6870 pkg/util: move uuid and strategicpatch into k8s.io/apimachinery 2017-01-25 19:45:09 +01:00
Clayton Coleman
be6d2933df
refactor: Move *Options references to metav1 2017-01-24 13:41:51 -05:00
deads2k
77b4d55982 mechanical 2017-01-16 09:35:12 -05:00
NickrenREN
a12dea14e0 fix redundant alias clientset 2017-01-12 10:21:05 +08:00
deads2k
6a4d5cd7cc start the apimachinery repo 2017-01-11 09:09:48 -05:00
Chao Xu
03d8820edc rename /release_1_5 to /clientset 2016-12-14 12:39:48 -08:00
Wojciech Tyczynski
a9ec31209e GetOptions - fix tests 2016-12-09 09:42:01 +01:00
Chao Xu
29400ac195 test/e2e_node 2016-11-23 15:53:09 -08:00
Random-Liu
090809d8ad Remove dependency on kubelet related flags. 2016-11-17 10:17:32 -08:00
Random-Liu
f4aee8664d Mark more conformance tests. 2016-11-05 21:11:51 -07:00
Jan Chaloupka
4fde09d308 Replace client with clientset in code 2016-10-23 22:00:35 +02:00
Random-Liu
ed411c9042 Add image white list, images in white list will be prepulled, and
only images in white list could be used in the test. Currently only
enabled in node e2e test.
2016-09-19 14:39:23 -07:00
Random-Liu
3910a66bb5 Add run-services-mode option, and start e2e services in a separate
process.
2016-08-15 14:45:01 -07:00
Harry Zhang
c495397cae Refactor uuid into its own pkg 2016-07-30 00:07:02 -04:00
Random-Liu
9d48c76361 Make the node e2e test run in parallel. 2016-07-29 16:40:59 -07:00