Commit Graph

107 Commits

Author SHA1 Message Date
David Ashpole
3813ed1ef7 fix prometheus-to-sd image for fluentbit 2021-05-27 10:54:10 -07:00
David Ashpole
febf9d9366 update event-exporter and prometheus-to-sd versions in cluster addons 2021-05-13 11:40:41 -07:00
Yu-Ju Hong
bcd975aa65 Replace Beta OS/arch labels with the GA ones
Beta OS/arch labels have been deprecated since 1.14.
This change replaces these labels with the GA ones.
2020-02-13 09:38:51 -08:00
draveness
495faa22db feat: cleanup pod critical pod annotations feature 2019-08-09 08:41:23 +08:00
draveness
d83526d253 Revert "feat: cleanup pod critical pod annotations feature"
This reverts commit b6d41ee5cc.
2019-07-18 13:31:12 +08:00
draveness
b6d41ee5cc feat: cleanup pod critical pod annotations feature 2019-07-11 08:54:19 +08:00
Marek Siarkowicz
9e9b906047 Update gcp images with security patches
[stackdriver addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[fluentd-gcp addon] Bump fluentd-gcp-scaler to v0.5.1 to pick up security fixes.
[fluentd-gcp addon] Bump event-exporter to v0.2.4 to pick up security fixes.
[fluentd-gcp addon] Bump prometheus-to-sd to v0.5.0 to pick up security fixes.
[metatada-proxy addon] Bump prometheus-to-sd v0.5.0 to pick up security fixes.
2019-03-15 09:24:32 +01:00
Kubernetes Prow Robot
45e5f6053b
Merge pull request #74424 from liggitt/drop-k8s-io-node-labels
Clean up self-set node labels
2019-03-06 08:24:26 -08:00
Jordan Liggitt
8975233788 Finish migration of fluentd to daemonset 2019-02-26 11:42:23 -05:00
Florent Delannoy
e627474e8f Fix fluentd-gcp addon liveness probe
Fix three issues with the fluentd-gcp liveness probe:

h1. STUCK_THRESHOLD_SECONDS was overridden by LIVENESS_THRESHOLD_SECONDS
if defined

Probably a copy/paste issue introduced in edf1ffc074

h1. `[[` is [a bashism](https://stackoverflow.com/a/47576482), and will always failed when called with `/bin/sh`

Introduced by a844523c20

Given that we call the liveness probe with `/bin/sh`, we cannot use the
double-bracketed `[[` syntax for test, as it is not POSIX-compliant and
will throw an error.

Annoyingly, even through it prints an error, `sh` returns with exit code 0
in this case:

```bash
root@fluentd-7mprs:/# sh liveness.sh
liveness.sh: 8: liveness.sh: [[: not found
liveness.sh: 15: liveness.sh: [[: not found
root@fluentd-7mprs:/# echo $?
0
```

Which means the liveness probe is considered successful by Kubernetes,
despite failing to test things as it was intended. This is also
probably the reason why this bug wasn't reported sooner :)

Thankfully, the test in this case can just as easily be written as
POSIX-compliant as it doesn't use any bash-specific features within the
`[[` block.

h1. Buffers are transient and cannot be relied upon for monitoring

Finally, after fixing the above issue, we started seeing the fluentd
containers being restarted very often, and found an issue with the
underlying logic of the liveness probe.

The probe checks that the pod is still alive by running the following
command:

`find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit`

This checks if any _regular_ file exists under `/var/log/fluentd-buffers`
that is more recent than a predetermined time, and will return an empty
string otherwise.

The issue is that these buffers are temporary and volatile, they get created and
deleted constantly. Here is an example of running that check every second on a
running fluentd:

```
root@fluentd-eks-playground-jdc8m:/# LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300};
root@fluentd-eks-playground-jdc8m:/# STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900};
root@fluentd-eks-playground-jdc8m:/# touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck;
root@fluentd-eks-playground-jdc8m:/# touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness;
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type f -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:52:57 UTC 2019
Fri Feb 22 10:52:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:52:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964ccf4c7004103c3fa7c8533f85.log
Fri Feb 22 10:53:00 UTC 2019
Fri Feb 22 10:53:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827964fb8b2eedcccd2763ea7775cc2.log
Fri Feb 22 10:53:03 UTC 2019
Fri Feb 22 10:53:04 UTC 2019
Fri Feb 22 10:53:05 UTC 2019
Fri Feb 22 10:53:06 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:07 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:08 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer/buffer.b5827965564883997b673d703af54848b.log
Fri Feb 22 10:53:09 UTC 2019
Fri Feb 22 10:53:10 UTC 2019
Fri Feb 22 10:53:11 UTC 2019
Fri Feb 22 10:53:12 UTC 2019
Fri Feb 22 10:53:13 UTC 2019
Fri Feb 22 10:53:14 UTC 2019
Fri Feb 22 10:53:15 UTC 2019
Fri Feb 22 10:53:16 UTC 2019
```

We can see buffers being created, then disappearing. The LivenessProbe running
under these conditions has a ~50% chance of failing, despite fluentd being
perfectly happy.

I believe that check is probably ok for fluentd installs using large
amounts of buffers, in which case the liveness probe will be correct more
often than not, but fluentd installs that use buffering less intensively
will be negatively impacted by this.

My solution to fix this is to check the last updated time of buffering
_folders_ within `/var/log/fluentd_buffers`. These _do_ get updated when
buffers are created, and do not get deleted as buffers are emptied,
making them the perfect candidate for our use.

Here's an example with the `-d` flag for directories:
```
root@fluentd-eks-playground-jdc8m:/# while true; do date ; find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit ; sleep 1 ; done
Fri Feb 22 10:57:51 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:52 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:53 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:54 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:55 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:56 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:57 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:58 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:57:59 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:00 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:01 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:02 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
Fri Feb 22 10:58:03 UTC 2019
/var/log/fluentd-buffers/kubernetes.system.buffer
```

And example of the directory being updated as new buffers come in:
```
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:17 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 16K
drwxr-xr-x 2 root root  224 Feb 22 11:18 .
drwxr-xr-x 3 root root   38 Feb 22 11:14 ..
-rw-r--r-- 1 root root 1.8K Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log
-rw-r--r-- 1 root root  215 Feb 22 11:18 buffer.b58279be6e21e8b29fc333a7d50096ed0.log.meta
-rw-r--r-- 1 root root  429 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log
-rw-r--r-- 1 root root  195 Feb 22 11:18 buffer.b58279be6f09bdfe047a96486a525ece2.log.meta
root@fluentd-eks-playground-jdc8m:/# ls -lah /var/log/fluentd-buffers/kubernetes.system.buffer
total 0
drwxr-xr-x 2 root root  6 Feb 22 11:18 .
drwxr-xr-x 3 root root 38 Feb 22 11:14 ..
```
2019-02-25 11:48:31 +00:00
Yu-Ju Hong
9c892243f6 GCE: update addon DaemonSets to select node OS
These DaemonSets supports only Linux today, so this change updates the
specs to reflect this limitation. The labels have recently been promoted
to GA. Using the beta labels for now until node-master version skew
problem no longer exists.
2019-01-23 09:01:40 -08:00
Jordan Liggitt
cc680273e8 Change add-on manifests to apps/v1 2018-12-19 17:30:59 -05:00
Ling Huang
85d8b5069b Add tolerations for Stackdriver Logging and Metadata Agents. 2018-10-12 11:15:33 -04:00
Ling Huang
d8da1baf48 Enable insertId generation, update Stackdriver Logging Agent image to 0.5-1.5.36-1-k8s and add priorityClassName for Metadata Agent. 2018-10-09 13:42:40 -04:00
Kubernetes Submit Queue
e2d6362c09
Merge pull request #67691 from loburm/security_fixes
Automatic merge from submit-queue (batch tested with PRs 67691, 68147). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

Bump versions of components with latest security patches.

**What this PR does / why we need it**:
Upgrade versions of monitoring components used on GCP, to include latest security patches.

**Release note**:
```release-note
[fluentd-gcp-scaler addon] Bump fluentd-gcp-scaler to 0.4 to pick up security fixes.
[prometheus-to-sd addon] Bump prometheus-to-sd to 0.3.1 to pick up security fixes, bug fixes and new features.
[event-exporter addon] Bump event-exporter to 0.2.3 to pick up security fixes.
```
2018-09-05 09:49:31 -07:00
Arnold Szederjesi
fcdef3ffcc Put fluentd back to host network 2018-08-30 10:44:04 +02:00
Marian Lobur
ffa934a939 Bump versions of components with latest security patches. 2018-08-22 11:27:36 +02:00
Bryan Moyles
32c2bfadfd A large set of improvements to the Stackdriver components.
Metadata Agent Improvements
Bump metadata agent version to 0.2-0.0.21-1.
Expand the metadata agent's access to all API groups.
Remove metadata agent config maps in favor of command line flags.
Update the metadata agent's liveness probe to a new /healthz handler.

Logging Agent Improvements
Bump logging agent version to 0.2-1.5.33-1-k8s-1.
Appropriately set log severity for k8s_container.
Fix detect exceptions plugin to analyze message field instead of log field.
Fix detect exceptions plugin to analyze streams based on local resource id.
Disable the metadata agent for monitored resource construction in logging.
Disable timestamp adjustment in logs to optimize performance.
Reduce logging agent buffer chunk limit to 512k to optimize performance.
2018-08-06 11:26:35 -04:00
Daniel Kłobuszewski
7773f8f5eb Increase fluentd-gcp grace termination period to 1min
By default, all pods have 30s for graceful termination. This gives fluentd additional 30s to export logs when the node is shutting down.
2018-06-14 10:44:13 +02:00
Bryan Moyles
a0a7686e38 Use the logging agent's node name as the metadata agent URL. 2018-05-02 10:12:35 +02:00
Ling Huang
cbec62ada4 Add support to ingest log entries to Stackdriver against new "k8s_container" and "k8s_node" resources. 2018-04-06 08:47:19 -04:00
Mik Vyatskov
d6cef02a9d
Revert "Enable partial success in fluentd-gcp" 2018-03-29 11:48:01 +02:00
Mik Vyatskov
c8773044ea Enable partial success in fluentd-gcp
Signed-off-by: Mik Vyatskov <vmik@google.com>
2018-03-27 15:51:16 +02:00
Shyam Jeedigunta
123fa5c706 Revert "Increase fluentd rolling-upgrade maxUnavailable to large value"
This reverts commit 7dd6adc438.
2018-03-26 15:17:54 +02:00
Shyam Jeedigunta
7dd6adc438 Increase fluentd rolling-upgrade maxUnavailable to large value 2018-03-22 12:33:42 +01:00
Kubernetes Submit Queue
7e063329f3
Merge pull request #60722 from filbranden/fluentd1
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Remove mapping to /host/lib from fluentd-gcp container.

**What this PR does / why we need it**:

This mapping is no longer needed since fluentd-gcp v2.0.16, in which it started using a container image based on Debian Stretch, in which the systemd libraries already include support for all the supported
compression algorithms.

The `/run.sh` in the image no longer accesses `/host/lib` anyways, so let's stop mapping it here.

Related changes:
- fluentd-gcp on GoogleCloudPlatform/k8s-stackdriver#101
- fluentd-es on GoogleCloudPlatform/google-fluentd#80

/assign @timstclair 
/cc @crassirostris @bmoyles0117 

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
N/A

**Special notes for your reviewer**:
N/A

**Release note**:

```release-note
NONE
```
2018-03-16 03:38:28 -07:00
Bryan Moyles
a844523c20 Find most recent modified date for fluentd buffers recursively.
Due to updates in Fluent v0.14, the buffers directory modified date is no
longer updated when files inside the directory are changed. Therefore we
must find the most recent modified date recursively to fix liveness probe.
2018-03-12 15:28:55 -04:00
Filipe Brandenburger
cea4c98508 Remove mapping to /host/lib from fluentd-gcp container.
This mapping is no longer needed since fluentd-gcp v2.0.16, in which it
started using a container image based on Debian Stretch, in which the
systemd libraries already include support for all the supported
compression algorithms.

The /run.sh in the image no longer accesses /host/lib anyways, so let's
stop mapping it here.

Related changes:
- fluentd-gcp on GoogleCloudPlatform/k8s-stackdriver#101
- fluentd-es on GoogleCloudPlatform/google-fluentd#80
2018-03-02 10:20:08 -08:00
Bryan Moyles
84a86cffce Update to use Stackdriver Agent image.
Prometheus is enabled by default.
2018-02-26 14:05:33 -05:00
Daniel Kłobuszewski
a88ddac1e4 use prometheus-to-sd 0.2.4 and fluentd-gcp-image 2.0.16 2018-02-16 09:16:59 +01:00
Kubernetes Submit Queue
d3bacb914c
Merge pull request #59657 from x13n/manual-fluentd-gcp-scaler
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Enable scaling fluentd-gcp resources using ScalingPolicy.

See https://github.com/justinsb/scaler for more details about ScalingPolicy resource.

**What this PR does / why we need it**:
This is adding a way to override fluentd-gcp resources in a running cluster. The resources syncing for fluentd-gcp is decoupled from addon manager.

**Special notes for your reviewer**:

**Release note**:
```release-note
fluentd-gcp resources can be modified via a ScalingPolicy
```

cc @kawych @justinsb
2018-02-15 03:42:14 -08:00
Kubernetes Submit Queue
bc9c6df31d
Merge pull request #59103 from Random-Liu/upload-container-runtime-log
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Upload container runtime log to sd/es.

I've verified this in my environment. My stackdriver has an extra `container-runtime` entry for node log, and it collects container runtime daemon log correctly.

@yujuhong @feiskyer @crassirostris @piosz 
@kubernetes/sig-node-pr-reviews @kubernetes/sig-instrumentation-pr-reviews 
Signed-off-by: Lantao Liu <lantaol@google.com>

**Release note**:

```release-note
Container runtime daemon (e.g. dockerd) logs in GCE cluster will be uploaded to stackdriver and elasticsearch with tag `container-runtime`
```
2018-02-14 03:33:21 -08:00
Lantao Liu
8d920d095c Upload container runtime log to sd/es.
Signed-off-by: Lantao Liu <lantaol@google.com>
2018-02-13 18:25:02 +00:00
Kubernetes Submit Queue
7ef11bd964
Merge pull request #59237 from tanshanshan/addons1
Automatic merge from submit-queue (batch tested with PRs 59767, 56454, 59237, 59730, 55479). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Change critical pods’ template to use priority

**What this PR does / why we need it**:
Change critical pods’ template to use priority
Thanks.
**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
ref #57471

**Special notes for your reviewer**:

**Release note**:

```release-note

```
2018-02-12 15:44:36 -08:00
Daniel Kłobuszewski
2eb24f9ae1 Enable scaling fluentd-gcp resources using ScalingPolicy.
See https://github.com/justinsb/scaler for more details about ScalingPolicy resource.
2018-02-09 14:33:33 +01:00
tanshanshan
95b2b94b1b Change critical pods’ template to use priority 2018-02-08 15:06:27 +08:00
Tim Hockin
3586986416 Switch to k8s.gcr.io vanity domain
This is the 2nd attempt.  The previous was reverted while we figured out
the regional mirrors (oops).

New plan: k8s.gcr.io is a read-only facade that auto-detects your source
region (us, eu, or asia for now) and pulls from the closest.  To publish
an image, push k8s-staging.gcr.io and it will be synced to the regionals
automatically (similar to today).  For now the staging is an alias to
gcr.io/google_containers (the legacy URL).

When we move off of google-owned projects (working on it), then we just
do a one-time sync, and change the google-internal config, and nobody
outside should notice.

We can, in parallel, change the auto-sync into a manual sync - send a PR
to "promote" something from staging, and a bot activates it.  Nice and
visible, easy to keep track of.
2018-02-07 21:14:19 -08:00
Ross Light
6831581f1c Bump fluentd-gcp version 2018-01-12 10:16:13 -08:00
Daniel Kłobuszewski
dca74f17fd
Bump fluentd-gcp image used to 2.0.13 2018-01-08 14:54:26 +01:00
Daniel Kłobuszewski
2eded687be
Bump fluentd-gcp version 2018-01-03 11:46:13 +01:00
Tim Hockin
e9dd8a68f6 Revert k8s.gcr.io vanity domain
This reverts commit eba5b6092a.

Fixes https://github.com/kubernetes/kubernetes/issues/57526
2017-12-22 14:36:16 -08:00
Tim Hockin
eba5b6092a Use k8s.gcr.io vanity domain for container images 2017-12-18 09:18:34 -08:00
Daniel Kłobuszewski
d2cbc37c05
Bump fluentd-gcp version 2017-12-07 14:23:05 +01:00
Rohit Agarwal
ad05928c6e Add wildcard tolerations to kube-proxy.
fluend-gcp already has these tolerations. kube-proxy when it runs as a
static pod gets wildcard `NoExecute` toleration (all static pods get
that). So, added the same toleration to kube-proxy when it runs as a
daemonset. Also added wildcard `NoSchedule` toleration to kube-proxy.
2017-11-29 12:36:58 -08:00
Mik Vyatskov
e9322b929c Fix setting resources in fluentd-gcp plugin
Signed-off-by: Mik Vyatskov <vmik@google.com>
2017-11-22 12:40:50 +01:00
Lantao Liu
53d7494b9e Fix CRI fluentd config.
Signed-off-by: Lantao Liu <lantaol@google.com>
2017-11-10 20:55:56 +00:00
Lantao Liu
70a0cdfa8e Add CRI log format support in fluentd. 2017-10-30 06:25:52 +00:00
Kubernetes Submit Queue
949ec719c3
Merge pull request #54635 from loburm/prom-to-sd
Automatic merge from submit-queue (batch tested with PRs 54635, 54250, 54657, 54696, 54700). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Bump version of prometheus-to-sd to 0.2.2.

Bump version of prometheus-to-sd to improve logging, add pod_name and
pod_namespace flags and remove deprecated flags.

Fixes #54583 

```release-note
NONE
```
2017-10-27 14:38:21 -07:00
Kubernetes Submit Queue
fc8bfe2d89 Merge pull request #54395 from crassirostris/fluentd-gcp-rollback-host-networking
Automatic merge from submit-queue (batch tested with PRs 50776, 54395). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

Move fluentd-gcp out of host network

Since metadata proxy doesn't filter service account after all, make fluentd-gcp addon run in its own network

This will mitigate the problem with port collision

```release-note
[fluentd-gcp addon] Fluentd now runs in its own network, not in the host one.
```
2017-10-27 03:09:25 -07:00
Marian Lobur
5b62eb29d2 Bump version of prometheus-to-sd to 0.2.2.
Bump version of prometheus-to-sd to improve logging, add pod_name and
pod_namespace flags and remove deprecated flags.
2017-10-26 15:54:54 +02:00