build-runtime-config was being called in verify-prereqs, which didn't
match how GCE called it, and didn't seem to actually work.
Instead, call it just before the master configuration is built. Also
call it just before the node configuration is built, even though the
nodes don't _currently_ require the runtime_config.
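Roughly, the intended call order looks like the sketch below. Only
build-runtime-config and verify-prereqs are names taken from the change
itself; the write-*-env functions stand in for whatever builds the
master and node configuration in a given provider's scripts.

    # Sketch of the intended ordering; write-*-env are stand-in names.
    function kube-up() {
      verify-prereqs            # no longer calls build-runtime-config

      build-runtime-config      # resolve the runtime config right before use
      write-master-env          # stand-in: builds the master configuration

      build-runtime-config      # nodes don't _currently_ need runtime_config,
      write-node-env            # stand-in: builds the node configuration
    }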
To fix the error shown below, I added an openssl dependency to the
"generate-cert" state. This should work on Debian-like and RedHat-like
systems (as well as Arch Linux, openSUSE, etc.).
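At the shell level, the new dependency amounts to something like the
sketch below. The package-manager detection, certificate subject, and
validity period are illustrative; the actual change is just a require
on the openssl package in the "generate-cert" Salt state.

    # Ensure openssl is present before the cert script runs.
    if command -v apt-get >/dev/null 2>&1; then
      apt-get install -y openssl        # Debian-like
    elif command -v yum >/dev/null 2>&1; then
      yum install -y openssl            # RedHat-like
    fi

    # generate-cert can then produce the server key/cert, e.g.:
    openssl req -new -newkey rsa:4096 -x509 -nodes -days 365 \
        -keyout /srv/kubernetes/server.key \
        -out /srv/kubernetes/server.cert \
        -subj "/CN=kubernetes-master"   # subject is illustrative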
The error before the fix:
$ sudo salt 'kubernetes-master' state.apply
----------
          ID: kubernetes-cert
    Function: cmd.script
      Result: False
     Comment: Command 'kubernetes-cert' run
     Started: 06:57:06.634203
    Duration: 208.719 ms
     Changes:
              ----------
              pid:
                  793
              retcode:
                  1
              stderr:
                  /tmpm24T3R.sh: line 22: openssl: command not found
                  chgrp: cannot access '/srv/kubernetes/server.key': No such file or directory
                  chgrp: cannot access '/srv/kubernetes/server.cert': No such file or directory
                  chmod: cannot access '/srv/kubernetes/server.key': No such file or directory
                  chmod: cannot access '/srv/kubernetes/server.cert': No such file or directory
              stdout:
After applying my patch (success):
----------
          ID: kubernetes-cert
    Function: cmd.script
      Result: True
     Comment: Command 'kubernetes-cert' run
     Started: 07:17:04.172384
    Duration: 1041.092 ms
     Changes:
              ----------
              pid:
                  1045
              retcode:
                  0
              stderr:
                  Generating a 4096 bit RSA private key
                  ......................................................................++
                  ...............................................................................++
                  writing new private key to '/srv/kubernetes/server.key'
                  -----
              stdout:
----------
According to AWS, the ELB healthy threshold is "Number of consecutive health check successes before declaring an EC2 instance healthy." It has an unusual interaction with Kubernetes, since all nodes will enter either an unhealthy state or a healthy state together depending on the service's healthiness as a whole.
We have observed that if our service goes down for the unhealthy threshold (2 checks at 30-second intervals = 60 seconds), the ELB stops serving traffic to all nodes in the cluster, and then waits for the healthy threshold (currently 10 * 30 = 300 seconds) AFTER the service is restored before adding the cluster nodes back, so the service remains unreachable for an extra 300 seconds.
With the new settings, the ELB will still time out dead nodes after 60 seconds, but will restore healthy nodes after 20 seconds. The minimum value for healthyThreshold is 2, and the minimum value for interval is 5 seconds. I went with 10 seconds instead of the minimum somewhat arbitrarily, because I was not sure how much this value might affect the scalability of clusters in EC2, as it does put some extra load on the kube-proxy.
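For reference, the resulting health check can be expressed with the AWS
CLI roughly as below. The load balancer name, target, and timeout are
illustrative, and an unhealthy threshold of 6 at a 10-second interval is
what keeps the 60-second detection window.

    aws elb configure-health-check \
        --load-balancer-name my-kube-service-elb \
        --health-check Target=TCP:30000,Interval=10,Timeout=5,UnhealthyThreshold=6,HealthyThreshold=2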