Norman metrics' labels have too high a cardinality because they
include the object key as a label. This is a problem because
Prometheus produces a separate time series for every combination of
label values, which can have a large impact on performance and is
cautioned against in the Prometheus docs. Norman now uses lasso,
which provides matching metrics minus the object-key label, so the
Norman metrics will be removed in favor of the lasso metrics.
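For illustration only, a minimal sketch (not Norman's actual metric
definitions) of why a per-object-key label is costly: Prometheus keeps
one live time series for every distinct combination of label values it
has seen.

    package metricsexample

    import "github.com/prometheus/client_golang/prometheus"

    // Illustrative anti-pattern: a counter labelled by object key.
    var reconciles = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "example_reconcile_total",
            Help: "Reconciles per object key (illustration only).",
        },
        []string{"key"},
    )

    func observe(keys []string) {
        for _, key := range keys {
            // Every previously unseen key creates a new time series
            // that lives for the rest of the process's lifetime.
            reconciles.WithLabelValues(key).Inc()
        }
    }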
Problem:
cluster_check relies on reflect, which is slow and causes a large
performance hit that is especially noticeable on restart.
Solution:
Add a new interface, ObjectClusterName, that objects can implement so
reflect isn't used for those types.
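A minimal sketch of the idea (the method name and the
clusterNameByReflection fallback are placeholders, not necessarily
Norman's real identifiers):

    package controller

    // Objects that know which cluster they belong to can implement
    // this interface so the cluster check can skip reflection.
    type ObjectClusterName interface {
        ObjClusterName() string
    }

    func objectClusterName(obj interface{}) string {
        // Fast path: the type reports its cluster name directly.
        if o, ok := obj.(ObjectClusterName); ok {
            return o.ObjClusterName()
        }
        // Slow path: fall back to the reflect-based lookup.
        return clusterNameByReflection(obj)
    }

    // clusterNameByReflection stands in for the existing reflection
    // logic (placeholder).
    func clusterNameByReflection(obj interface{}) string {
        return ""
    }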
To improve operability for services built on the Norman framework,
it's better to expose internal state in as much detail as possible.
This is just a starting point, but metrics showing which handlers run
most often, and which handler/key combinations fail most often, are
very useful for spotting where an operator has to dig in when
something goes wrong.
So this commit adds 2 metrics (sketched below):
1: handler execution total count
2: handler execution failure total count
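A hedged sketch of registering and updating the two counters (metric
and label names here are illustrative; the per-key label on the
failure counter is the kind of high-cardinality label discussed
above):

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    var (
        totalHandlerExecution = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "handler_execution_total",
                Help: "Total count of handler executions.",
            },
            []string{"name", "handler"},
        )
        totalHandlerFailure = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "handler_execution_failure_total",
                Help: "Total count of failed handler executions.",
            },
            []string{"name", "handler", "key"},
        )
    )

    func init() {
        prometheus.MustRegister(totalHandlerExecution, totalHandlerFailure)
    }

    // Called from the controller's handler loop after each execution.
    func IncTotalHandlerExecution(controller, handler string) {
        totalHandlerExecution.WithLabelValues(controller, handler).Inc()
    }

    func IncTotalHandlerFailure(controller, handler, key string) {
        totalHandlerFailure.WithLabelValues(controller, handler, key).Inc()
    }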
Norman Conditions automatically generate condition information based
on the error the handler returns in the Condition.Do function.
Handler functions usually return two kinds of error: errors that can
be ignored and errors that cannot. With the current implementation,
Condition.Do generates a condition with an error state even if the
handler returns an error that can be ignored. Errors that can be
ignored should be ignored in the context of conditions as well.
So this commit introduces a new Reason field on ForgetError so that
the developer can provide a special reason other than Error when a
ForgetError is expected to happen as part of a normal procedure, such
as provisioning and waiting for something, and Norman Conditions
respect this field when generating condition information from the
error.
This change will help us fix this Rancher bug:
https://github.com/rancher/rancher/issues/15907
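A hedged sketch of the shape of the change (simplified; the actual
condition wiring in Norman is more involved):

    package controller

    import "errors"

    // ForgetError tells the controller to drop the key from the
    // queue; the new Reason field lets a handler say why, so the
    // condition can report e.g. "Provisioning" instead of an error
    // state.
    type ForgetError struct {
        Err    error
        Reason string
    }

    func (f *ForgetError) Error() string {
        return f.Err.Error()
    }

    // reasonFor is a simplified stand-in for what Condition.Do does
    // when turning a handler error into condition information.
    func reasonFor(err error) string {
        var fe *ForgetError
        if errors.As(err, &fe) && fe.Reason != "" {
            return fe.Reason
        }
        return "Error"
    }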
With the built-in Prometheus metrics provider, the Kubernetes
machinery doesn't deregister metrics when controllers are removed. So
over time, as things like clusters are created and removed, the
metrics are never cleaned up. The metric types for the cache and the
queue are also very large; they can take ~1GB of RAM in a 100-cluster
setup. Also, Rancher does not expose these stats, so they are
unobservable.
Some workload controllers need to watch resources in the management
plane and react to them, but they should only react to resources that
correspond to their own cluster. This adds framework support for
that.
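A hedged sketch of one way such filtering can look (the handler
signature and names here are simplified, not Norman's exact API):

    package controller

    // HandlerFunc is a simplified handler signature for this sketch.
    type HandlerFunc func(key string, obj interface{}) error

    // clusterNamed matches objects that can report the cluster they
    // belong to (see the ObjectClusterName interface above).
    type clusterNamed interface {
        ObjClusterName() string
    }

    // ForCluster wraps a handler so a workload controller watching
    // management-plane resources only reacts to objects from its own
    // cluster.
    func ForCluster(clusterName string, handler HandlerFunc) HandlerFunc {
        return func(key string, obj interface{}) error {
            if o, ok := obj.(clusterNamed); ok && o.ObjClusterName() != clusterName {
                // Not this cluster's object: skip without error so
                // the key isn't requeued.
                return nil
            }
            return handler(key, obj)
        }
    }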