docs: Add tracing proposals doc

Create a document summarising the tracing design proposals
from PR #1937.

Fixes: #2061.

Signed-off-by: bin <bin@hyper.sh>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
This commit is contained in:
James O. D. Hunt 2021-06-18 09:40:03 +01:00
parent 6b2ad64aea
commit 5b514177b0
3 changed files with 222 additions and 0 deletions

View File

@ -10,3 +10,7 @@ Kata Containers design documents:
- [Host cgroups](host-cgroups.md)
- [`Inotify` support](inotify.md)
- [Metrics(Kata 2.0)](kata-2-0-metrics.md)
---
- [Design proposals](proposals)

View File

@ -0,0 +1,5 @@
# Design proposals
Kata Containers design proposal documents:
- [Kata Containers tracing](tracing-proposals.md)

View File

@ -0,0 +1,213 @@
# Kata Tracing proposals
## Overview
This document summarises a set of proposals triggered by the
[tracing documentation PR][tracing-doc-pr].
## Required context
This section explains some terminology required to understand the proposals.
Further details can be found in the
[tracing documentation PR][tracing-doc-pr].
### Agent trace mode terminology
| Trace mode | Description | Use-case |
|-|-|-|
| Static | Trace agent from startup to shutdown | Entire lifespan |
| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |
### Agent trace type terminology
| Trace type | Description | Use-case |
|-|-|-|
| isolated | traces all relate to single component | Observing lifespan |
| collated | traces "grouped" (runtime+agent) | Understanding component interaction |
### Container lifespan
| Lifespan | trace mode | trace type |
|-|-|-|
| short-lived | static | collated if possible, else isolated? |
| long-running | dynamic | collated? (to see interactions) |
## Original plan for agent
- Implement all trace types and trace modes for agent.
- Why?
- Maximum flexibility.
> **Counterargument:**
>
> Due to the intrusive nature of adding tracing, we have
> learnt that landing small incremental changes is simpler and quicker!
- Compatibility with [Kata 1.x tracing][kata-1x-tracing].
> **Counterargument:**
>
> Agent tracing in Kata 1.x was extremely awkward to setup (to the extent
> that it's unclear how many users actually used it!)
>
> This point, coupled with the new architecture for Kata 2.x, suggests
> that we may not need to supply the same set of tracing features (in fact
> they may not make sense)).
## Agent tracing proposals
### Agent tracing proposal 1: Don't implement dynamic trace mode
- All tracing will be static.
- Why?
- Because dynamic tracing will always be "partial"
> In fact, not only would it be only a "snapshot" of activity, it may not
> even be possible to create a complete "trace transaction". If this is
> true, the trace output would be partial and would appear "unstructured".
### Agent tracing proposal 2: Simplify handling of trace type
- Agent tracing will be "isolated" by default.
- Agent tracing will be "collated" if runtime tracing is also enabled.
- Why?
- Offers a graceful fallback for agent tracing if runtime tracing disabled.
- Simpler code!
## Questions to ask yourself (part 1)
- Are your containers long-running or short-lived?
- Would you ever need to turn on tracing "briefly"?
- If "yes", is a "partial trace" useful or useless?
> Likely to be considered useless as it is a partial snapshot.
> Alternative tracing methods may be more appropriate to dynamic
> OpenTelemetry tracing.
## Questions to ask yourself (part 2)
- Are you happy to stop a container to enable tracing?
If "no", dynamic tracing may be required.
- Would you ever want to trace the agent and the runtime "in isolation" at the
same time?
- If "yes", we need to fully implement `trace_mode=isolated`
> This seems unlikely though.
## Trace collection
The second set of proposals affect the way traces are collected.
### Motivation
Currently:
- The runtime sends trace spans to Jaeger directly.
- The agent will send trace spans to the [`trace-forwarder`][trace-forwarder] component.
- The trace forwarder will send trace spans to Jaeger.
Kata agent tracing overview:
```
+-------------------------------------------+
| Host |
| |
| +-----------+ |
| | Trace | |
| | Collector | |
| +-----+-----+ |
| ^ +--------------+ |
| | spans | Kata VM | |
| +-----+-----+ | | |
| | Kata | spans | +-----+ | |
| | Trace |<-----------------|Kata | | |
| | Forwarder | VSOCK | |Agent| | |
| +-----------+ Channel | +-----+ | |
| +--------------+ |
+-------------------------------------------+
```
Currently:
- If agent tracing is enabled but the trace forwarder is not running,
the agent will error.
- If the trace forwarder is started but Jaeger is not running,
the trace forwarder will error.
### Goals
- The runtime and agent should:
- Use the same trace collection implementation.
- Use the most the common configuration items.
- Kata should should support more trace collection software or `SaaS`
(for example `Zipkin`, `datadog`).
- Trace collection should not block normal runtime/agent operations
(for example if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally).
### Trace collection proposals
#### Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy
Kata runtime/agent all send spans to trace forwarder, and the trace forwarder,
acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or `datadog`.
**Pros:**
- Runtime/agent will be simple.
- Could update trace collection target while Kata Containers are running.
**Cons:**
- Requires the trace forwarder component to be running (that is a pressure to operation).
#### Trace collection proposal 2: Send spans to collector directly from runtime/agent
Send spans to collector directly from runtime/agent, this proposal need
network accessible to the collector.
**Pros:**
- No additional trace forwarder component needed.
**Cons:**
- Need more code/configuration to support all trace collectors.
## Future work
- We could add dynamic and fully isolated tracing at a later stage,
if required.
## Further details
- See the new [GitHub project](https://github.com/orgs/kata-containers/projects/28).
- [kata-containers-tracing-status](https://gist.github.com/jodh-intel/0ee54d41d2a803ba761e166136b42277) gist.
- [tracing documentation PR][tracing-doc-pr].
## Summary
### Time line
- 2021-07-01: A summary of the discussion was
[posted to the mail list](http://lists.katacontainers.io/pipermail/kata-dev/2021-July/001996.html).
- 2021-06-22: These proposals were
[discussed in the Kata Architecture Committee meeting](https://etherpad.opendev.org/p/Kata_Containers_2021_Architecture_Committee_Mtgs).
- 2021-06-18: These proposals where
[announced on the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-June/001980.html).
### Outcome
- Nobody opposed the agent proposals, so they are being implemented.
- The trace collection proposals are still being considered.
[kata-1x-tracing]: https://github.com/kata-containers/agent/blob/master/TRACING.md
[trace-forwarder]: /src/trace-forwarder
[tracing-doc-pr]: https://github.com/kata-containers/kata-containers/pull/1937