# Kata Tracing proposals

## Overview

This document summarises a set of proposals triggered by the
[tracing documentation PR][tracing-doc-pr].

## Required context

This section explains some terminology required to understand the proposals.
Further details can be found in the
[tracing documentation PR][tracing-doc-pr].

### Agent trace mode terminology

| Trace mode | Description | Use-case |
|-|-|-|
| Static | Trace agent from startup to shutdown | Entire lifespan |
| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |

### Agent trace type terminology

| Trace type | Description | Use-case |
|-|-|-|
| Isolated | Traces all relate to a single component | Observing lifespan |
| Collated | Traces are "grouped" (runtime+agent) | Understanding component interaction |

### Container lifespan

| Lifespan | Trace mode | Trace type |
|-|-|-|
| short-lived | static | collated if possible, else isolated? |
| long-running | dynamic | collated? (to see interactions) |

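The terminology above can be written down as a small lookup. This is only an illustrative sketch: the names `TraceMode`, `TraceType`, `SUGGESTED` and `suggest` are hypothetical and do not come from the Kata code base, and the suggestions are as tentative as the question marks in the table.

```python
from enum import Enum


class TraceMode(Enum):
    STATIC = "static"    # trace the agent from startup to shutdown
    DYNAMIC = "dynamic"  # toggle tracing on and off on demand


class TraceType(Enum):
    ISOLATED = "isolated"  # traces all relate to a single component
    COLLATED = "collated"  # runtime and agent traces are grouped


# Tentative suggestions copied from the "Container lifespan" table.
SUGGESTED = {
    "short-lived": (TraceMode.STATIC, TraceType.COLLATED),
    "long-running": (TraceMode.DYNAMIC, TraceType.COLLATED),
}


def suggest(lifespan, collation_possible=True):
    """Return the (mode, type) pair suggested for a container lifespan."""
    mode, trace_type = SUGGESTED[lifespan]
    # Short-lived containers fall back to isolated traces when
    # collation is not possible.
    if lifespan == "short-lived" and not collation_possible:
        trace_type = TraceType.ISOLATED
    return mode, trace_type
```
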
## Original plan for agent

- Implement all trace types and trace modes for agent.

- Why?
  - Maximum flexibility.

    > **Counterargument:**
    >
    > Due to the intrusive nature of adding tracing, we have
    > learnt that landing small incremental changes is simpler and quicker!

  - Compatibility with [Kata 1.x tracing][kata-1x-tracing].

    > **Counterargument:**
    >
    > Agent tracing in Kata 1.x was extremely awkward to set up (to the extent
    > that it's unclear how many users actually used it!)
    >
    > This point, coupled with the new architecture for Kata 2.x, suggests
    > that we may not need to supply the same set of tracing features (in fact
    > they may not make sense).

## Agent tracing proposals

### Agent tracing proposal 1: Don't implement dynamic trace mode

- All tracing will be static.

- Why?
  - Because dynamic tracing will always be "partial".

    > In fact, not only would it be only a "snapshot" of activity, it may not
    > even be possible to create a complete "trace transaction". If this is
    > true, the trace output would be partial and would appear "unstructured".

### Agent tracing proposal 2: Simplify handling of trace type

- Agent tracing will be "isolated" by default.
- Agent tracing will be "collated" if runtime tracing is also enabled.

- Why?
  - Offers a graceful fallback for agent tracing if runtime tracing is disabled.
  - Simpler code!

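The rule in proposal 2 amounts to a single decision. The sketch below is an illustration only: `agent_trace_type` and its parameters are hypothetical names, not part of the agent or runtime.

```python
def agent_trace_type(agent_tracing, runtime_tracing):
    """Pick the agent trace type under proposal 2 (hypothetical helper).

    Returns None when agent tracing is disabled.
    """
    if not agent_tracing:
        return None
    # Collate with the runtime when it is traced too; otherwise fall
    # back gracefully to isolated agent traces.
    return "collated" if runtime_tracing else "isolated"
```

With this shape there is no separate trace type setting to validate: the type follows from which components have tracing enabled.
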
## Questions to ask yourself (part 1)

- Are your containers long-running or short-lived?

- Would you ever need to turn on tracing "briefly"?
  - If "yes", is a "partial trace" useful or useless?

    > Likely to be considered useless as it is a partial snapshot.
    > Alternative tracing methods may be more appropriate than dynamic
    > OpenTelemetry tracing.

## Questions to ask yourself (part 2)

- Are you happy to stop a container to enable tracing?
  If "no", dynamic tracing may be required.

- Would you ever want to trace the agent and the runtime "in isolation" at the
  same time?
  - If "yes", we need to fully implement `trace_type=isolated`.

    > This seems unlikely though.

## Trace collection

The second set of proposals affects the way traces are collected.

### Motivation

Currently:

- The runtime sends trace spans to Jaeger directly.
- The agent sends trace spans to the [`trace-forwarder`][trace-forwarder] component.
- The trace forwarder sends trace spans to Jaeger.

Kata agent tracing overview:

```
+-------------------------------------------+
| Host                                      |
|                                           |
| +-----------+                             |
| | Trace     |                             |
| | Collector |                             |
| +-----+-----+                             |
|       ^                  +--------------+ |
|       | spans            | Kata VM      | |
| +-----+-----+            |              | |
| | Kata      |      spans |     +-----+  | |
| | Trace     |<-----------------|Kata |  | |
| | Forwarder |   VSOCK    |     |Agent|  | |
| +-----------+  Channel   |     +-----+  | |
|                          +--------------+ |
+-------------------------------------------+
```

Currently:

- If agent tracing is enabled but the trace forwarder is not running,
  the agent will error.

- If the trace forwarder is started but Jaeger is not running,
  the trace forwarder will error.

### Goals

- The runtime and agent should:
  - Use the same trace collection implementation.
  - Share as many configuration items as possible.

- Kata should support more trace collection software or `SaaS`
  (for example `Zipkin`, `datadog`).

- Trace collection should not block normal runtime/agent operations
  (for example if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally).

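One way to meet the last goal is to buffer spans in a bounded queue and drop them when the collector cannot be reached, rather than returning an error to the caller. The sketch below only illustrates that idea: `NonBlockingSpanBuffer` and the `send` callable are hypothetical and do not reflect how the runtime, agent or trace forwarder are implemented.

```python
import queue


class NonBlockingSpanBuffer:
    """Buffer spans so that tracing never blocks or fails the caller."""

    def __init__(self, send, max_spans=1024):
        # `send` is a hypothetical callable that delivers one span
        # to a collector and raises OSError if it is unreachable.
        self._send = send
        self._queue = queue.Queue(maxsize=max_spans)
        self.dropped = 0

    def record(self, span):
        """Queue a span without waiting; drop it if the buffer is full."""
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def flush(self):
        """Try to deliver queued spans; return how many were sent."""
        sent = 0
        while True:
            try:
                span = self._queue.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._send(span)
                sent += 1
            except OSError:
                # Collector unreachable: drop this span and stop this
                # flush attempt instead of propagating the error.
                self.dropped += 1
                return sent
```

In a design like this, `flush` would typically run on a background thread or timer so that normal operations only ever call `record`.
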
### Trace collection proposals

#### Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy

The Kata runtime and agent both send spans to the trace forwarder. The trace forwarder,
acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or `datadog`.

**Pros:**

- Runtime/agent will be simple.
- The trace collection target could be updated while Kata Containers is running.

**Cons:**

- Requires the trace forwarder component to be running (which adds an operational burden).

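The proxy role described in proposal 1 can be sketched as follows. This is an assumption-laden illustration: `SpanProxy`, `set_backend` and `forward` are hypothetical names, spans are modelled as plain dictionaries, and the back-end is any callable that accepts a list of spans.

```python
class SpanProxy:
    """Forward spans from several sources to one replaceable back-end."""

    def __init__(self, backend):
        # `backend` is a hypothetical callable taking a list of spans.
        self._backend = backend

    def set_backend(self, backend):
        """Switch the collection target without restarting the senders."""
        self._backend = backend

    def forward(self, source, spans):
        """Tag each span with its source and pass the batch on."""
        tagged = [dict(span, source=source) for span in spans]
        self._backend(tagged)
        return len(tagged)
```

Because the runtime and agent only ever talk to the proxy, changing the back-end is a single call on the proxy side, which is the second "pro" listed above.
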
#### Trace collection proposal 2: Send spans to collector directly from runtime/agent

The runtime and agent send spans directly to the collector. This proposal
requires the collector to be reachable over the network.

**Pros:**

- No additional trace forwarder component needed.

**Cons:**

- Need more code/configuration to support all trace collectors.

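The "con" of proposal 2 is that each component needs an exporter per supported collector, selected by configuration. A minimal sketch of that shape is shown below. Everything in it is hypothetical: the `collector` and `endpoint` configuration keys, the registry, and the placeholder exporters, which return a string instead of performing any network I/O.

```python
EXPORTERS = {}


def exporter(name):
    """Register an exporter function under a collector name."""
    def register(func):
        EXPORTERS[name] = func
        return func
    return register


@exporter("jaeger")
def export_jaeger(span, endpoint):
    # Placeholder: a real exporter would encode and send the span.
    return "jaeger:%s:%s" % (endpoint, span["name"])


@exporter("zipkin")
def export_zipkin(span, endpoint):
    # Placeholder: a real exporter would encode and send the span.
    return "zipkin:%s:%s" % (endpoint, span["name"])


def export(config, span):
    """Send a span with the exporter named in the configuration."""
    name = config["collector"]
    if name not in EXPORTERS:
        raise ValueError("unsupported collector: %r" % name)
    return EXPORTERS[name](span, config["endpoint"])
```

Every additional collector means another registered exporter in both the runtime and the agent, which is the extra code and configuration the proposal warns about.
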
## Future work

- We could add dynamic and fully isolated tracing at a later stage,
  if required.

## Further details

- See the new [GitHub project](https://github.com/orgs/kata-containers/projects/28).
- [kata-containers-tracing-status](https://gist.github.com/jodh-intel/0ee54d41d2a803ba761e166136b42277) gist.
- [tracing documentation PR][tracing-doc-pr].

## Summary

### Time line

- 2021-07-01: A summary of the discussion was
  [posted to the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-July/001996.html).
- 2021-06-22: These proposals were
  [discussed in the Kata Architecture Committee meeting](https://etherpad.opendev.org/p/Kata_Containers_2021_Architecture_Committee_Mtgs).
- 2021-06-18: These proposals were
  [announced on the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-June/001980.html).

### Outcome

- Nobody opposed the agent proposals, so they are being implemented.
- The trace collection proposals are still being considered.

[kata-1x-tracing]: https://github.com/kata-containers/agent/blob/master/TRACING.md
[trace-forwarder]: /src/tools/trace-forwarder
[tracing-doc-pr]: https://github.com/kata-containers/kata-containers/pull/1937