# Kata Tracing proposals ## Overview This document summarises a set of proposals triggered by the [tracing documentation PR][tracing-doc-pr]. ## Required context This section explains some terminology required to understand the proposals. Further details can be found in the [tracing documentation PR][tracing-doc-pr]. ### Agent trace mode terminology | Trace mode | Description | Use-case | |-|-|-| | Static | Trace agent from startup to shutdown | Entire lifespan | | Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" | ### Agent trace type terminology | Trace type | Description | Use-case | |-|-|-| | isolated | traces all relate to single component | Observing lifespan | | collated | traces "grouped" (runtime+agent) | Understanding component interaction | ### Container lifespan | Lifespan | trace mode | trace type | |-|-|-| | short-lived | static | collated if possible, else isolated? | | long-running | dynamic | collated? (to see interactions) | ## Original plan for agent - Implement all trace types and trace modes for agent. - Why? - Maximum flexibility. > **Counterargument:** > > Due to the intrusive nature of adding tracing, we have > learnt that landing small incremental changes is simpler and quicker! - Compatibility with [Kata 1.x tracing][kata-1x-tracing]. > **Counterargument:** > > Agent tracing in Kata 1.x was extremely awkward to setup (to the extent > that it's unclear how many users actually used it!) > > This point, coupled with the new architecture for Kata 2.x, suggests > that we may not need to supply the same set of tracing features (in fact > they may not make sense)). ## Agent tracing proposals ### Agent tracing proposal 1: Don't implement dynamic trace mode - All tracing will be static. - Why? - Because dynamic tracing will always be "partial" > In fact, not only would it be only a "snapshot" of activity, it may not > even be possible to create a complete "trace transaction". If this is > true, the trace output would be partial and would appear "unstructured". ### Agent tracing proposal 2: Simplify handling of trace type - Agent tracing will be "isolated" by default. - Agent tracing will be "collated" if runtime tracing is also enabled. - Why? - Offers a graceful fallback for agent tracing if runtime tracing disabled. - Simpler code! ## Questions to ask yourself (part 1) - Are your containers long-running or short-lived? - Would you ever need to turn on tracing "briefly"? - If "yes", is a "partial trace" useful or useless? > Likely to be considered useless as it is a partial snapshot. > Alternative tracing methods may be more appropriate to dynamic > OpenTelemetry tracing. ## Questions to ask yourself (part 2) - Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required. - Would you ever want to trace the agent and the runtime "in isolation" at the same time? - If "yes", we need to fully implement `trace_mode=isolated` > This seems unlikely though. ## Trace collection The second set of proposals affect the way traces are collected. ### Motivation Currently: - The runtime sends trace spans to Jaeger directly. - The agent will send trace spans to the [`trace-forwarder`][trace-forwarder] component. - The trace forwarder will send trace spans to Jaeger. Kata agent tracing overview: ``` +-------------------------------------------+ | Host | | | | +-----------+ | | | Trace | | | | Collector | | | +-----+-----+ | | ^ +--------------+ | | | spans | Kata VM | | | +-----+-----+ | | | | | Kata | spans | +-----+ | | | | Trace |<-----------------|Kata | | | | | Forwarder | VSOCK | |Agent| | | | +-----------+ Channel | +-----+ | | | +--------------+ | +-------------------------------------------+ ``` Currently: - If agent tracing is enabled but the trace forwarder is not running, the agent will error. - If the trace forwarder is started but Jaeger is not running, the trace forwarder will error. ### Goals - The runtime and agent should: - Use the same trace collection implementation. - Use the most the common configuration items. - Kata should should support more trace collection software or `SaaS` (for example `Zipkin`, `datadog`). - Trace collection should not block normal runtime/agent operations (for example if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally). ### Trace collection proposals #### Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy Kata runtime/agent all send spans to trace forwarder, and the trace forwarder, acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or `datadog`. **Pros:** - Runtime/agent will be simple. - Could update trace collection target while Kata Containers are running. **Cons:** - Requires the trace forwarder component to be running (that is a pressure to operation). #### Trace collection proposal 2: Send spans to collector directly from runtime/agent Send spans to collector directly from runtime/agent, this proposal need network accessible to the collector. **Pros:** - No additional trace forwarder component needed. **Cons:** - Need more code/configuration to support all trace collectors. ## Future work - We could add dynamic and fully isolated tracing at a later stage, if required. ## Further details - See the new [GitHub project](https://github.com/orgs/kata-containers/projects/28). - [kata-containers-tracing-status](https://gist.github.com/jodh-intel/0ee54d41d2a803ba761e166136b42277) gist. - [tracing documentation PR][tracing-doc-pr]. ## Summary ### Time line - 2021-07-01: A summary of the discussion was [posted to the mail list](http://lists.katacontainers.io/pipermail/kata-dev/2021-July/001996.html). - 2021-06-22: These proposals were [discussed in the Kata Architecture Committee meeting](https://etherpad.opendev.org/p/Kata_Containers_2021_Architecture_Committee_Mtgs). - 2021-06-18: These proposals where [announced on the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-June/001980.html). ### Outcome - Nobody opposed the agent proposals, so they are being implemented. - The trace collection proposals are still being considered. [kata-1x-tracing]: https://github.com/kata-containers/agent/blob/master/TRACING.md [trace-forwarder]: /src/tools/trace-forwarder [tracing-doc-pr]: https://github.com/kata-containers/kata-containers/pull/1937