mirror of https://github.com/kubeshark/kubeshark.git
synced 2026-03-18 18:42:44 +00:00

Compare commits: 5 commits (update/rea ... add-kubesh)

| SHA1 |
|---|
| 710345c628 |
| 963b3e4ac2 |
| b2813e02bd |
| 707d7351b6 |
| 23c86be773 |

102 CLAUDE.md Normal file
@@ -0,0 +1,102 @@
Do not include any Claude/AI attribution (Co-Authored-By lines, badges, etc.) in commit messages or pull request descriptions.

## Skills

Kubeshark is building an ecosystem of open-source AI skills that work with the Kubeshark MCP.
Skills live in the `skills/` directory at the root of this repo.

### What is a skill?

A skill is a SKILL.md file (with optional reference docs) that teaches an AI agent a domain-specific
methodology. The Kubeshark MCP provides the tools (snapshot creation, API call queries, PCAP export,
etc.) — a skill tells the agent *how* to use those tools for a specific job.

### Skill structure

```
skills/
└── <skill-name>/
    ├── SKILL.md         # Required. YAML frontmatter + markdown instructions.
    └── references/      # Optional. Reference docs loaded on demand.
        └── *.md
```

### SKILL.md format

Every SKILL.md starts with YAML frontmatter:

```yaml
---
name: skill-name
description: >
  When to trigger this skill. Be specific about user intents, keywords, and contexts.
  The description is the primary mechanism for AI agents to decide whether to load the skill.
---
```

The body is markdown instructions that define the methodology: prerequisites, workflows,
tool usage patterns, output guidelines, and reference pointers.

### Guidelines for writing skills

- Keep SKILL.md under 500 lines. Put detailed references in `references/` with clear pointers.
- Use imperative tone ("Check data boundaries", "Create a snapshot").
- Reference Kubeshark MCP tools by exact name (e.g., `create_snapshot`, `list_api_calls`).
- Include realistic example tool responses so the agent knows what to expect.
- Explain *why* things matter, not just *what* to do — the agent is smart and benefits from context.
- Include a Setup Reference section with MCP configuration for Claude Code and Claude Desktop.
- The description frontmatter should be "pushy" — include trigger keywords generously so the skill
  activates when needed. Better to over-trigger than under-trigger.

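A minimal skeleton that follows these guidelines might look like the sketch below (the skill name, trigger text, and workflow steps are illustrative, not an actual skill in this repo; the tool names are the real MCP tool names listed later in this file):

```
---
name: example-skill
description: >
  Trigger when the user asks about <domain>. List trigger keywords generously:
  "analyze", "investigate", "show me", plus relevant protocol and tool names.
---

# Example Skill

## Prerequisites

- Run `check_kubeshark_status` before anything else.

## Workflow

1. Create a snapshot with `create_snapshot`.
2. Query the results with `list_api_calls`.
```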
### Kubeshark MCP tools available to skills

**Cluster management**: `check_kubeshark_status`, `start_kubeshark`, `stop_kubeshark`
**Inventory**: `list_workloads`
**L7 API**: `list_api_calls`, `get_api_call`, `get_api_stats`
**L4 flows**: `list_l4_flows`, `get_l4_flow_summary`
**Snapshots**: `get_data_boundaries`, `create_snapshot`, `get_snapshot`, `list_snapshots`, `start_snapshot_dissection`
**PCAP**: `export_snapshot_pcap`, `resolve_workload`
**Cloud storage**: `get_cloud_storage_status`, `upload_snapshot_to_cloud`, `download_snapshot_from_cloud`
**Dissection**: `get_dissection_status`, `enable_dissection`, `disable_dissection`

### KFL (Kubeshark Filter Language)

KFL2 is built on CEL (Common Expression Language). Skills that involve traffic filtering should
reference KFL. Key concepts:

- Display filter (post-capture), not capture filter
- Fields: `src.ip`, `dst.ip`, `src.pod.name`, `dst.pod.namespace`, `src.service.name`, etc.
- Protocol booleans: `http`, `dns`, `redis`, `kafka`, `tls`, `grpc`, `amqp`, `ws`
- HTTP fields: `url`, `method`, `status_code`, `path`, `request.headers`, `response.headers`,
  `request_body_size`, `response_body_size`, `elapsed_time` (microseconds)
- DNS fields: `dns_questions`, `dns_answers`, `dns_question_types`
- Operators: `==`, `!=`, `<`, `>`, `&&`, `||`, `in`
- String functions: `.contains()`, `.startsWith()`, `.endsWith()`, `.matches()` (regex)
- Collection: `size()`, `[index]`, `[key]`
- Full reference: https://docs.kubeshark.com/en/v2/kfl2

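Combining these concepts, a few representative filters (field and function names as listed above; namespace and domain values are illustrative):

```
// 5xx responses from pods in the production namespace
http && status_code >= 500 && src.pod.namespace == "production"

// slow POST requests (elapsed_time is in microseconds)
http && method == "POST" && elapsed_time > 2000000

// DNS lookups for a specific external domain
dns && "api.example.com" in dns_questions
```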
### Key Kubeshark concepts for skill authors

- **eBPF capture**: Kernel-level, no sidecars/proxies. Decrypts TLS without private keys.
- **Protocols**: HTTP, gRPC, GraphQL, WebSocket, Kafka, Redis, AMQP, DNS, and more.
- **Raw capture**: FIFO buffer per node. Must be enabled for retrospective analysis.
- **Snapshots**: Immutable freeze of traffic in a time window. Includes raw capture files,
  K8s pod events, and eBPF cgroup events.
- **Dissection**: The "indexing" step. Reconstructs raw packets into structured L7 API calls.
  Think of it like a search engine indexing web pages — without dissection you have PCAPs,
  with dissection you have a queryable database. Kubeshark is the search engine for network traffic.
- **Cloud storage**: Snapshots can be uploaded to S3/GCS/Azure and downloaded to any cluster.
  A production snapshot can be analyzed on a local KinD cluster.

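Put together, a typical retrospective-analysis sequence chains the MCP tools named above (tool arguments omitted here; consult each tool's schema for its exact parameters):

```
get_data_boundaries          // what time window of raw capture is available?
create_snapshot              // freeze the window of interest
start_snapshot_dissection    // index raw packets into structured L7 API calls
list_api_calls               // query the dissected traffic with a KFL filter
export_snapshot_pcap         // extract raw packets for Wireshark if needed
```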
### Current skills

- `skills/network-rca/` — Network Root Cause Analysis. Retrospective traffic analysis via
  snapshots, dissection, KFL queries, PCAP extraction, trend comparison.
- `skills/kfl/` — KFL2 (Kubeshark Filter Language) expert. Complete reference for writing,
  debugging, and optimizing CEL-based traffic filters across all supported protocols.

### Planned skills (not yet created)

- `skills/api-security/` — OWASP API Top 10 assessment against live or snapshot traffic.
- `skills/incident-response/` — 7-phase forensic incident investigation methodology.
- `skills/network-engineering/` — Real-time traffic analysis, latency debugging, dependency mapping.
31 README.md
@@ -21,16 +21,13 @@ Kubeshark captures cluster-wide network traffic at the speed and scale of Kubern

Network data is available to **AI agents via [MCP](https://docs.kubeshark.com/en/mcp)** and to **human operators via a [dashboard](https://docs.kubeshark.com/en/v2)**.

**Kubeshark captures, processes, and retains cluster-wide network traffic:**

- **PCAP Retention** — continuous raw packet capture with point-in-time snapshots, exportable for Wireshark ([Snapshots →](https://docs.kubeshark.com/en/v2/traffic_snapshots))
- **L7 API Dissection** — real-time request/response matching with full payload parsing: HTTP, gRPC, GraphQL, Redis, Kafka, DNS ([API dissection →](https://docs.kubeshark.com/en/v2/l7_api_dissection))
- **Kubernetes Context** — every packet and API call resolved to pod, service, namespace, and node

**Additional benefits:**
**What's captured, cluster-wide:**

- **L4 Packets & TCP Metrics** — retransmissions, RTT, window saturation, connection lifecycle, packet loss across every node-to-node path ([TCP insights →](https://docs.kubeshark.com/en/mcp/tcp_insights))
- **L7 API Calls** — real-time request/response matching with full payload parsing: HTTP, gRPC, GraphQL, Redis, Kafka, DNS ([API dissection →](https://docs.kubeshark.com/en/v2/l7_api_dissection))
- **Decrypted TLS** — eBPF-based TLS decryption without key management
- **L4 TCP Insights** — retransmissions, RTT, window saturation, connection lifecycle, packet loss across every node-to-node path ([TCP insights →](https://docs.kubeshark.com/en/mcp/tcp_insights))
- **Kubernetes Context** — every packet and API call resolved to pod, service, namespace, and node
- **PCAP Retention** — point-in-time raw packet snapshots, exportable for Wireshark ([Snapshots →](https://docs.kubeshark.com/en/v2/traffic_snapshots))

![]()

@@ -81,16 +78,6 @@ Cluster-wide request/response matching with full payloads, parsed according to p

[Learn more →](https://docs.kubeshark.com/en/v2/l7_api_dissection)

### Cluster-wide PCAP

Generate a cluster-wide PCAP file from any point in time. Filter by time range, specific nodes, and BPF expressions (e.g. `net`, `ip`, `port`, `host`) to capture exactly the traffic you need — across the entire cluster, in a single file. Download and analyze with Wireshark, tshark, or any PCAP-compatible tool — or let your AI agent download and analyze programmatically via MCP.

Store snapshots locally or in S3/Azure Blob for long-term retention.

![]()

[Snapshots guide →](https://docs.kubeshark.com/en/v2/traffic_snapshots)

### L4/L7 Workload Map

Cluster-wide view of service communication: dependencies, traffic flow, and anomalies across all nodes and namespaces.
@@ -99,6 +86,14 @@ Cluster-wide view of service communication: dependencies, traffic flow, and anom

[Learn more →](https://docs.kubeshark.com/en/v2/service_map)

### Traffic Retention

Continuous raw packet capture with point-in-time snapshots. Export PCAP files for offline analysis with Wireshark or other tools.

![]()

[Snapshots guide →](https://docs.kubeshark.com/en/v2/traffic_snapshots)

---

## Features

@@ -153,6 +153,7 @@ func CreateDefaultConfig() ConfigStruct {
	},
	Dashboard: configStructs.DashboardConfig{
		CompleteStreamingEnabled: true,
		ClusterWideMapEnabled:    false,
	},
	Capture: configStructs.CaptureConfig{
		Dissection: configStructs.DissectionConfig{

@@ -202,6 +202,7 @@ type RoutingConfig struct {
type DashboardConfig struct {
	StreamingType            string `yaml:"streamingType" json:"streamingType" default:"connect-rpc"`
	CompleteStreamingEnabled bool   `yaml:"completeStreamingEnabled" json:"completeStreamingEnabled" default:"true"`
	ClusterWideMapEnabled    bool   `yaml:"clusterWideMapEnabled" json:"clusterWideMapEnabled" default:"false"`
}

type FrontRoutingConfig struct {
@@ -209,9 +210,9 @@ type FrontRoutingConfig struct {
}

type ReleaseConfig struct {
	Repo      string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
	Name      string `yaml:"name" json:"name" default:"kubeshark"`
	Namespace string `yaml:"namespace" json:"namespace" default:"default"`
	Repo          string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
	Name          string `yaml:"name" json:"name" default:"kubeshark"`
	Namespace     string `yaml:"namespace" json:"namespace" default:"default"`
	HelmChartPath string `yaml:"helmChartPath" json:"helmChartPath" default:""`
}

@@ -411,7 +412,6 @@ type TapConfig struct {
	Gitops                       GitopsConfig `yaml:"gitops" json:"gitops"`
	Sentry                       SentryConfig `yaml:"sentry" json:"sentry"`
	DefaultFilter                string       `yaml:"defaultFilter" json:"defaultFilter" default:""`
	LiveConfigMapChangesDisabled bool         `yaml:"liveConfigMapChangesDisabled" json:"liveConfigMapChangesDisabled" default:"false"`
	GlobalFilter                 string       `yaml:"globalFilter" json:"globalFilter" default:""`
	EnabledDissectors            []string     `yaml:"enabledDissectors" json:"enabledDissectors"`
	PortMapping                  PortMapping  `yaml:"portMapping" json:"portMapping"`

@@ -232,7 +232,6 @@ Example for overriding image names:
| `tap.sentry.enabled` | Enable sending of error logs to Sentry | `false` |
| `tap.sentry.environment` | Sentry environment to label error logs with | `production` |
| `tap.defaultFilter` | Sets the default dashboard KFL filter (e.g. `http`). By default, this value is set to filter out noisy protocols such as DNS, UDP, ICMP and TCP. The user can easily change this, **temporarily**, in the Dashboard. For a permanent change, you should change this value in the `values.yaml` or `config.yaml` file. | `""` |
| `tap.liveConfigMapChangesDisabled` | If set to `true`, all user functionality (scripting, targeting settings, global & default KFL modification, traffic recording, traffic capturing on/off, protocol dissectors) involving dynamic ConfigMap changes from the UI will be disabled | `false` |
| `tap.globalFilter` | Prepends to any KFL filter and can be used to limit what is visible in the dashboard. For example, `redact("request.headers.Authorization")` will redact the appropriate field. Another example: `!dns` will not show any DNS traffic. | `""` |
| `tap.metrics.port` | Pod port used to expose Prometheus metrics | `49100` |
| `tap.enabledDissectors` | This is an array of strings representing the list of supported protocols. Remove or comment out redundant protocols (e.g., dns). | The default list excludes `udp` and `tcp` |

@@ -95,7 +95,85 @@ helm install kubeshark kubeshark/kubeshark \

### Example: IRSA (recommended for EKS)

Create a ConfigMap with bucket configuration:
[IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) lets EKS pods assume an IAM role without static credentials. EKS injects a short-lived token into the pod automatically.

**Prerequisites:**

1. Your EKS cluster must have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) associated with it.
2. An IAM role with a trust policy that allows the Kubeshark service account to assume it.

**Step 1 — Create an IAM policy scoped to your bucket:**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:GetObjectVersion",
        "s3:DeleteObjectVersion",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning"
      ],
      "Resource": [
        "arn:aws:s3:::my-kubeshark-snapshots",
        "arn:aws:s3:::my-kubeshark-snapshots/*"
      ]
    }
  ]
}
```

> For read-only access, remove `s3:PutObject`, `s3:DeleteObject`, and `s3:DeleteObjectVersion`.

**Step 2 — Create an IAM role with IRSA trust policy:**

```bash
# Get your cluster's OIDC provider URL
OIDC_PROVIDER=$(aws eks describe-cluster --name CLUSTER_NAME \
  --query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')

# Create a trust policy
# The default K8s SA name is "<release-name>-service-account" (e.g. "kubeshark-service-account")
cat > trust-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/${OIDC_PROVIDER}"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "${OIDC_PROVIDER}:sub": "system:serviceaccount:NAMESPACE:kubeshark-service-account",
          "${OIDC_PROVIDER}:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
EOF

# Create the role and attach your policy
aws iam create-role \
  --role-name KubesharkS3Role \
  --assume-role-policy-document file://trust-policy.json

aws iam put-role-policy \
  --role-name KubesharkS3Role \
  --policy-name KubesharkSnapshotsBucketAccess \
  --policy-document file://bucket-policy.json
```

**Step 3 — Create a ConfigMap with bucket configuration:**

```yaml
apiVersion: v1
@@ -107,10 +185,12 @@ data:
  SNAPSHOT_AWS_REGION: us-east-1
```

Set Helm values:
**Step 4 — Set Helm values with `tap.annotations` to annotate the service account:**

```yaml
tap:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
  snapshots:
    cloud:
      provider: "s3"
@@ -118,7 +198,17 @@ tap:
        - kubeshark-s3-config
```

The hub pod's service account must be annotated for IRSA with an IAM role that has S3 access to the bucket.
Or via `--set`:

```bash
helm install kubeshark kubeshark/kubeshark \
  --set tap.snapshots.cloud.provider=s3 \
  --set tap.snapshots.cloud.s3.bucket=my-kubeshark-snapshots \
  --set tap.snapshots.cloud.s3.region=us-east-1 \
  --set tap.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
```

No `accessKey`/`secretKey` is needed — EKS injects credentials automatically via the IRSA token.

### Example: Static Credentials


@@ -26,15 +26,15 @@ spec:
- env:
  - name: REACT_APP_AUTH_ENABLED
    value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
    {{ (and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false }}
    {{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
    {{- else -}}
    {{ .Values.cloudLicenseEnabled | ternary "true" .Values.tap.auth.enabled }}
    {{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
    {{- end }}'
  - name: REACT_APP_AUTH_TYPE
    value: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
    default
    {{- else -}}
    {{ .Values.tap.auth.type }}
    {{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
    {{- end }}'
  - name: REACT_APP_COMPLETE_STREAMING_ENABLED
    value: '{{- if and (hasKey .Values.tap "dashboard") (hasKey .Values.tap.dashboard "completeStreamingEnabled") -}}
@@ -55,30 +55,22 @@ spec:
    false
    {{- end }}'
  - name: REACT_APP_SCRIPTING_DISABLED
    value: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
    {{- if .Values.demoModeEnabled -}}
    {{ .Values.demoModeEnabled | ternary false true }}
    {{- else -}}
    true
    {{- end }}
    {{- else -}}
    false
    {{- end }}'
    value: '{{ default false .Values.demoModeEnabled }}'
  - name: REACT_APP_TARGETED_PODS_UPDATE_DISABLED
    value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
    value: '{{ default false .Values.demoModeEnabled }}'
  - name: REACT_APP_PRESET_FILTERS_CHANGING_ENABLED
    value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
    value: '{{ not (default false .Values.demoModeEnabled) }}'
  - name: REACT_APP_BPF_OVERRIDE_DISABLED
    value: '{{ eq .Values.tap.packetCapture "af_packet" | ternary "false" "true" }}'
  - name: REACT_APP_RECORDING_DISABLED
    value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
    value: '{{ default false .Values.demoModeEnabled }}'
  - name: REACT_APP_DISSECTION_ENABLED
    value: '{{ .Values.tap.capture.dissection.enabled | ternary "true" "false" }}'
  - name: REACT_APP_DISSECTION_CONTROL_ENABLED
    value: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
    value: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
    true
    {{- else -}}
    {{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
    {{ not (default false .Values.demoModeEnabled) | ternary false true }}
    {{- end -}}'
  - name: 'REACT_APP_CLOUD_LICENSE_ENABLED'
    value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
@@ -91,7 +83,13 @@ spec:
  - name: REACT_APP_BETA_ENABLED
    value: '{{ default false .Values.betaEnabled | ternary "true" "false" }}'
  - name: REACT_APP_DISSECTORS_UPDATING_ENABLED
    value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
    value: '{{ not (default false .Values.demoModeEnabled) }}'
  - name: REACT_APP_SNAPSHOTS_UPDATING_ENABLED
    value: '{{ not (default false .Values.demoModeEnabled) }}'
  - name: REACT_APP_DEMO_MODE_ENABLED
    value: '{{ default false .Values.demoModeEnabled }}'
  - name: REACT_APP_CLUSTER_WIDE_MAP_ENABLED
    value: '{{ default false (((.Values).tap).dashboard).clusterWideMapEnabled }}'
  - name: REACT_APP_RAW_CAPTURE_ENABLED
    value: '{{ .Values.tap.capture.raw.enabled | ternary "true" "false" }}'
  - name: REACT_APP_SENTRY_ENABLED

@@ -19,14 +19,14 @@ data:
INGRESS_HOST: '{{ .Values.tap.ingress.host }}'
PROXY_FRONT_PORT: '{{ .Values.tap.proxy.front.port }}'
AUTH_ENABLED: '{{- if and .Values.cloudLicenseEnabled (not (empty .Values.license)) -}}
{{ and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex") | ternary true false }}
{{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
{{- else -}}
{{ .Values.cloudLicenseEnabled | ternary "true" (.Values.tap.auth.enabled | ternary "true" "") }}
{{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
{{- end }}'
AUTH_TYPE: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
default
{{- else -}}
{{ .Values.tap.auth.type }}
{{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
{{- end }}'
AUTH_SAML_IDP_METADATA_URL: '{{ .Values.tap.auth.saml.idpMetadataUrl }}'
AUTH_SAML_ROLE_ATTRIBUTE: '{{ .Values.tap.auth.saml.roleAttribute }}'
@@ -44,22 +44,14 @@ data:
false
{{- end }}'
TELEMETRY_DISABLED: '{{ not .Values.internetConnectivity | ternary "true" (not .Values.tap.telemetry.enabled | ternary "true" "false") }}'
SCRIPTING_DISABLED: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
{{- if .Values.demoModeEnabled -}}
{{ .Values.demoModeEnabled | ternary false true }}
{{- else -}}
true
{{- end }}
{{- else -}}
false
{{- end }}'
TARGETED_PODS_UPDATE_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
PRESET_FILTERS_CHANGING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
RECORDING_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
DISSECTION_CONTROL_ENABLED: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
SCRIPTING_DISABLED: '{{ default false .Values.demoModeEnabled }}'
TARGETED_PODS_UPDATE_DISABLED: '{{ default false .Values.demoModeEnabled }}'
PRESET_FILTERS_CHANGING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
RECORDING_DISABLED: '{{ (default false .Values.demoModeEnabled) | ternary true false }}'
DISSECTION_CONTROL_ENABLED: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
true
{{- else -}}
{{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
{{ (default false .Values.demoModeEnabled) | ternary false true }}
{{- end }}'
GLOBAL_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.globalFilter | quote }}
DEFAULT_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.defaultFilter | quote }}
@@ -76,7 +68,9 @@ data:
DUPLICATE_TIMEFRAME: '{{ .Values.tap.misc.duplicateTimeframe }}'
ENABLED_DISSECTORS: '{{ gt (len .Values.tap.enabledDissectors) 0 | ternary (join "," .Values.tap.enabledDissectors) "" }}'
CUSTOM_MACROS: '{{ toJson .Values.tap.customMacros }}'
DISSECTORS_UPDATING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
DISSECTORS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
SNAPSHOTS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
DEMO_MODE_ENABLED: '{{ default false .Values.demoModeEnabled }}'
DETECT_DUPLICATES: '{{ .Values.tap.misc.detectDuplicates | ternary "true" "false" }}'
PCAP_DUMP_ENABLE: '{{ .Values.pcapdump.enabled }}'
PCAP_TIME_INTERVAL: '{{ .Values.pcapdump.timeInterval }}'

@@ -185,6 +185,7 @@ tap:
  dashboard:
    streamingType: connect-rpc
    completeStreamingEnabled: true
    clusterWideMapEnabled: false
  telemetry:
    enabled: true
  resourceGuard:
@@ -197,7 +198,6 @@ tap:
    enabled: false
    environment: production
  defaultFilter: ""
  liveConfigMapChangesDisabled: false
  globalFilter: ""
  enabledDissectors:
    - amqp

331 skills/kfl/SKILL.md Normal file
@@ -0,0 +1,331 @@
---
name: kfl
description: >
  KFL2 (Kubeshark Filter Language) expert. Use this skill whenever the user needs to
  write, debug, or optimize KFL filters for Kubeshark traffic queries. Trigger on any
  mention of KFL, CEL filters, traffic filtering, display filters, query syntax,
  filter expressions, "how do I filter", "show me only", "find traffic where",
  protocol-specific queries (HTTP status codes, DNS lookups, Redis commands, Kafka topics),
  Kubernetes-aware filtering (by namespace, pod, service, label, annotation),
  L4 connection/flow filters, capture source filters, time-based queries, or any
  request to slice/search/narrow network traffic in Kubeshark. Also trigger when other
  skills need help constructing filters — KFL is the query language for all Kubeshark
  traffic analysis.
---

# KFL2 — Kubeshark Filter Language

You are a KFL2 expert. KFL2 is built on Google's CEL (Common Expression Language)
and is the query language for all Kubeshark traffic analysis. It operates as a
**display filter** — it doesn't affect what's captured, only what you see.

Think of KFL the way you think of SQL for databases or Google search syntax for
the web. Kubeshark captures and indexes all cluster traffic; KFL is how you
search it.

For the complete variable and field reference, see `references/kfl2-reference.md`.

## Core Syntax

KFL expressions are boolean CEL expressions. An empty filter matches everything.

### Operators

| Category | Operators |
|----------|-----------|
| Comparison | `==`, `!=`, `<`, `<=`, `>`, `>=` |
| Logical | `&&`, `\|\|`, `!` |
| Arithmetic | `+`, `-`, `*`, `/`, `%` |
| Membership | `in` |
| Ternary | `condition ? true_val : false_val` |

### String Functions

```
str.contains(substring)   // Substring search
str.startsWith(prefix)    // Prefix match
str.endsWith(suffix)      // Suffix match
str.matches(regex)        // Regex match
size(str)                 // String length
```

### Collection Functions

```
size(collection)            // List/map/string length
key in map                  // Key existence
map[key]                    // Value access
map_get(map, key, default)  // Safe access with default
value in list               // List membership
```

### Time Functions

```
timestamp("2026-03-14T22:00:00Z")  // Parse ISO timestamp
duration("5m")                     // Parse duration
now()                              // Current time (snapshot at filter creation)
```

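These compose into time-window filters. In the sketch below, `ts` is a stand-in for the entry timestamp field (see `references/kfl2-reference.md` for the actual field name):

```
// hypothetical: `ts` stands in for the entry timestamp field
ts >= timestamp("2026-03-14T22:00:00Z") &&
ts < timestamp("2026-03-14T22:00:00Z") + duration("10m")

// last five minutes, relative to filter creation
ts > now() - duration("5m")
```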
## Protocol Detection

Boolean flags that indicate which protocol was detected. Use these as the first
filter term — they're fast and narrow the search space immediately.

| Flag | Protocol | Flag | Protocol |
|------|----------|------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL | `amqp` | AMQP |
| `tcp` | TCP | `ldap` | LDAP |
| `udp` | UDP | `ws` | WebSocket |
| `sctp` | SCTP | `gql` | GraphQL (v1+v2) |
| `icmp` | ICMP | `gqlv1` / `gqlv2` | GraphQL version-specific |
| `radius` | RADIUS | `conn` / `flow` | L4 connection/flow tracking |
| `diameter` | Diameter | `tcp_conn` / `udp_conn` | Transport-specific connections |

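For example, lead with the flag, then add field predicates (the path and domain values here are illustrative):

```
http && path.contains("/checkout")               // protocol flag first, then fields
dns && "payments.example.com" in dns_questions   // DNS entries only
```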
## Kubernetes Context

The most common starting point. Filter by where traffic originates or terminates.

### Pod and Service Fields

```
src.pod.name == "orders-594487879c-7ddxf"
dst.pod.namespace == "production"
src.service.name == "api-gateway"
dst.service.namespace == "payments"
```

Pod fields fall back to service data when pod info is unavailable, so
`dst.pod.namespace` works even for service-level entries.

### Aggregate Collections

Match against any direction (src or dst):

```
"production" in namespaces    // Any namespace match
"orders" in pods              // Any pod name match
"api-gateway" in services     // Any service name match
```

### Labels and Annotations

```
local_labels["app"] == "checkout"
remote_labels["version"] == "canary"
"tier" in local_labels                      // Label existence
map_get(local_labels, "env", "") == "prod"  // Safe access
```

### Node and Process

```
node_name == "ip-10-0-25-170.ec2.internal"
local_process_name == "nginx"
remote_process_name.contains("postgres")
```

### DNS Resolution

```
src.dns == "api.example.com"
dst.dns.contains("redis")
```

## HTTP Filtering

HTTP is the most common protocol for API-level investigation.

### Fields

| Field | Type | Example |
|-------|------|---------|
| `method` | string | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"` |
| `url` | string | Full path + query: `"/api/users?id=123"` |
| `path` | string | Path only: `"/api/users"` |
| `status_code` | int | `200`, `404`, `500` |
| `http_version` | string | `"HTTP/1.1"`, `"HTTP/2"` |
| `request.headers` | map | `request.headers["content-type"]` |
| `response.headers` | map | `response.headers["server"]` |
| `request.cookies` | map | `request.cookies["session"]` |
| `response.cookies` | map | `response.cookies["token"]` |
| `query_string` | map | `query_string["id"]` |
| `request_body_size` | int | Request body bytes |
| `response_body_size` | int | Response body bytes |
| `elapsed_time` | int | Duration in **microseconds** |

### Common Patterns

```
// Error investigation
http && status_code >= 500                       // Server errors
http && status_code == 429                       // Rate limiting
http && status_code >= 400 && status_code < 500  // Client errors

// Endpoint targeting
http && method == "POST" && path.contains("/orders")
http && url.matches(".*/api/v[0-9]+/users.*")

// Performance
http && elapsed_time > 5000000        // > 5 seconds
http && response_body_size > 1000000  // > 1MB responses

// Header inspection
http && "authorization" in request.headers
http && request.headers["content-type"] == "application/json"

// GraphQL (subset of HTTP)
gql && method == "POST" && status_code >= 400
```
## DNS Filtering

DNS issues are often the hidden root cause of outages.

| Field | Type | Description |
|-------|------|-------------|
| `dns_questions` | []string | Question domain names |
| `dns_answers` | []string | Answer domain names |
| `dns_question_types` | []string | Record types: A, AAAA, CNAME, MX, TXT, SRV, PTR |
| `dns_request` | bool | Is request |
| `dns_response` | bool | Is response |
| `dns_request_length` | int | Request size |
| `dns_response_length` | int | Response size |

```
dns && "api.external-service.com" in dns_questions
dns && dns_response && status_code != 0  // Failed lookups
dns && "A" in dns_question_types         // A record queries
dns && size(dns_questions) > 1           // Multi-question
```
## Database and Messaging Protocols

### Redis

```
redis && redis_type == "GET"               // Command type
redis && redis_key.startsWith("session:")  // Key pattern
redis && redis_command.contains("DEL")     // Command search
redis && redis_total_size > 10000          // Large operations
```

### Kafka

```
kafka && kafka_api_key_name == "PRODUCE"           // Produce operations
kafka && kafka_client_id == "payment-processor"    // Client filtering
kafka && kafka_request_summary.contains("orders")  // Topic filtering
kafka && kafka_size > 10000                        // Large messages
```

### AMQP

```
amqp && amqp_method == "basic.publish"
amqp && amqp_summary.contains("order")
```

### LDAP

```
ldap && ldap_type == "bind"  // Bind requests
ldap && ldap_summary.contains("search")
```
## Transport Layer (L4)

### TCP/UDP Fields

```
tcp && tcp_error_type != ""  // TCP errors
udp && udp_length > 1000     // Large UDP packets
```

### Connection Tracking

```
conn && conn_state == "open"        // Active connections
conn && conn_local_bytes > 1000000  // High-volume
conn && "HTTP" in conn_l7_detected  // L7 protocol detection
tcp_conn && conn_state == "closed"  // Closed TCP connections
```

### Flow Tracking (with Rate Metrics)

```
flow && flow_local_pps > 1000         // High packet rate
flow && flow_local_bps > 1000000      // High bandwidth
flow && flow_state == "closed" && "TLS" in flow_l7_detected
tcp_flow && flow_local_bps > 5000000  // High-throughput TCP
```
## Network Layer

```
src.ip == "10.0.53.101"
dst.ip.startsWith("192.168.")
src.port == 8080
dst.port >= 8000 && dst.port <= 9000
```

## Capture Source

Filter by how traffic was captured:

```
capture_source == "ebpf"       // eBPF captured
capture_source == "ebpf_tls"   // TLS decryption via eBPF
capture_source == "af_packet"  // AF_PACKET captured
capture_backend == "ebpf"      // eBPF backend family
```

## Time-Based Filtering

```
timestamp > timestamp("2026-03-14T22:00:00Z")
timestamp >= timestamp("2026-03-14T22:00:00Z") && timestamp <= timestamp("2026-03-14T23:00:00Z")
timestamp > now() - duration("5m")  // Last 5 minutes
elapsed_time > 2000000              // Older than 2 seconds
```
## Building Filters: Progressive Narrowing

The most effective investigation technique — start broad, add constraints:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")

// Step 5: Add timing
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000
```
## Performance Tips

1. **Protocol flags first** — `http && ...` is faster than `... && http`
2. **`startsWith`/`endsWith` over `contains`** — prefix/suffix checks are faster
3. **Specific ports before string ops** — `dst.port == 80` is cheaper than `url.contains(...)`
4. **Use `map_get` for labels** — avoids errors on missing keys
5. **Keep filters simple** — CEL short-circuits on `&&`, so put cheap checks first
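As an illustration of these tips (the endpoint and port are hypothetical, not from a real capture), the two filters below match the same entries, but the second evaluates its cheapest terms first:

```
// Slower: the string scan runs against every entry
url.contains("/checkout") && http

// Faster: protocol flag and port check short-circuit before the string op
http && dst.port == 443 && url.contains("/checkout")
```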
## Type Safety

KFL2 is statically typed. Common gotchas:

- `status_code` is `int`, not string — use `status_code == 200`, not `"200"`
- `elapsed_time` is in **microseconds** — 5 seconds = `5000000`
- `timestamp` requires the `timestamp()` function — not a raw string
- Map access on missing keys errors — use `key in map` or `map_get()` first
- List membership uses `value in list` — not `list.contains(value)`
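A few wrong/right pairs illustrating these gotchas (values are illustrative only):

```
status_code == "500"           // wrong: status_code is int
status_code == 500             // right

elapsed_time > 5               // wrong: that is 5 microseconds
elapsed_time > 5000000         // right: 5 seconds

timestamp > "2026-03-14"                       // wrong: raw string
timestamp > timestamp("2026-03-14T00:00:00Z")  // right

dns_answers.contains("1.2.3.4")  // wrong: lists have no contains method
"1.2.3.4" in dns_answers         // right
```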
---

**File: `skills/kfl/references/kfl2-reference.md`** (new, 375 lines)
# KFL2 Complete Variable and Field Reference

This is the exhaustive reference for every variable available in KFL2 filters.
KFL2 is built on Google's CEL (Common Expression Language) and evaluates against
Kubeshark's protobuf-based `BaseEntry` structure.

## Network-Level Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `src.ip` | string | Source IP address | `"10.0.53.101"` |
| `dst.ip` | string | Destination IP address | `"192.168.1.1"` |
| `src.port` | int | Source port number | `43210` |
| `dst.port` | int | Destination port number | `8080` |
| `protocol` | string | Detected protocol type | `"HTTP"`, `"DNS"` |

## Identity and Metadata Variables

| Variable | Type | Description |
|----------|------|-------------|
| `id` | int | BaseEntry unique identifier (assigned by sniffer) |
| `node_id` | string | Node identifier (assigned by hub) |
| `index` | int | Entry index for stream uniqueness |
| `stream` | string | Stream identifier (hex string) |
| `timestamp` | timestamp | Event time (UTC), use with `timestamp()` function |
| `elapsed_time` | int | Age since timestamp in microseconds |
| `worker` | string | Worker identifier |

## Cross-Reference Variables

| Variable | Type | Description |
|----------|------|-------------|
| `conn_id` | int | L7 to L4 connection cross-reference ID |
| `flow_id` | int | L7 to L4 flow cross-reference ID |
| `has_pcap` | bool | Whether PCAP data is available for this entry |

## Capture Source Variables

| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `capture_source` | string | Canonical capture source | `"unspecified"`, `"af_packet"`, `"ebpf"`, `"ebpf_tls"` |
| `capture_backend` | string | Backend family | `"af_packet"`, `"ebpf"` |
| `capture_source_code` | int | Numeric enum | 0=unspecified, 1=af_packet, 2=ebpf, 3=ebpf_tls |
| `capture` | map | Nested map access | `capture["source"]`, `capture["backend"]` |
## Protocol Detection Flags

Boolean variables indicating the detected protocol. Use them as the first filter term for performance.

| Variable | Protocol | Variable | Protocol |
|----------|----------|----------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL handshake | `amqp` | AMQP messaging |
| `tcp` | TCP transport | `ldap` | LDAP directory |
| `udp` | UDP transport | `ws` | WebSocket |
| `sctp` | SCTP streaming | `gql` | GraphQL (v1 or v2) |
| `icmp` | ICMP | `gqlv1` | GraphQL v1 only |
| `radius` | RADIUS auth | `gqlv2` | GraphQL v2 only |
| `diameter` | Diameter | `conn` | L4 connection tracking |
| `flow` | L4 flow tracking | `tcp_conn` | TCP connection tracking |
| `tcp_flow` | TCP flow tracking | `udp_conn` | UDP connection tracking |
| `udp_flow` | UDP flow tracking | | |
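A few illustrative combinations (workload names are hypothetical), each led by a protocol flag:

```
redis && redis_key.startsWith("cart:")  // narrow to Redis before the string check
gqlv2 && status_code >= 400             // GraphQL v2 errors only
udp_flow && flow_remote_bps > 1000000   // high-bandwidth UDP flows
```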
## HTTP Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `method` | string | HTTP method | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"`, `"PATCH"` |
| `url` | string | Full URL path and query string | `"/api/users?id=123"` |
| `path` | string | URL path component (no query) | `"/api/users"` |
| `status_code` | int | HTTP response status code | `200`, `404`, `500` |
| `http_version` | string | HTTP protocol version | `"HTTP/1.1"`, `"HTTP/2"` |
| `query_string` | map[string]string | Parsed URL query parameters | `query_string["id"]` → `"123"` |
| `request.headers` | map[string]string | Request HTTP headers | `request.headers["content-type"]` |
| `response.headers` | map[string]string | Response HTTP headers | `response.headers["server"]` |
| `request.cookies` | map[string]string | Request cookies | `request.cookies["session"]` |
| `response.cookies` | map[string]string | Response cookies | `response.cookies["token"]` |
| `request_headers_size` | int | Request headers size in bytes | |
| `request_body_size` | int | Request body size in bytes | |
| `response_headers_size` | int | Response headers size in bytes | |
| `response_body_size` | int | Response body size in bytes | |

GraphQL requests have `gql` (or `gqlv1`/`gqlv2`) set to true and all HTTP
variables available.

## DNS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `dns_questions` | []string | Question domain names (request + response) | `["example.com"]` |
| `dns_answers` | []string | Answer domain names | `["1.2.3.4"]` |
| `dns_question_types` | []string | Record types in questions | `["A"]`, `["AAAA"]`, `["CNAME"]` |
| `dns_request` | bool | Is DNS request message | |
| `dns_response` | bool | Is DNS response message | |
| `dns_request_length` | int | DNS request size in bytes (0 if absent) | |
| `dns_response_length` | int | DNS response size in bytes (0 if absent) | |
| `dns_total_size` | int | Sum of request + response sizes | |

Supported question types: A, AAAA, NS, CNAME, SOA, MX, TXT, SRV, PTR, ANY.

## TLS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `tls` | bool | TLS payload detected | |
| `tls_summary` | string | TLS handshake summary | `"ClientHello"`, `"ServerHello"` |
| `tls_info` | string | TLS connection details | `"TLS 1.3, AES-256-GCM"` |
| `tls_request_size` | int | TLS request size in bytes | |
| `tls_response_size` | int | TLS response size in bytes | |
| `tls_total_size` | int | Sum of request + response (computed if not provided) | |

## TCP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `tcp` | bool | TCP payload detected |
| `tcp_method` | string | TCP method information |
| `tcp_payload` | bytes | Raw TCP payload data |
| `tcp_error_type` | string | TCP error type (empty if none) |
| `tcp_error_message` | string | TCP error message (empty if none) |

## UDP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `udp` | bool | UDP payload detected |
| `udp_length` | int | UDP packet length |
| `udp_checksum` | int | UDP checksum value |
| `udp_payload` | bytes | Raw UDP payload data |

## SCTP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `sctp` | bool | SCTP payload detected |
| `sctp_checksum` | int | SCTP checksum value |
| `sctp_chunk_type` | string | SCTP chunk type |
| `sctp_length` | int | SCTP chunk length |

## ICMP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `icmp` | bool | ICMP payload detected |
| `icmp_type` | string | ICMP type code |
| `icmp_version` | int | ICMP version (4 or 6) |
| `icmp_length` | int | ICMP message length |

## WebSocket Variables

| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `ws` | bool | WebSocket payload detected | |
| `ws_opcode` | string | WebSocket operation code | `"text"`, `"binary"`, `"close"`, `"ping"`, `"pong"` |
| `ws_request` | bool | Is WebSocket request | |
| `ws_response` | bool | Is WebSocket response | |
| `ws_request_payload_data` | string | Request payload (safely truncated) | |
| `ws_request_payload_length` | int | Request payload length in bytes | |
| `ws_response_payload_length` | int | Response payload length in bytes | |

## Redis Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `redis` | bool | Redis payload detected | |
| `redis_type` | string | Redis command verb | `"GET"`, `"SET"`, `"DEL"`, `"HGET"` |
| `redis_command` | string | Full Redis command line | `"GET session:1234"` |
| `redis_key` | string | Key (truncated to 64 bytes) | `"session:1234"` |
| `redis_request_size` | int | Request size (0 if absent) | |
| `redis_response_size` | int | Response size (0 if absent) | |
| `redis_total_size` | int | Sum of request + response | |

## Kafka Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `kafka` | bool | Kafka payload detected | |
| `kafka_api_key` | int | Kafka API key number | 0=FETCH, 1=PRODUCE |
| `kafka_api_key_name` | string | Human-readable API operation | `"PRODUCE"`, `"FETCH"` |
| `kafka_client_id` | string | Kafka client identifier | `"payment-processor"` |
| `kafka_size` | int | Message size (request preferred, else response) | |
| `kafka_request` | bool | Is Kafka request | |
| `kafka_response` | bool | Is Kafka response | |
| `kafka_request_summary` | string | Request summary/topic | `"orders-topic"` |
| `kafka_request_size` | int | Request size (0 if absent) | |
| `kafka_response_size` | int | Response size (0 if absent) | |

## AMQP Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `amqp` | bool | AMQP payload detected | |
| `amqp_method` | string | AMQP method name | `"basic.publish"`, `"channel.open"` |
| `amqp_summary` | string | Operation summary | |
| `amqp_request` | bool | Is AMQP request | |
| `amqp_response` | bool | Is AMQP response | |
| `amqp_request_length` | int | Request length (0 if absent) | |
| `amqp_response_length` | int | Response length (0 if absent) | |
| `amqp_total_size` | int | Sum of request + response | |

## LDAP Variables

| Variable | Type | Description |
|----------|------|-------------|
| `ldap` | bool | LDAP payload detected |
| `ldap_type` | string | LDAP operation type (request preferred) |
| `ldap_summary` | string | Operation summary |
| `ldap_request` | bool | Is LDAP request |
| `ldap_response` | bool | Is LDAP response |
| `ldap_request_length` | int | Request length (0 if absent) |
| `ldap_response_length` | int | Response length (0 if absent) |
| `ldap_total_size` | int | Sum of request + response |

## RADIUS Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `radius` | bool | RADIUS payload detected | |
| `radius_code` | int | RADIUS code (request preferred) | |
| `radius_code_name` | string | Code name | `"Access-Request"` |
| `radius_request` | bool | Is RADIUS request | |
| `radius_response` | bool | Is RADIUS response | |
| `radius_request_authenticator` | string | Request authenticator (hex) | |
| `radius_request_length` | int | Request size (0 if absent) | |
| `radius_response_length` | int | Response size (0 if absent) | |
| `radius_total_size` | int | Sum of request + response | |

## Diameter Variables

| Variable | Type | Description |
|----------|------|-------------|
| `diameter` | bool | Diameter payload detected |
| `diameter_method` | string | Method name (request preferred) |
| `diameter_summary` | string | Operation summary |
| `diameter_request` | bool | Is Diameter request |
| `diameter_response` | bool | Is Diameter response |
| `diameter_request_length` | int | Request size (0 if absent) |
| `diameter_response_length` | int | Response size (0 if absent) |
| `diameter_total_size` | int | Sum of request + response |
## L4 Connection Tracking Variables

| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `conn` | bool | Connection tracking entry | |
| `conn_state` | string | Connection state | `"open"`, `"in_progress"`, `"closed"` |
| `conn_local_pkts` | int | Packets from local peer | |
| `conn_local_bytes` | int | Bytes from local peer | |
| `conn_remote_pkts` | int | Packets from remote peer | |
| `conn_remote_bytes` | int | Bytes from remote peer | |
| `conn_l7_detected` | []string | L7 protocols detected on connection | `["HTTP", "TLS"]` |
| `conn_group_id` | int | Connection group identifier | |

## L4 Flow Tracking Variables

Flows extend connections with rate metrics (packets/bytes per second).

| Variable | Type | Description |
|----------|------|-------------|
| `flow` | bool | Flow tracking entry |
| `flow_state` | string | Flow state (`"open"`, `"in_progress"`, `"closed"`) |
| `flow_local_pkts` | int | Packets from local peer |
| `flow_local_bytes` | int | Bytes from local peer |
| `flow_remote_pkts` | int | Packets from remote peer |
| `flow_remote_bytes` | int | Bytes from remote peer |
| `flow_local_pps` | int | Local packets per second |
| `flow_local_bps` | int | Local bytes per second |
| `flow_remote_pps` | int | Remote packets per second |
| `flow_remote_bps` | int | Remote bytes per second |
| `flow_l7_detected` | []string | L7 protocols detected on flow |
| `flow_group_id` | int | Flow group identifier |
## Kubernetes Variables

### Pod and Service (Directional)

| Variable | Type | Description |
|----------|------|-------------|
| `src.pod.name` | string | Source pod name |
| `src.pod.namespace` | string | Source pod namespace |
| `dst.pod.name` | string | Destination pod name |
| `dst.pod.namespace` | string | Destination pod namespace |
| `src.service.name` | string | Source service name |
| `src.service.namespace` | string | Source service namespace |
| `dst.service.name` | string | Destination service name |
| `dst.service.namespace` | string | Destination service namespace |

**Fallback behavior**: Pod namespace/name fields automatically fall back to
service data when pod info is unavailable. This means `dst.pod.namespace` works
even when only service-level resolution exists.
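For example (hypothetical names), these filters work regardless of whether the peer resolved to a pod or only to a service:

```
dst.pod.namespace == "production"     // matches pod- or service-level resolution
src.pod.name.startsWith("checkout-")  // pod name, when pod info is available
```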
### Aggregate Collections (Non-Directional)

| Variable | Type | Description |
|----------|------|-------------|
| `namespaces` | []string | All namespaces (src + dst, pod + service) |
| `pods` | []string | All pod names (src + dst) |
| `services` | []string | All service names (src + dst) |
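These collections are convenient when direction doesn't matter. For instance (hypothetical names):

```
"production" in namespaces     // either peer is in the namespace
"payment-service" in services  // matches as source or destination
size(pods) == 0                // no pod resolved on either side
```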
### Labels and Annotations

| Variable | Type | Description |
|----------|------|-------------|
| `local_labels` | map[string]string | Kubernetes labels of local peer |
| `local_annotations` | map[string]string | Kubernetes annotations of local peer |
| `remote_labels` | map[string]string | Kubernetes labels of remote peer |
| `remote_annotations` | map[string]string | Kubernetes annotations of remote peer |

Use `map_get(local_labels, "key", "default")` for safe access that won't error
on missing keys.
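Two equivalent safe-access patterns (label keys and values are hypothetical):

```
"team" in local_labels && local_labels["team"] == "payments"  // guard, then index
map_get(remote_labels, "app", "") == "checkout"               // default on missing key
```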
### Node Information

| Variable | Type | Description |
|----------|------|-------------|
| `node` | map | Nested: `node["name"]`, `node["ip"]` |
| `node_name` | string | Node name (flat alias) |
| `node_ip` | string | Node IP (flat alias) |
| `local_node_name` | string | Node name of local peer |
| `remote_node_name` | string | Node name of remote peer |

### Process Information

| Variable | Type | Description |
|----------|------|-------------|
| `local_process_name` | string | Process name on local peer |
| `remote_process_name` | string | Process name on remote peer |

### DNS Resolution

| Variable | Type | Description |
|----------|------|-------------|
| `src.dns` | string | DNS resolution of source IP |
| `dst.dns` | string | DNS resolution of destination IP |
| `dns_resolutions` | []string | All DNS resolutions (deduplicated) |

### Resolution Status

| Variable | Type | Values |
|----------|------|--------|
| `local_resolution_status` | string | `""` (resolved), `"no_node_mapping"`, `"rpc_error"`, `"rpc_empty"`, `"cache_miss"`, `"queue_full"` |
| `remote_resolution_status` | string | Same as above |

## Default Values

When a variable is not present in an entry, KFL2 uses these defaults:

| Type | Default |
|------|---------|
| string | `""` |
| int | `0` |
| bool | `false` |
| list | `[]` |
| map | `{}` |
| bytes | `[]` |

## Protocol Variable Precedence

For protocols with request/response pairs (Kafka, RADIUS, Diameter), merged
fields prefer the **request** side. If no request exists, the response value
is used. Size totals are always computed as `request_size + response_size`.
## CEL Language Features

KFL2 supports the full CEL specification:

- **Short-circuit evaluation**: `&&` stops on first false, `||` stops on first true
- **Ternary**: `condition ? value_if_true : value_if_false`
- **Regex**: `str.matches("pattern")` uses RE2 syntax
- **Type coercion**: Timestamps require `timestamp()`, durations require `duration()`
- **Null safety**: Use `in` operator or `map_get()` before accessing map keys
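A sketch combining several of these features in one filter (thresholds and path are hypothetical):

```
(http ? status_code >= 500 : tcp_error_type != "")  // ternary over protocol
&& timestamp > now() - duration("15m")              // duration coercion
&& path.matches("^/api/v[0-9]+/")                   // RE2 regex
```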
For the full CEL specification, see the
[CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md).
---

**File: `skills/network-rca/SKILL.md`** (new, 409 lines)
---
name: network-rca
description: >
  Kubernetes network root cause analysis skill powered by Kubeshark MCP. Use this skill
  whenever the user wants to investigate past incidents, perform retrospective traffic
  analysis, take or manage traffic snapshots, extract PCAPs, dissect L7 API calls from
  historical captures, compare traffic patterns over time, detect drift or anomalies
  between snapshots, or do any kind of forensic network analysis in Kubernetes.
  Also trigger when the user mentions snapshots, raw capture, PCAP extraction,
  traffic replay, postmortem analysis, "what happened yesterday/last week",
  root cause analysis, RCA, cloud snapshot storage, snapshot dissection, or KFL filters
  for historical traffic. Even if the user just says "figure out what went wrong"
  or "compare today's traffic to yesterday" in a Kubernetes context, use this skill.
---
|
||||
# Network Root Cause Analysis with Kubeshark MCP
|
||||
|
||||
You are a Kubernetes network forensics specialist. Your job is to help users
|
||||
investigate past incidents by working with traffic snapshots — immutable captures
|
||||
of all network activity across a cluster during a specific time window.
|
||||
|
||||
Kubeshark is a search engine for network traffic. Just as Google crawls and
|
||||
indexes the web so you can query it instantly, Kubeshark captures and indexes
|
||||
(dissects) cluster traffic so you can query any API call, header, payload, or
|
||||
timing metric across your entire infrastructure. Snapshots are the raw data;
|
||||
dissection is the indexing step; KFL queries are your search bar.
|
||||
|
||||
Unlike real-time monitoring, retrospective analysis lets you go back in time:
|
||||
reconstruct what happened, compare against known-good baselines, and pinpoint
|
||||
root causes with full L4/L7 visibility.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
Before starting any analysis, verify the environment is ready.
|
||||
|
||||
### Kubeshark MCP Health Check
|
||||
|
||||
Confirm the Kubeshark MCP is accessible and tools are available. Look for tools
|
||||
like `list_api_calls`, `list_l4_flows`, `create_snapshot`, etc.
|
||||
|
||||
**Tool**: `check_kubeshark_status`
|
||||
|
||||
If tools like `list_api_calls` or `list_l4_flows` are missing from the response,
|
||||
something is wrong with the MCP connection. Guide the user through setup
|
||||
(see Setup Reference at the bottom).
|
||||
|
||||
### Raw Capture Must Be Enabled

Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF)
packet recording that stores traffic at the node level. Without it, snapshots
have nothing to work with.

Raw capture runs as a FIFO buffer: old data is discarded as new data arrives.
The buffer size determines how far back you can go. Larger buffer = wider
snapshot window.

```yaml
tap:
  capture:
    raw:
      enabled: true
      storageSize: 10Gi  # Per-node FIFO buffer
```

If raw capture isn't enabled, inform the user that retrospective analysis
requires it and share the configuration above.

### Snapshot Storage

Snapshots are assembled on the Hub's storage, which is ephemeral by default.
For serious forensic work, persistent storage is recommended:

```yaml
tap:
  snapshots:
    local:
      storageClass: gp2
      storageSize: 1000Gi
```
## Core Workflow

The general flow for any RCA investigation:

1. **Determine time window** — When did the issue occur? Use `get_data_boundaries`
   to see what raw capture data is available.
2. **Create or locate a snapshot** — Either take a new snapshot covering the
   incident window, or find an existing one with `list_snapshots`.
3. **Dissect the snapshot** — Activate L7 dissection so you can query API calls,
   not just raw packets.
4. **Investigate** — Use KFL filters to slice through the traffic. Start broad,
   narrow progressively.
5. **Extract evidence** — Export filtered PCAPs, resolve workload IPs, pull
   specific API call details.
6. **Compare** (optional) — Diff against a known-good snapshot to identify
   what changed.
## Snapshot Operations

### Check Data Boundaries

Before creating a snapshot, check what raw capture data exists across the cluster.

**Tool**: `get_data_boundaries`

This returns the time window available per node. You can only create snapshots
within these boundaries — data outside the window has already been rotated out
of the FIFO buffer.

**Example response**:

```
Cluster-wide:
  Oldest: 2026-03-14 16:12:34 UTC
  Newest: 2026-03-14 18:05:20 UTC

Per node:
┌─────────────────────────────┬──────────┬──────────┐
│ Node                        │ Oldest   │ Newest   │
├─────────────────────────────┼──────────┼──────────┤
│ ip-10-0-25-170.ec2.internal │ 16:12:34 │ 18:03:39 │
│ ip-10-0-32-115.ec2.internal │ 16:13:45 │ 18:05:20 │
└─────────────────────────────┴──────────┴──────────┘
```

If the user's incident falls outside the available window, let them know the
data has been rotated out. Suggest increasing `storageSize` for future coverage.
### Create a Snapshot

**Tool**: `create_snapshot`

Specify nodes (or cluster-wide) and a time window within the data boundaries.
Snapshots include everything needed to reconstruct the traffic picture:
raw capture files, Kubernetes pod events, and eBPF cgroup events.

Snapshots take time to build. After creating one, check its status.

**Tool**: `get_snapshot`

Wait until status is `completed` before proceeding with dissection or PCAP export.

### List Existing Snapshots

**Tool**: `list_snapshots`

Shows all snapshots on the local Hub, with name, size, status, and node count.
Use this when the user wants to work with a previously captured snapshot.
### Cloud Storage

Snapshots on the Hub are ephemeral and space-limited. Cloud storage (S3, GCS,
Azure Blob) provides long-term retention. Snapshots can be downloaded to any
cluster running Kubeshark — not necessarily the original cluster. This means you
can download a production snapshot to a local KinD cluster for safe analysis.

**Check cloud status**: `get_cloud_storage_status`
**Upload to cloud**: `upload_snapshot_to_cloud`
**Download from cloud**: `download_snapshot_from_cloud`

When cloud storage is configured, recommend uploading snapshots after analysis
for long-term retention, especially for compliance or post-mortem documentation.
## L7 API Dissection

Think of dissection the way a search engine thinks of indexing. A raw snapshot
is like the raw internet — billions of packets, impossible to query efficiently.
Dissection indexes that traffic: it reconstructs packets into structured L7 API
calls, builds a queryable database of every request, response, header, payload,
and timing metric. Once dissected, Kubeshark becomes a search engine for your
network traffic — you type a query (using KFL filters), and get instant,
precise answers from terabytes of captured data.

Without dissection, you have PCAPs. With dissection, you have answers.

### Activate Dissection

**Tool**: `start_snapshot_dissection`

Dissection takes time proportional to the snapshot size — it's parsing every
packet, reassembling streams, and building the index. After it completes,
the full query engine is available:
- `list_api_calls` — Search API transactions with filters (the "Google search" for your traffic)
- `get_api_call` — Drill into a specific call (headers, body, timing)
- `get_api_stats` — Aggregated statistics (throughput, error rates, latency)

### Investigation Strategy

Start broad, then narrow:

1. `get_api_stats` — Get the overall picture: error rates, latency percentiles,
   throughput. Look for spikes or anomalies.
2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find
   the problematic transactions.
3. `get_api_call` on specific calls — inspect headers, bodies, timing to
   understand what went wrong.
4. Use KFL filters (see below) to slice the traffic by namespace, service,
   protocol, or any combination.

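The steps above can be sketched as a sequence of tool calls. The argument names are illustrative only, not the exact MCP schema:

```
get_api_stats     { snapshot: "slim-timestamp" }
list_api_calls    { snapshot: "slim-timestamp", kfl: "http && status_code >= 500" }
get_api_call      { snapshot: "slim-timestamp", id: "<api-call-id>" }
list_api_calls    { snapshot: "slim-timestamp", kfl: 'http && status_code >= 500 && dst.pod.namespace == "production"' }
```
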
## PCAP Extraction

Sometimes you need the raw packets — for Wireshark analysis, sharing with
network teams, or compliance evidence.

### Export a PCAP

**Tool**: `export_snapshot_pcap`

You can export the full snapshot or filter it down using:
- **Nodes** — specific nodes only
- **Time** — sub-window within the snapshot
- **BPF filter** — standard Berkeley Packet Filter syntax (e.g., `host 10.0.53.101`,
  `port 8080`, `net 10.0.0.0/16`)

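When narrowing an export, a few standard BPF expressions cover most cases. These use ordinary libpcap/tcpdump filter syntax; the IPs and ports are placeholders:

```
host 10.0.53.101                   // all traffic to or from one workload IP
host 10.0.53.101 and port 8080     // that workload, one port only
net 10.0.0.0/16 and not port 53    // a subnet, excluding DNS noise
tcp and (port 80 or port 443)      // HTTP/HTTPS traffic only
```
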
### Resolve Workload IPs

When you care about specific workloads but don't have their IPs, resolve them
from the snapshot's metadata. Snapshots preserve the pod-to-IP mappings from
capture time, so you get accurate resolution even if pods have since been
rescheduled.

**Tool**: `resolve_workload`

**Example**: Resolve the IP of `orders-594487879c-7ddxf` from snapshot `slim-timestamp`
→ Returns `10.0.53.101`

Then use that IP in a BPF filter to extract only that workload's traffic:
`export_snapshot_pcap` with BPF `host 10.0.53.101`

## KFL — Kubeshark Filter Language

KFL2 is the query language for slicing through dissected traffic. For the
complete KFL2 reference (all variables, operators, protocol fields, and examples),
see the **KFL skill** (`skills/kfl/`).

### RCA-Specific Filter Patterns

Layer filters progressively when investigating an incident:

```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"

// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500

// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"

// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")

// Step 5: Add timing
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000
```

Other common RCA filters:

```
dns && dns_response && status_code != 0                     // Failed DNS lookups
src.service.namespace != dst.service.namespace              // Cross-namespace traffic
http && elapsed_time > 5000000                              // Slow transactions (> 5s)
conn && conn_state == "open" && conn_local_bytes > 1000000  // High-volume connections
```

## Use Cases

### Post-Incident RCA

The primary use case. Something broke, it's been resolved, and now you need
to understand why.

1. Identify the incident time window from alerts, logs, or user reports
2. Check `get_data_boundaries` — is the window still in raw capture?
3. `create_snapshot` covering the incident window (add buffer: 15 minutes
   before and after the reported time)
4. `start_snapshot_dissection`
5. `get_api_stats` — look for error rate spikes, latency jumps
6. `list_api_calls` filtered to errors — identify the failing service chain
7. `get_api_call` on specific failures — read headers, bodies, timing
8. Follow the dependency chain upstream until you find the originating failure
9. Export relevant PCAPs for the post-mortem document

### Trend Analysis and Drift Detection

Take snapshots at regular intervals (daily, weekly) with consistent parameters.
Compare them to detect:

- **Latency drift** — p95 latency creeping up over days
- **API surface changes** — new endpoints appearing, old ones disappearing
- **Error rate trends** — gradual increase in 5xx responses
- **Traffic pattern shifts** — new service-to-service connections, volume changes
- **Security posture regression** — unencrypted traffic appearing, new external
  connections

**Workflow**:
1. `create_snapshot` with consistent parameters (same time-of-day, same duration)
2. `start_snapshot_dissection` on each
3. `get_api_stats` on each — compare metrics side by side
4. `list_api_calls` with targeted KFL filters — diff the results
5. Flag anomalies and regressions

This is powerful when combined with scheduled tasks — automate daily snapshot
creation and comparison to catch drift before it becomes an incident.

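Comparing `get_api_stats` output across daily snapshots might surface drift in a form like this. All snapshot names and values here are invented for illustration:

```
snapshot           p95 latency   5xx rate   new endpoints
daily-2026-02-01       180 ms       0.2%        0
daily-2026-02-02       195 ms       0.2%        0
daily-2026-02-03       460 ms       2.1%        1    // latency + error drift: investigate
```
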
### Forensic Evidence Preservation

For compliance, legal, or audit requirements:

1. `create_snapshot` immediately when an incident is detected
2. `upload_snapshot_to_cloud` — immutable copy in long-term storage
3. Document the snapshot ID, time window, and chain of custody
4. The snapshot can be downloaded to any Kubeshark cluster for later analysis,
   even months later, even on a completely different cluster

### Production-to-Local Replay

Investigate production issues safely on a local cluster:

1. `create_snapshot` on the production cluster
2. `upload_snapshot_to_cloud`
3. On a local KinD/minikube cluster with Kubeshark: `download_snapshot_from_cloud`
4. `start_snapshot_dissection` — full L7 analysis locally
5. Investigate without touching production

## Composability

This skill is designed to work alongside other Kubeshark-powered skills:

- **API Security Skill** — Run security scans against a snapshot's dissected traffic.
  Take daily snapshots and diff security findings to detect posture drift.
- **Incident Response Skill** — Use this skill's snapshot workflow as the evidence
  preservation and forensic analysis layer within the IR methodology.
- **Network Engineering Skill** — Use snapshots for baseline traffic characterization
  and architecture reviews.

When multiple skills are loaded, they share context. A snapshot created here
can be analyzed by the security skill's OWASP scans or the IR skill's
7-phase methodology.

## Setup Reference

### Installing the CLI

**Homebrew (macOS)**:
```bash
brew install kubeshark
```

**Linux**:
```bash
sh <(curl -Ls https://kubeshark.com/install)
```

**From source**:
```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark && make
```

### MCP Configuration

**Claude Desktop / Cowork** (`claude_desktop_config.json`):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp"]
    }
  }
}
```

**Claude Code (CLI)**:
```bash
claude mcp add kubeshark -- kubeshark mcp
```

**Without kubectl access** (direct URL mode):
```json
{
  "mcpServers": {
    "kubeshark": {
      "command": "kubeshark",
      "args": ["mcp", "--url", "https://kubeshark.example.com"]
    }
  }
}
```

```bash
# Claude Code equivalent:
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```

### Verification

- Claude Code: `/mcp` to check connection status
- Terminal: `kubeshark mcp --list-tools`
- Cluster: `kubectl get pods -l app=kubeshark-hub`

### Troubleshooting

- **Binary not found** → Install via Homebrew or the install script above
- **Connection refused** → Deploy Kubeshark first: `kubeshark tap`
- **No L7 data** → Check `get_dissection_status` and `enable_dissection`
- **Snapshot creation fails** → Verify raw capture is enabled in Kubeshark config
- **Empty snapshot** → Check `get_data_boundaries` — the requested window may
  fall outside available data