Compare commits

...

6 Commits

Author SHA1 Message Date
Alon Girmonsky
37f8becb7b Update label examples: direct access and map_get, note indexed queries 2026-03-18 17:11:35 -07:00
Alon Girmonsky
10dbedf356 Add KFL and Network RCA skills (#1875)
* Add KFL and Network RCA skills

Introduce the skills/ directory with two Kubeshark MCP skills:

- network-rca: Retrospective traffic analysis via snapshots, dissection,
  KFL queries, PCAP extraction, and trend comparison
- kfl: Complete KFL2 (Kubeshark Filter Language) reference covering all
  supported protocols, variables, operators, and filter patterns

Update CLAUDE.md with skill authoring guidelines, structure conventions,
and the list of available Kubeshark MCP tools.

* Optimize skills and add shared setup reference

- network-rca: cut repeated metaphor, add list_api_calls example response,
  consolidate use cases, remove unbuilt composability section, extract
  setup reference to references/setup.md (409 → 306 lines)
- kfl: merge thin protocol sections, fix map_get inconsistency, add
  negation examples, move capture source to reference doc
- kfl2-reference: add most-commonly-used variables table, add inline
  filter examples per protocol section
- Add skills/README.md with usage and contribution guidelines

* Add plugin infrastructure and update READMEs

- Add .claude-plugin/plugin.json and marketplace.json for Claude Code
  plugin distribution
- Add .mcp.json bundling the Kubeshark MCP configuration
- Update skills/README.md with plugin install, manual install, and
  agent compatibility sections
- Update mcp/README.md with AI skills section and install instructions
- Restructure network-rca skill into two distinct investigation routes:
  PCAP (no dissection, BPF filters, Wireshark/compliance) and
  Dissection (indexed queries, AI-driven analysis, payload inspection)

* Remove CLAUDE.md from tracked files

Content now lives in skills/README.md, mcp/README.md, and the skills themselves.

* Add README to .claude-plugin directory

* Reorder MCP config: default mode first, URL mode for no-kubectl

* Move AI Skills section to top of MCP README

* Reorder manual install: symlink first

* Streamline skills README: focus on usage and contributing

* Enforce KFL skill loading before writing filters

- network-rca: require loading KFL skill before constructing filters,
  suggest installation if unavailable
- kfl: set user-invocable: false (background knowledge skill), strengthen
  description to mandate loading before any filter construction

* Move KFL requirement to top of Dissection route

* Add strict fallback: only use exact examples if KFL skill unavailable

* Add clone step to manual installation

* Use $PWD/kubeshark paths in manual install examples

* Add mkdir before symlinks, simplify paths

* Move prerequisites before installation

---------

Co-authored-by: Alon Girmonsky <alongir@Alons-Mac-Studio.local>
2026-03-18 15:31:32 -07:00
Serhii Ponomarenko
963b3e4ac2 🐛 Add default value for demoModeEnabled (#1872) 2026-03-17 13:22:42 -07:00
Volodymyr Stoiko
b2813e02bd Add detailed docs for kubeshark irsa setup (#1871)
Co-authored-by: Alon Girmonsky <1990761+alongir@users.noreply.github.com>
2026-03-16 20:29:49 -07:00
Serhii Ponomarenko
707d7351b6 🛂 Demo Portal: Readonly mode / no authn (#1869)
* 🔨 Add snapshots-updating-enabled `front` env

* 🔨 Add snapshots-updating-enabled config

* 🔨 Add demo-enabled `front` env

* 🔨 Add demo-enabled config

* 🔨 Replace `liveConfigMapChangesDisabled` with `demoModeEnabled` flag

* 🐛 Fix dissection-control-enabled env logic

* 🦺 Handle nullish `demoModeEnabled` value
2026-03-16 20:01:18 -07:00
Serhii Ponomarenko
23c86be773 🛂 Control L4 map visibility (helm value) (#1866)
* 🔨 Add `tap.dashboard.clusterWideMapEnabled` helm value

* 🔨 Add cluster-wide-map-enabled `front` env

* 🔨 Add fallback value for `front` env
2026-03-11 15:36:20 -07:00
17 changed files with 1507 additions and 50 deletions

.claude-plugin/README.md (new file, 33 lines)

@@ -0,0 +1,33 @@
# Kubeshark Claude Code Plugin
This directory contains the [Claude Code plugin](https://docs.anthropic.com/en/docs/claude-code/plugins) configuration for Kubeshark.
## What's here
| File | Purpose |
|------|---------|
| `plugin.json` | Plugin manifest — name, version, description, metadata |
| `marketplace.json` | Marketplace index — allows discovery via `/plugin marketplace add` |
## Installing the plugin
```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```
This loads the Kubeshark AI skills and MCP configuration. Skills appear as
`/kubeshark:network-rca` and `/kubeshark:kfl`.
## What the plugin includes
- **Skills** from [`skills/`](../skills/) — network root cause analysis and KFL filter expertise
- **MCP configuration** from [`.mcp.json`](../.mcp.json) — connects to the Kubeshark MCP server
## Local development
Test the plugin without installing:
```bash
claude --plugin-dir /path/to/kubeshark
```


@@ -0,0 +1,15 @@
{
"name": "kubeshark",
"description": "Kubeshark network observability skills for Kubernetes",
"plugins": [
{
"name": "kubeshark",
"description": "Network observability skills powered by Kubeshark MCP — root cause analysis, KFL traffic filtering, snapshot forensics, PCAP extraction.",
"source": {
"source": "github",
"owner": "kubeshark",
"repo": "kubeshark"
}
}
]
}


@@ -0,0 +1,24 @@
{
"name": "kubeshark",
"version": "1.0.0",
"description": "Kubernetes network observability skills powered by Kubeshark MCP. Root cause analysis, traffic filtering, snapshot forensics, PCAP extraction, and more.",
"author": {
"name": "Kubeshark",
"url": "https://kubeshark.com"
},
"homepage": "https://kubeshark.com",
"repository": "https://github.com/kubeshark/kubeshark",
"license": "Apache-2.0",
"keywords": [
"kubeshark",
"kubernetes",
"network",
"observability",
"traffic",
"mcp",
"rca",
"pcap",
"kfl",
"ebpf"
]
}

.mcp.json (new file, 8 lines)

@@ -0,0 +1,8 @@
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp"]
}
}
}


@@ -153,6 +153,7 @@ func CreateDefaultConfig() ConfigStruct {
},
Dashboard: configStructs.DashboardConfig{
CompleteStreamingEnabled: true,
ClusterWideMapEnabled: false,
},
Capture: configStructs.CaptureConfig{
Dissection: configStructs.DissectionConfig{


@@ -202,6 +202,7 @@ type RoutingConfig struct {
type DashboardConfig struct {
StreamingType string `yaml:"streamingType" json:"streamingType" default:"connect-rpc"`
CompleteStreamingEnabled bool `yaml:"completeStreamingEnabled" json:"completeStreamingEnabled" default:"true"`
ClusterWideMapEnabled bool `yaml:"clusterWideMapEnabled" json:"clusterWideMapEnabled" default:"false"`
}
type FrontRoutingConfig struct {
@@ -209,9 +210,9 @@ type FrontRoutingConfig struct {
}
type ReleaseConfig struct {
Repo string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
Name string `yaml:"name" json:"name" default:"kubeshark"`
Namespace string `yaml:"namespace" json:"namespace" default:"default"`
Repo string `yaml:"repo" json:"repo" default:"https://helm.kubeshark.com"`
Name string `yaml:"name" json:"name" default:"kubeshark"`
Namespace string `yaml:"namespace" json:"namespace" default:"default"`
HelmChartPath string `yaml:"helmChartPath" json:"helmChartPath" default:""`
}
@@ -411,7 +412,6 @@ type TapConfig struct {
Gitops GitopsConfig `yaml:"gitops" json:"gitops"`
Sentry SentryConfig `yaml:"sentry" json:"sentry"`
DefaultFilter string `yaml:"defaultFilter" json:"defaultFilter" default:""`
LiveConfigMapChangesDisabled bool `yaml:"liveConfigMapChangesDisabled" json:"liveConfigMapChangesDisabled" default:"false"`
GlobalFilter string `yaml:"globalFilter" json:"globalFilter" default:""`
EnabledDissectors []string `yaml:"enabledDissectors" json:"enabledDissectors"`
PortMapping PortMapping `yaml:"portMapping" json:"portMapping"`


@@ -232,7 +232,6 @@ Example for overriding image names:
| `tap.sentry.enabled` | Enable sending of error logs to Sentry | `false` |
| `tap.sentry.environment` | Sentry environment to label error logs with | `production` |
| `tap.defaultFilter` | Sets the default dashboard KFL filter (e.g. `http`). By default, this value is set to filter out noisy protocols such as DNS, UDP, ICMP and TCP. The user can easily change this, **temporarily**, in the Dashboard. For a permanent change, you should change this value in the `values.yaml` or `config.yaml` file. | `""` |
| `tap.liveConfigMapChangesDisabled` | If set to `true`, all user functionality (scripting, targeting settings, global & default KFL modification, traffic recording, traffic capturing on/off, protocol dissectors) involving dynamic ConfigMap changes from UI will be disabled | `false` |
| `tap.globalFilter` | Prepends to any KFL filter and can be used to limit what is visible in the dashboard. For example, `redact("request.headers.Authorization")` will redact the appropriate field. Another example `!dns` will not show any DNS traffic. | `""` |
| `tap.metrics.port` | Pod port used to expose Prometheus metrics | `49100` |
| `tap.enabledDissectors` | This is an array of strings representing the list of supported protocols. Remove or comment out redundant protocols (e.g., dns).| The default list excludes: `udp` and `tcp` |


@@ -95,7 +95,85 @@ helm install kubeshark kubeshark/kubeshark \
### Example: IRSA (recommended for EKS)
Create a ConfigMap with bucket configuration:
[IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) lets EKS pods assume an IAM role without static credentials. EKS injects a short-lived token into the pod automatically.
**Prerequisites:**
1. Your EKS cluster must have an [OIDC provider](https://docs.aws.amazon.com/eks/latest/userguide/enable-iam-roles-for-service-accounts.html) associated with it.
2. An IAM role with a trust policy that allows the Kubeshark service account to assume it.
**Step 1 — Create an IAM policy scoped to your bucket:**
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:GetObjectVersion",
"s3:DeleteObjectVersion",
"s3:ListBucket",
"s3:ListBucketVersions",
"s3:GetBucketLocation",
"s3:GetBucketVersioning"
],
"Resource": [
"arn:aws:s3:::my-kubeshark-snapshots",
"arn:aws:s3:::my-kubeshark-snapshots/*"
]
}
]
}
```
> For read-only access, remove `s3:PutObject`, `s3:DeleteObject`, and `s3:DeleteObjectVersion`.
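As a sketch, the same policy trimmed to read-only access (same placeholder bucket name as above) would look like:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:GetBucketLocation",
        "s3:GetBucketVersioning"
      ],
      "Resource": [
        "arn:aws:s3:::my-kubeshark-snapshots",
        "arn:aws:s3:::my-kubeshark-snapshots/*"
      ]
    }
  ]
}
```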
**Step 2 — Create an IAM role with IRSA trust policy:**
```bash
# Get your cluster's OIDC provider URL
OIDC_PROVIDER=$(aws eks describe-cluster --name CLUSTER_NAME \
--query "cluster.identity.oidc.issuer" --output text | sed 's|https://||')
# Create a trust policy
# The default K8s SA name is "<release-name>-service-account" (e.g. "kubeshark-service-account")
cat > trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:NAMESPACE:kubeshark-service-account",
"${OIDC_PROVIDER}:aud": "sts.amazonaws.com"
}
}
}
]
}
EOF
# Create the role and attach your policy
aws iam create-role \
--role-name KubesharkS3Role \
--assume-role-policy-document file://trust-policy.json
aws iam put-role-policy \
--role-name KubesharkS3Role \
--policy-name KubesharkSnapshotsBucketAccess \
--policy-document file://bucket-policy.json
```
**Step 3 — Create a ConfigMap with bucket configuration:**
```yaml
apiVersion: v1
@@ -107,10 +185,12 @@ data:
SNAPSHOT_AWS_REGION: us-east-1
```
Set Helm values:
**Step 4 — Set Helm values with `tap.annotations` to annotate the service account:**
```yaml
tap:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
snapshots:
cloud:
provider: "s3"
@@ -118,7 +198,17 @@ tap:
- kubeshark-s3-config
```
The hub pod's service account must be annotated for IRSA with an IAM role that has S3 access to the bucket.
Or via `--set`:
```bash
helm install kubeshark kubeshark/kubeshark \
--set tap.snapshots.cloud.provider=s3 \
--set tap.snapshots.cloud.s3.bucket=my-kubeshark-snapshots \
--set tap.snapshots.cloud.s3.region=us-east-1 \
--set tap.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT_ID:role/KubesharkS3Role
```
No `accessKey`/`secretKey` is needed — EKS injects credentials automatically via the IRSA token.
### Example: Static Credentials


@@ -26,15 +26,15 @@ spec:
- env:
- name: REACT_APP_AUTH_ENABLED
value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
{{ (and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false }}
{{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
{{- else -}}
{{ .Values.cloudLicenseEnabled | ternary "true" .Values.tap.auth.enabled }}
{{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
{{- end }}'
- name: REACT_APP_AUTH_TYPE
value: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
default
{{- else -}}
{{ .Values.tap.auth.type }}
{{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
{{- end }}'
- name: REACT_APP_COMPLETE_STREAMING_ENABLED
value: '{{- if and (hasKey .Values.tap "dashboard") (hasKey .Values.tap.dashboard "completeStreamingEnabled") -}}
@@ -55,30 +55,22 @@ spec:
false
{{- end }}'
- name: REACT_APP_SCRIPTING_DISABLED
value: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
{{- if .Values.demoModeEnabled -}}
{{ .Values.demoModeEnabled | ternary false true }}
{{- else -}}
true
{{- end }}
{{- else -}}
false
{{- end }}'
value: '{{ default false .Values.demoModeEnabled }}'
- name: REACT_APP_TARGETED_PODS_UPDATE_DISABLED
value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
value: '{{ default false .Values.demoModeEnabled }}'
- name: REACT_APP_PRESET_FILTERS_CHANGING_ENABLED
value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
value: '{{ not (default false .Values.demoModeEnabled) }}'
- name: REACT_APP_BPF_OVERRIDE_DISABLED
value: '{{ eq .Values.tap.packetCapture "af_packet" | ternary "false" "true" }}'
- name: REACT_APP_RECORDING_DISABLED
value: '{{ .Values.tap.liveConfigMapChangesDisabled }}'
value: '{{ default false .Values.demoModeEnabled }}'
- name: REACT_APP_DISSECTION_ENABLED
value: '{{ .Values.tap.capture.dissection.enabled | ternary "true" "false" }}'
- name: REACT_APP_DISSECTION_CONTROL_ENABLED
value: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
value: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
true
{{- else -}}
{{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
{{ not (default false .Values.demoModeEnabled) | ternary false true }}
{{- end -}}'
- name: 'REACT_APP_CLOUD_LICENSE_ENABLED'
value: '{{- if or (and .Values.cloudLicenseEnabled (not (empty .Values.license))) (not .Values.internetConnectivity) -}}
@@ -91,7 +83,13 @@ spec:
- name: REACT_APP_BETA_ENABLED
value: '{{ default false .Values.betaEnabled | ternary "true" "false" }}'
- name: REACT_APP_DISSECTORS_UPDATING_ENABLED
value: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
value: '{{ not (default false .Values.demoModeEnabled) }}'
- name: REACT_APP_SNAPSHOTS_UPDATING_ENABLED
value: '{{ not (default false .Values.demoModeEnabled) }}'
- name: REACT_APP_DEMO_MODE_ENABLED
value: '{{ default false .Values.demoModeEnabled }}'
- name: REACT_APP_CLUSTER_WIDE_MAP_ENABLED
value: '{{ default false (((.Values).tap).dashboard).clusterWideMapEnabled }}'
- name: REACT_APP_RAW_CAPTURE_ENABLED
value: '{{ .Values.tap.capture.raw.enabled | ternary "true" "false" }}'
- name: REACT_APP_SENTRY_ENABLED


@@ -19,14 +19,14 @@ data:
INGRESS_HOST: '{{ .Values.tap.ingress.host }}'
PROXY_FRONT_PORT: '{{ .Values.tap.proxy.front.port }}'
AUTH_ENABLED: '{{- if and .Values.cloudLicenseEnabled (not (empty .Values.license)) -}}
{{ and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex") | ternary true false }}
{{ (default false .Values.demoModeEnabled) | ternary true ((and .Values.tap.auth.enabled (eq .Values.tap.auth.type "dex")) | ternary true false) }}
{{- else -}}
{{ .Values.cloudLicenseEnabled | ternary "true" (.Values.tap.auth.enabled | ternary "true" "") }}
{{ .Values.cloudLicenseEnabled | ternary "true" ((default false .Values.demoModeEnabled) | ternary "true" .Values.tap.auth.enabled) }}
{{- end }}'
AUTH_TYPE: '{{- if and .Values.cloudLicenseEnabled (not (eq .Values.tap.auth.type "dex")) -}}
default
{{- else -}}
{{ .Values.tap.auth.type }}
{{ (default false .Values.demoModeEnabled) | ternary "default" .Values.tap.auth.type }}
{{- end }}'
AUTH_SAML_IDP_METADATA_URL: '{{ .Values.tap.auth.saml.idpMetadataUrl }}'
AUTH_SAML_ROLE_ATTRIBUTE: '{{ .Values.tap.auth.saml.roleAttribute }}'
@@ -44,22 +44,14 @@ data:
false
{{- end }}'
TELEMETRY_DISABLED: '{{ not .Values.internetConnectivity | ternary "true" (not .Values.tap.telemetry.enabled | ternary "true" "false") }}'
SCRIPTING_DISABLED: '{{- if .Values.tap.liveConfigMapChangesDisabled -}}
{{- if .Values.demoModeEnabled -}}
{{ .Values.demoModeEnabled | ternary false true }}
{{- else -}}
true
{{- end }}
{{- else -}}
false
{{- end }}'
TARGETED_PODS_UPDATE_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
PRESET_FILTERS_CHANGING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
RECORDING_DISABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "true" "" }}'
DISSECTION_CONTROL_ENABLED: '{{- if and .Values.tap.liveConfigMapChangesDisabled (not .Values.tap.capture.dissection.enabled) -}}
SCRIPTING_DISABLED: '{{ default false .Values.demoModeEnabled }}'
TARGETED_PODS_UPDATE_DISABLED: '{{ default false .Values.demoModeEnabled }}'
PRESET_FILTERS_CHANGING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
RECORDING_DISABLED: '{{ (default false .Values.demoModeEnabled) | ternary true false }}'
DISSECTION_CONTROL_ENABLED: '{{- if and (not .Values.demoModeEnabled) (not .Values.tap.capture.dissection.enabled) -}}
true
{{- else -}}
{{ not .Values.tap.liveConfigMapChangesDisabled | ternary "true" "false" }}
{{ (default false .Values.demoModeEnabled) | ternary false true }}
{{- end }}'
GLOBAL_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.globalFilter | quote }}
DEFAULT_FILTER: {{ include "kubeshark.escapeDoubleQuotes" .Values.tap.defaultFilter | quote }}
@@ -76,7 +68,9 @@ data:
DUPLICATE_TIMEFRAME: '{{ .Values.tap.misc.duplicateTimeframe }}'
ENABLED_DISSECTORS: '{{ gt (len .Values.tap.enabledDissectors) 0 | ternary (join "," .Values.tap.enabledDissectors) "" }}'
CUSTOM_MACROS: '{{ toJson .Values.tap.customMacros }}'
DISSECTORS_UPDATING_ENABLED: '{{ .Values.tap.liveConfigMapChangesDisabled | ternary "false" "true" }}'
DISSECTORS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
SNAPSHOTS_UPDATING_ENABLED: '{{ not (default false .Values.demoModeEnabled) }}'
DEMO_MODE_ENABLED: '{{ default false .Values.demoModeEnabled }}'
DETECT_DUPLICATES: '{{ .Values.tap.misc.detectDuplicates | ternary "true" "false" }}'
PCAP_DUMP_ENABLE: '{{ .Values.pcapdump.enabled }}'
PCAP_TIME_INTERVAL: '{{ .Values.pcapdump.timeInterval }}'


@@ -185,6 +185,7 @@ tap:
dashboard:
streamingType: connect-rpc
completeStreamingEnabled: true
clusterWideMapEnabled: false
telemetry:
enabled: true
resourceGuard:
@@ -197,7 +198,6 @@ tap:
enabled: false
environment: production
defaultFilter: ""
liveConfigMapChangesDisabled: false
globalFilter: ""
enabledDissectors:
- amqp


@@ -2,6 +2,18 @@
[Kubeshark](https://kubeshark.com) MCP (Model Context Protocol) server enables AI assistants like Claude Desktop, Cursor, and other MCP-compatible clients to query real-time Kubernetes network traffic.
## AI Skills
The MCP provides the tools — [AI skills](../skills/) teach agents how to use them.
Skills turn raw MCP capabilities into domain-specific workflows like root cause
analysis, traffic filtering, and forensic investigation. See the
[skills README](../skills/README.md) for installation and usage.
| Skill | Description |
|-------|-------------|
| [`network-rca`](../skills/network-rca/) | Network Root Cause Analysis — snapshot-based retrospective investigation with PCAP and dissection routes |
| [`kfl`](../skills/kfl/) | KFL2 filter expert — write, debug, and optimize traffic queries across all supported protocols |
## Features
- **L7 API Traffic Analysis**: Query HTTP, gRPC, Redis, Kafka, DNS transactions
@@ -34,20 +46,20 @@ Add to your Claude Desktop configuration:
**macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`
#### URL Mode (Recommended for existing deployments)
#### Default (requires kubectl access / kube context)
```json
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp", "--url", "https://kubeshark.example.com"]
"args": ["mcp"]
}
}
}
```
#### Proxy Mode (Requires kubectl access)
With an explicit kubeconfig path:
```json
{
@@ -59,14 +71,18 @@ Add to your Claude Desktop configuration:
}
}
```
or:
#### URL Mode (no kubectl required)
Use this when the machine doesn't have kubectl access or a kube context.
Connect directly to an existing Kubeshark deployment:
```json
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp"]
"args": ["mcp", "--url", "https://kubeshark.example.com"]
}
}
}

skills/README.md (new file, 120 lines)

@@ -0,0 +1,120 @@
# Kubeshark AI Skills
Open-source AI skills that work with the [Kubeshark MCP](https://github.com/kubeshark/kubeshark).
Skills teach AI agents how to use Kubeshark's MCP tools for specific workflows
like root cause analysis, traffic filtering, and forensic investigation.
Skills use the open [Agent Skills](https://github.com/anthropics/skills) format
and work with Claude Code, OpenAI Codex CLI, Gemini CLI, Cursor, and other
compatible agents.
## Available Skills
| Skill | Description |
|-------|-------------|
| [`network-rca`](network-rca/) | Network Root Cause Analysis. Retrospective traffic analysis via snapshots, with two investigation routes: PCAP (for Wireshark/compliance) and Dissection (for AI-driven API-level investigation). |
| [`kfl`](kfl/) | KFL2 (Kubeshark Filter Language) expert. Complete reference for writing, debugging, and optimizing CEL-based traffic filters across all supported protocols. |
## Prerequisites
All skills require the Kubeshark MCP:
```bash
# Claude Code
claude mcp add kubeshark -- kubeshark mcp
# Without kubectl access (direct URL)
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```
For Claude Desktop, add to `claude_desktop_config.json`:
```json
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp"]
}
}
}
```
## Installation
### Option 1: Plugin (recommended)
Install as a Claude Code plugin directly from GitHub:
```
/plugin marketplace add kubeshark/kubeshark
/plugin install kubeshark
```
Skills appear as `/kubeshark:network-rca` and `/kubeshark:kfl`. The plugin
also bundles the Kubeshark MCP configuration automatically.
### Option 2: Clone and run
```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark
claude
```
Skills trigger automatically based on your conversation.
### Option 3: Manual installation
Clone the repo (if you haven't already), then symlink or copy the skills:
```bash
git clone https://github.com/kubeshark/kubeshark
mkdir -p ~/.claude/skills
# Symlink to stay in sync with the repo (recommended)
ln -s "$PWD/kubeshark/skills/network-rca" ~/.claude/skills/network-rca
ln -s "$PWD/kubeshark/skills/kfl" ~/.claude/skills/kfl
# Or copy to your project (project scope only)
mkdir -p .claude/skills
cp -r kubeshark/skills/network-rca .claude/skills/
cp -r kubeshark/skills/kfl .claude/skills/
# Or copy for personal use (all your projects)
cp -r kubeshark/skills/network-rca ~/.claude/skills/
cp -r kubeshark/skills/kfl ~/.claude/skills/
```
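Note that the symlink target must be an absolute path: a relative target is resolved from the link's own directory, not from where you ran `ln`, so a relative target would dangle. A self-contained illustration using throwaway paths (not the real install locations):

```shell
# Throwaway directories standing in for the clone and ~/.claude (illustrative paths)
mkdir -p /tmp/skill-demo/repo/skills/kfl /tmp/skill-demo/home/.claude/skills

# Absolute target: resolves correctly no matter where the link is read from
ln -sf /tmp/skill-demo/repo/skills/kfl /tmp/skill-demo/home/.claude/skills/kfl

# The link stores the absolute target verbatim
readlink /tmp/skill-demo/home/.claude/skills/kfl
```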
## Contributing
We welcome contributions — whether improving an existing skill or proposing a new one.
- **Suggest improvements**: Open an issue or PR with changes to an existing skill's `SKILL.md`
or reference docs. Better examples, clearer workflows, and additional filter patterns
are always appreciated.
- **Add a new skill**: Open an issue describing the use case first. New skills should
follow the structure below and reference Kubeshark MCP tools by exact name.
### Skill structure
```
skills/
└── <skill-name>/
├── SKILL.md # Required. YAML frontmatter + markdown body.
└── references/ # Optional. Detailed reference docs.
└── *.md
```
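A minimal `SKILL.md` sketch following this structure (names and wording are placeholders; see `skills/kfl/SKILL.md` for a real frontmatter example):

```
---
name: my-skill
description: >
  What the skill does, plus generous trigger keywords so agents
  load it at the right time.
---
# My Skill

Imperative instructions for the agent, referencing Kubeshark MCP tools
by exact name. Put detailed material in references/*.md and link it here.
```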
### Guidelines
- Keep `SKILL.md` under 500 lines. Use `references/` for detailed content.
- Use imperative tone. Reference MCP tools by exact name.
- Include realistic example tool responses.
- The `description` frontmatter should be generous with trigger keywords.
### Planned skills
- `api-security` — OWASP API Top 10 assessment against live or snapshot traffic.
- `incident-response` — 7-phase forensic incident investigation methodology.
- `network-engineering` — Real-time traffic analysis, latency debugging, dependency mapping.

skills/kfl/SKILL.md (new file, 344 lines)

@@ -0,0 +1,344 @@
---
name: kfl
user-invocable: false
description: >
KFL2 (Kubeshark Filter Language) reference. This skill MUST be loaded before
writing, constructing, or suggesting any KFL filter expression. KFL is statically
typed — incorrect field names or syntax will fail silently or error. Do not guess
at KFL syntax without this skill loaded. Trigger on any mention of KFL, CEL filters,
traffic filtering, display filters, query syntax, filter expressions, write a filter,
construct a query, build a KFL, create a filter expression, "how do I filter",
"show me only", "find traffic where", protocol-specific queries (HTTP status codes,
DNS lookups, Redis commands, Kafka topics), Kubernetes-aware filtering (by namespace,
pod, service, label, annotation), L4 connection/flow filters, time-based queries,
or any request to slice/search/narrow network traffic in Kubeshark. Also trigger
when other skills need to construct filters — KFL is the query language for all
Kubeshark traffic analysis.
---
# KFL2 — Kubeshark Filter Language
You are a KFL2 expert. KFL2 is built on Google's CEL (Common Expression Language)
and is the query language for all Kubeshark traffic analysis. It operates as a
**display filter** — it doesn't affect what's captured, only what you see.
Think of KFL the way you think of SQL for databases or Google search syntax for
the web. Kubeshark captures and indexes all cluster traffic; KFL is how you
search it.
For the complete variable and field reference, see `references/kfl2-reference.md`.
## Core Syntax
KFL expressions are boolean CEL expressions. An empty filter matches everything.
### Operators
| Category | Operators |
|----------|-----------|
| Comparison | `==`, `!=`, `<`, `<=`, `>`, `>=` |
| Logical | `&&`, `\|\|`, `!` |
| Arithmetic | `+`, `-`, `*`, `/`, `%` |
| Membership | `in` |
| Ternary | `condition ? true_val : false_val` |
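An illustrative filter combining several operator categories (a sketch — it assumes CEL integer division, and uses HTTP fields documented later in this skill):

```
// Client errors (4xx) on mutating methods — integer division buckets the status code
http && method in ["POST", "PUT", "DELETE"] && status_code / 100 == 4
```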
### String Functions
```
str.contains(substring) // Substring search
str.startsWith(prefix) // Prefix match
str.endsWith(suffix) // Suffix match
str.matches(regex) // Regex match
size(str) // String length
```
### Collection Functions
```
size(collection) // List/map/string length
key in map // Key existence
map[key] // Value access
map_get(map, key, default) // Safe access with default
value in list // List membership
```
### Time Functions
```
timestamp("2026-03-14T22:00:00Z") // Parse ISO timestamp
duration("5m") // Parse duration
now() // Current time (snapshot at filter creation)
```
### Negation
```
!http // Everything that is NOT HTTP
http && status_code != 200 // HTTP responses that aren't 200
http && !path.contains("/health") // Exclude health checks
!(src.pod.namespace == "kube-system") // Exclude system namespace
```
## Protocol Detection
Boolean flags that indicate which protocol was detected. Use these as the first
filter term — they're fast and narrow the search space immediately.
| Flag | Protocol | Flag | Protocol |
|------|----------|------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL | `amqp` | AMQP |
| `tcp` | TCP | `ldap` | LDAP |
| `udp` | UDP | `ws` | WebSocket |
| `sctp` | SCTP | `gql` | GraphQL (v1+v2) |
| `icmp` | ICMP | `gqlv1` / `gqlv2` | GraphQL version-specific |
| `radius` | RADIUS | `conn` / `flow` | L4 connection/flow tracking |
| `diameter` | Diameter | `tcp_conn` / `udp_conn` | Transport-specific connections |
## Kubernetes Context
The most common starting point. Filter by where traffic originates or terminates.
### Pod and Service Fields
```
src.pod.name == "orders-594487879c-7ddxf"
dst.pod.namespace == "production"
src.service.name == "api-gateway"
dst.service.namespace == "payments"
```
Pod fields fall back to service data when pod info is unavailable, so
`dst.pod.namespace` works even for service-level entries.
### Aggregate Collections
Match against any direction (src or dst):
```
"production" in namespaces // Any namespace match
"orders" in pods // Any pod name match
"api-gateway" in services // Any service name match
```
### Labels and Annotations
```
// Direct access — works when the label is expected to exist
local_labels.app == "payment" || remote_labels.app == "payment"
// Safe access with default — use when the label may not exist
map_get(local_labels, "app", "") == "checkout"
map_get(remote_labels, "version", "") == "canary"
// Label existence check
"tier" in local_labels
```
Direct access (`local_labels.app`) returns an error if the key doesn't exist.
Use `map_get()` when you're not sure the label is present on all workloads.
Queries can be as complex as needed — combine labels with any other fields.
Responses are fast because all API elements are indexed:
```
local_labels.app == "payment" && http && status_code >= 500 && dst.pod.namespace == "production"
```
### Node and Process
```
node_name == "ip-10-0-25-170.ec2.internal"
local_process_name == "nginx"
remote_process_name.contains("postgres")
```
### DNS Resolution
```
src.dns == "api.example.com"
dst.dns.contains("redis")
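Pulling these together — a sketch that narrows by namespace, protocol, and resolved destination, using only fields documented above:

```
// Server errors from production pods to anything resolving to a postgres host
http && status_code >= 500 && src.pod.namespace == "production" && dst.dns.contains("postgres")
```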
```
## HTTP Filtering
HTTP is the most common protocol for API-level investigation.
### Fields
| Field | Type | Example |
|-------|------|---------|
| `method` | string | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"` |
| `url` | string | Full path + query: `"/api/users?id=123"` |
| `path` | string | Path only: `"/api/users"` |
| `status_code` | int | `200`, `404`, `500` |
| `http_version` | string | `"HTTP/1.1"`, `"HTTP/2"` |
| `request.headers` | map | `request.headers["content-type"]` |
| `response.headers` | map | `response.headers["server"]` |
| `request.cookies` | map | `request.cookies["session"]` |
| `response.cookies` | map | `response.cookies["token"]` |
| `query_string` | map | `query_string["id"]` |
| `request_body_size` | int | Request body bytes |
| `response_body_size` | int | Response body bytes |
| `elapsed_time` | int | Duration in **microseconds** |
### Common Patterns
```
// Error investigation
http && status_code >= 500 // Server errors
http && status_code == 429 // Rate limiting
http && status_code >= 400 && status_code < 500 // Client errors
// Endpoint targeting
http && method == "POST" && path.contains("/orders")
http && url.matches(".*/api/v[0-9]+/users.*")
// Performance
http && elapsed_time > 5000000 // > 5 seconds
http && response_body_size > 1000000 // > 1MB responses
// Header inspection
http && "authorization" in request.headers
http && request.headers["content-type"] == "application/json"
// GraphQL (subset of HTTP)
gql && method == "POST" && status_code >= 400
```
## DNS Filtering
DNS issues are often the hidden root cause of outages.
| Field | Type | Description |
|-------|------|-------------|
| `dns_questions` | []string | Question domain names |
| `dns_answers` | []string | Answer domain names |
| `dns_question_types` | []string | Record types: A, AAAA, CNAME, MX, TXT, SRV, PTR |
| `dns_request` | bool | Is request |
| `dns_response` | bool | Is response |
| `dns_request_length` | int | Request size |
| `dns_response_length` | int | Response size |
```
dns && "api.external-service.com" in dns_questions
dns && dns_response && status_code != 0 // Failed lookups
dns && "A" in dns_question_types // A record queries
dns && size(dns_questions) > 1 // Multi-question
```
## Database and Messaging Protocols
### Redis
```
redis && redis_type == "GET" // Command type
redis && redis_key.startsWith("session:") // Key pattern
redis && redis_command.contains("DEL") // Command search
redis && redis_total_size > 10000 // Large operations
```
### Kafka
```
kafka && kafka_api_key_name == "PRODUCE" // Produce operations
kafka && kafka_client_id == "payment-processor" // Client filtering
kafka && kafka_request_summary.contains("orders") // Topic filtering
kafka && kafka_size > 10000 // Large messages
```
### AMQP, LDAP, RADIUS, Diameter
```
amqp && amqp_method == "basic.publish" // AMQP publish
ldap && ldap_type == "bind" // LDAP bind requests
radius && radius_code_name == "Access-Request" // RADIUS auth
diameter && diameter_method.contains("Credit") // Diameter credit control
```
For the full variable list for these protocols, see `references/kfl2-reference.md`.
## Transport Layer (L4)
### TCP/UDP Fields
```
tcp && tcp_error_type != "" // TCP errors
udp && udp_length > 1000 // Large UDP packets
```
### Connection Tracking
```
conn && conn_state == "open" // Active connections
conn && conn_local_bytes > 1000000 // High-volume
conn && "HTTP" in conn_l7_detected // L7 protocol detection
tcp_conn && conn_state == "closed" // Closed TCP connections
```
### Flow Tracking (with Rate Metrics)
```
flow && flow_local_pps > 1000 // High packet rate
flow && flow_local_bps > 1000000 // High bandwidth
flow && flow_state == "closed" && "TLS" in flow_l7_detected
tcp_flow && flow_local_bps > 5000000 // High-throughput TCP
```
## Network Layer
```
src.ip == "10.0.53.101"
dst.ip.startsWith("192.168.")
src.port == 8080
dst.port >= 8000 && dst.port <= 9000
```
## Time-Based Filtering
```
timestamp > timestamp("2026-03-14T22:00:00Z")
timestamp >= timestamp("2026-03-14T22:00:00Z") && timestamp <= timestamp("2026-03-14T23:00:00Z")
timestamp > now() - duration("5m") // Last 5 minutes
elapsed_time > 2000000 // Older than 2 seconds
```
## Building Filters: Progressive Narrowing
The most effective investigation technique — start broad, add constraints:
```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"
// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500
// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"
// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")
// Step 5: Add timing
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge") && elapsed_time > 2000000
```
## Performance Tips
1. **Protocol flags first** — `http && ...` is faster than `... && http`
2. **`startsWith`/`endsWith` over `contains`** — prefix/suffix checks are faster
3. **Specific ports before string ops** — `dst.port == 80` is cheaper than `url.contains(...)`
4. **Use `map_get` for labels** — avoids errors on missing keys
5. **Keep filters simple** — CEL short-circuits on `&&`, so put cheap checks first
## Type Safety
KFL2 is statically typed. Common gotchas:
- `status_code` is `int`, not string — use `status_code == 200`, not `"200"`
- `elapsed_time` is in **microseconds** — 5 seconds = `5000000`
- `timestamp` requires `timestamp()` function — not a raw string
- Map access on missing keys errors — use `key in map` or `map_get()` first
- List membership uses `value in list` — not `list.contains(value)`
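The gotchas above, as corrected filters (all field names appear elsewhere in this document; the literal values are illustrative):

```
http && status_code == 200                      // int comparison, not "200"
http && elapsed_time > 5000000                  // 5 seconds expressed in microseconds
timestamp > timestamp("2026-03-14T22:00:00Z")   // wrapped timestamp, not a raw string
"authorization" in request.headers              // guard before map indexing
dns && "api.example.com" in dns_questions       // list membership via `in`
```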

# KFL2 Complete Variable and Field Reference
This is the exhaustive reference for every variable available in KFL2 filters.
KFL2 is built on Google's CEL (Common Expression Language) and evaluates against
Kubeshark's protobuf-based `BaseEntry` structure.
## Most Commonly Used Variables
These are the variables you'll reach for in 90% of investigations:
| Variable | Type | What it's for |
|----------|------|---------------|
| `status_code` | int | HTTP response status (200, 404, 500) |
| `method` | string | HTTP method (GET, POST, PUT, DELETE) |
| `path` | string | URL path without query string |
| `dst.pod.namespace` | string | Where traffic is going (namespace) |
| `dst.service.name` | string | Where traffic is going (service) |
| `src.pod.name` | string | Where traffic comes from (pod) |
| `elapsed_time` | int | Request duration in microseconds |
| `dns_questions` | []string | DNS domains being queried |
| `namespaces` | []string | All namespaces involved (src + dst) |
## Network-Level Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `src.ip` | string | Source IP address | `"10.0.53.101"` |
| `dst.ip` | string | Destination IP address | `"192.168.1.1"` |
| `src.port` | int | Source port number | `43210` |
| `dst.port` | int | Destination port number | `8080` |
| `protocol` | string | Detected protocol type | `"HTTP"`, `"DNS"` |
## Identity and Metadata Variables
| Variable | Type | Description |
|----------|------|-------------|
| `id` | int | BaseEntry unique identifier (assigned by sniffer) |
| `node_id` | string | Node identifier (assigned by hub) |
| `index` | int | Entry index for stream uniqueness |
| `stream` | string | Stream identifier (hex string) |
| `timestamp` | timestamp | Event time (UTC), use with `timestamp()` function |
| `elapsed_time` | int | Age since timestamp in microseconds |
| `worker` | string | Worker identifier |
## Cross-Reference Variables
| Variable | Type | Description |
|----------|------|-------------|
| `conn_id` | int | L7 to L4 connection cross-reference ID |
| `flow_id` | int | L7 to L4 flow cross-reference ID |
| `has_pcap` | bool | Whether PCAP data is available for this entry |
## Capture Source Variables
| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `capture_source` | string | Canonical capture source | `"unspecified"`, `"af_packet"`, `"ebpf"`, `"ebpf_tls"` |
| `capture_backend` | string | Backend family | `"af_packet"`, `"ebpf"` |
| `capture_source_code` | int | Numeric enum | 0=unspecified, 1=af_packet, 2=ebpf, 3=ebpf_tls |
| `capture` | map | Nested map access | `capture["source"]`, `capture["backend"]` |
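For example, to isolate entries captured via eBPF TLS interception, or to select by backend family (values taken from the enums above):

```
capture_source == "ebpf_tls"
capture_backend == "ebpf"
```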
## Protocol Detection Flags
Boolean variables indicating detected protocol. Use as first filter term for performance.
| Variable | Protocol | Variable | Protocol |
|----------|----------|----------|----------|
| `http` | HTTP/1.1, HTTP/2 | `redis` | Redis |
| `dns` | DNS | `kafka` | Kafka |
| `tls` | TLS/SSL handshake | `amqp` | AMQP messaging |
| `tcp` | TCP transport | `ldap` | LDAP directory |
| `udp` | UDP transport | `ws` | WebSocket |
| `sctp` | SCTP streaming | `gql` | GraphQL (v1 or v2) |
| `icmp` | ICMP | `gqlv1` | GraphQL v1 only |
| `radius` | RADIUS auth | `gqlv2` | GraphQL v2 only |
| `diameter` | Diameter | `conn` | L4 connection tracking |
| `flow` | L4 flow tracking | `tcp_conn` | TCP connection tracking |
| `tcp_flow` | TCP flow tracking | `udp_conn` | UDP connection tracking |
| `udp_flow` | UDP flow tracking | | |
## HTTP Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `method` | string | HTTP method | `"GET"`, `"POST"`, `"PUT"`, `"DELETE"`, `"PATCH"` |
| `url` | string | Full URL path and query string | `"/api/users?id=123"` |
| `path` | string | URL path component (no query) | `"/api/users"` |
| `status_code` | int | HTTP response status code | `200`, `404`, `500` |
| `http_version` | string | HTTP protocol version | `"HTTP/1.1"`, `"HTTP/2"` |
| `query_string` | map[string]string | Parsed URL query parameters | `query_string["id"]` → `"123"` |
| `request.headers` | map[string]string | Request HTTP headers | `request.headers["content-type"]` |
| `response.headers` | map[string]string | Response HTTP headers | `response.headers["server"]` |
| `request.cookies` | map[string]string | Request cookies | `request.cookies["session"]` |
| `response.cookies` | map[string]string | Response cookies | `response.cookies["token"]` |
| `request_headers_size` | int | Request headers size in bytes | |
| `request_body_size` | int | Request body size in bytes | |
| `response_headers_size` | int | Response headers size in bytes | |
| `response_body_size` | int | Response body size in bytes | |
GraphQL requests set `gql` (plus `gqlv1` or `gqlv2`) to true and expose all HTTP
variables.
**Example**: `http && method == "POST" && status_code >= 500 && path.contains("/api")`
## DNS Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `dns_questions` | []string | Question domain names (request + response) | `["example.com"]` |
| `dns_answers` | []string | Answer domain names | `["1.2.3.4"]` |
| `dns_question_types` | []string | Record types in questions | `["A"]`, `["AAAA"]`, `["CNAME"]` |
| `dns_request` | bool | Is DNS request message | |
| `dns_response` | bool | Is DNS response message | |
| `dns_request_length` | int | DNS request size in bytes (0 if absent) | |
| `dns_response_length` | int | DNS response size in bytes (0 if absent) | |
| `dns_total_size` | int | Sum of request + response sizes | |
Supported question types: A, AAAA, NS, CNAME, SOA, MX, TXT, SRV, PTR, ANY.
**Example**: `dns && dns_response && status_code != 0` (failed DNS lookups)
## TLS Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `tls` | bool | TLS payload detected | |
| `tls_summary` | string | TLS handshake summary | `"ClientHello"`, `"ServerHello"` |
| `tls_info` | string | TLS connection details | `"TLS 1.3, AES-256-GCM"` |
| `tls_request_size` | int | TLS request size in bytes | |
| `tls_response_size` | int | TLS response size in bytes | |
| `tls_total_size` | int | Sum of request + response (computed if not provided) | |
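**Example**: `tls && tls_summary == "ClientHello"` (client handshake initiations)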
## TCP Variables
| Variable | Type | Description |
|----------|------|-------------|
| `tcp` | bool | TCP payload detected |
| `tcp_method` | string | TCP method information |
| `tcp_payload` | bytes | Raw TCP payload data |
| `tcp_error_type` | string | TCP error type (empty if none) |
| `tcp_error_message` | string | TCP error message (empty if none) |
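**Example**: `tcp && tcp_error_type != ""` (entries carrying TCP-level errors)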
## UDP Variables
| Variable | Type | Description |
|----------|------|-------------|
| `udp` | bool | UDP payload detected |
| `udp_length` | int | UDP packet length |
| `udp_checksum` | int | UDP checksum value |
| `udp_payload` | bytes | Raw UDP payload data |
## SCTP Variables
| Variable | Type | Description |
|----------|------|-------------|
| `sctp` | bool | SCTP payload detected |
| `sctp_checksum` | int | SCTP checksum value |
| `sctp_chunk_type` | string | SCTP chunk type |
| `sctp_length` | int | SCTP chunk length |
## ICMP Variables
| Variable | Type | Description |
|----------|------|-------------|
| `icmp` | bool | ICMP payload detected |
| `icmp_type` | string | ICMP type code |
| `icmp_version` | int | ICMP version (4 or 6) |
| `icmp_length` | int | ICMP message length |
## WebSocket Variables
| Variable | Type | Description | Values |
|----------|------|-------------|--------|
| `ws` | bool | WebSocket payload detected | |
| `ws_opcode` | string | WebSocket operation code | `"text"`, `"binary"`, `"close"`, `"ping"`, `"pong"` |
| `ws_request` | bool | Is WebSocket request | |
| `ws_response` | bool | Is WebSocket response | |
| `ws_request_payload_data` | string | Request payload (safely truncated) | |
| `ws_request_payload_length` | int | Request payload length in bytes | |
| `ws_response_payload_length` | int | Response payload length in bytes | |
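**Example**: `ws && ws_opcode == "close"` (WebSocket teardown frames)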
## Redis Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `redis` | bool | Redis payload detected | |
| `redis_type` | string | Redis command verb | `"GET"`, `"SET"`, `"DEL"`, `"HGET"` |
| `redis_command` | string | Full Redis command line | `"GET session:1234"` |
| `redis_key` | string | Key (truncated to 64 bytes) | `"session:1234"` |
| `redis_request_size` | int | Request size (0 if absent) | |
| `redis_response_size` | int | Response size (0 if absent) | |
| `redis_total_size` | int | Sum of request + response | |
**Example**: `redis && redis_type == "GET" && redis_key.startsWith("session:")`
## Kafka Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `kafka` | bool | Kafka payload detected | |
| `kafka_api_key` | int | Kafka API key number | 0=FETCH, 1=PRODUCE |
| `kafka_api_key_name` | string | Human-readable API operation | `"PRODUCE"`, `"FETCH"` |
| `kafka_client_id` | string | Kafka client identifier | `"payment-processor"` |
| `kafka_size` | int | Message size (request preferred, else response) | |
| `kafka_request` | bool | Is Kafka request | |
| `kafka_response` | bool | Is Kafka response | |
| `kafka_request_summary` | string | Request summary/topic | `"orders-topic"` |
| `kafka_request_size` | int | Request size (0 if absent) | |
| `kafka_response_size` | int | Response size (0 if absent) | |
**Example**: `kafka && kafka_api_key_name == "PRODUCE" && kafka_request_summary.contains("orders")`
## AMQP Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `amqp` | bool | AMQP payload detected | |
| `amqp_method` | string | AMQP method name | `"basic.publish"`, `"channel.open"` |
| `amqp_summary` | string | Operation summary | |
| `amqp_request` | bool | Is AMQP request | |
| `amqp_response` | bool | Is AMQP response | |
| `amqp_request_length` | int | Request length (0 if absent) | |
| `amqp_response_length` | int | Response length (0 if absent) | |
| `amqp_total_size` | int | Sum of request + response | |
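**Example**: `amqp && amqp_method == "basic.publish"` (message publishes)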
## LDAP Variables
| Variable | Type | Description |
|----------|------|-------------|
| `ldap` | bool | LDAP payload detected |
| `ldap_type` | string | LDAP operation type (request preferred) |
| `ldap_summary` | string | Operation summary |
| `ldap_request` | bool | Is LDAP request |
| `ldap_response` | bool | Is LDAP response |
| `ldap_request_length` | int | Request length (0 if absent) |
| `ldap_response_length` | int | Response length (0 if absent) |
| `ldap_total_size` | int | Sum of request + response |
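**Example**: `ldap && ldap_type == "bind"` (bind operations)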
## RADIUS Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `radius` | bool | RADIUS payload detected | |
| `radius_code` | int | RADIUS code (request preferred) | |
| `radius_code_name` | string | Code name | `"Access-Request"` |
| `radius_request` | bool | Is RADIUS request | |
| `radius_response` | bool | Is RADIUS response | |
| `radius_request_authenticator` | string | Request authenticator (hex) | |
| `radius_request_length` | int | Request size (0 if absent) | |
| `radius_response_length` | int | Response size (0 if absent) | |
| `radius_total_size` | int | Sum of request + response | |
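**Example**: `radius && radius_code_name == "Access-Request"` (authentication requests)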
## Diameter Variables
| Variable | Type | Description |
|----------|------|-------------|
| `diameter` | bool | Diameter payload detected |
| `diameter_method` | string | Method name (request preferred) |
| `diameter_summary` | string | Operation summary |
| `diameter_request` | bool | Is Diameter request |
| `diameter_response` | bool | Is Diameter response |
| `diameter_request_length` | int | Request size (0 if absent) |
| `diameter_response_length` | int | Response size (0 if absent) |
| `diameter_total_size` | int | Sum of request + response |
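**Example**: `diameter && diameter_method.contains("Credit")` (credit-control traffic)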
## L4 Connection Tracking Variables
| Variable | Type | Description | Example |
|----------|------|-------------|---------|
| `conn` | bool | Connection tracking entry | |
| `conn_state` | string | Connection state | `"open"`, `"in_progress"`, `"closed"` |
| `conn_local_pkts` | int | Packets from local peer | |
| `conn_local_bytes` | int | Bytes from local peer | |
| `conn_remote_pkts` | int | Packets from remote peer | |
| `conn_remote_bytes` | int | Bytes from remote peer | |
| `conn_l7_detected` | []string | L7 protocols detected on connection | `["HTTP", "TLS"]` |
| `conn_group_id` | int | Connection group identifier | |
**Example**: `conn && conn_state == "open" && conn_local_bytes > 1000000` (high-volume open connections)
## L4 Flow Tracking Variables
Flows extend connections with rate metrics (packets/bytes per second).
| Variable | Type | Description |
|----------|------|-------------|
| `flow` | bool | Flow tracking entry |
| `flow_state` | string | Flow state (`"open"`, `"in_progress"`, `"closed"`) |
| `flow_local_pkts` | int | Packets from local peer |
| `flow_local_bytes` | int | Bytes from local peer |
| `flow_remote_pkts` | int | Packets from remote peer |
| `flow_remote_bytes` | int | Bytes from remote peer |
| `flow_local_pps` | int | Local packets per second |
| `flow_local_bps` | int | Local bytes per second |
| `flow_remote_pps` | int | Remote packets per second |
| `flow_remote_bps` | int | Remote bytes per second |
| `flow_l7_detected` | []string | L7 protocols detected on flow |
| `flow_group_id` | int | Flow group identifier |
**Example**: `tcp_flow && flow_local_bps > 5000000` (high-bandwidth TCP flows)
## Kubernetes Variables
### Pod and Service (Directional)
| Variable | Type | Description |
|----------|------|-------------|
| `src.pod.name` | string | Source pod name |
| `src.pod.namespace` | string | Source pod namespace |
| `dst.pod.name` | string | Destination pod name |
| `dst.pod.namespace` | string | Destination pod namespace |
| `src.service.name` | string | Source service name |
| `src.service.namespace` | string | Source service namespace |
| `dst.service.name` | string | Destination service name |
| `dst.service.namespace` | string | Destination service namespace |
**Fallback behavior**: Pod namespace/name fields automatically fall back to
service data when pod info is unavailable. This means `dst.pod.namespace` works
even when only service-level resolution exists.
**Example**: `src.service.name == "api-gateway" && dst.pod.namespace == "production"`
### Aggregate Collections (Non-Directional)
| Variable | Type | Description |
|----------|------|-------------|
| `namespaces` | []string | All namespaces (src + dst, pod + service) |
| `pods` | []string | All pod names (src + dst) |
| `services` | []string | All service names (src + dst) |
### Labels and Annotations
| Variable | Type | Description |
|----------|------|-------------|
| `local_labels` | map[string]string | Kubernetes labels of local peer |
| `local_annotations` | map[string]string | Kubernetes annotations of local peer |
| `remote_labels` | map[string]string | Kubernetes labels of remote peer |
| `remote_annotations` | map[string]string | Kubernetes annotations of remote peer |
Use `map_get(local_labels, "key", "default")` for safe access that won't error
on missing keys.
**Example**: `map_get(local_labels, "app", "") == "checkout" && "production" in namespaces`
### Node Information
| Variable | Type | Description |
|----------|------|-------------|
| `node` | map | Nested: `node["name"]`, `node["ip"]` |
| `node_name` | string | Node name (flat alias) |
| `node_ip` | string | Node IP (flat alias) |
| `local_node_name` | string | Node name of local peer |
| `remote_node_name` | string | Node name of remote peer |
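**Example**: `node_name == "ip-10-0-25-170.ec2.internal"` (all traffic observed on a single node)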
### Process Information
| Variable | Type | Description |
|----------|------|-------------|
| `local_process_name` | string | Process name on local peer |
| `remote_process_name` | string | Process name on remote peer |
### DNS Resolution
| Variable | Type | Description |
|----------|------|-------------|
| `src.dns` | string | DNS resolution of source IP |
| `dst.dns` | string | DNS resolution of destination IP |
| `dns_resolutions` | []string | All DNS resolutions (deduplicated) |
### Resolution Status
| Variable | Type | Values |
|----------|------|--------|
| `local_resolution_status` | string | `""` (resolved), `"no_node_mapping"`, `"rpc_error"`, `"rpc_empty"`, `"cache_miss"`, `"queue_full"` |
| `remote_resolution_status` | string | Same as above |
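**Example**: `local_resolution_status != ""` (entries whose local peer could not be resolved to a Kubernetes identity)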
## Default Values
When a variable is not present in an entry, KFL2 uses these defaults:
| Type | Default |
|------|---------|
| string | `""` |
| int | `0` |
| bool | `false` |
| list | `[]` |
| map | `{}` |
| bytes | `[]` |
## Protocol Variable Precedence
For protocols with request/response pairs (Kafka, RADIUS, Diameter), merged
fields prefer the **request** side. If no request exists, the response value
is used. Size totals are always computed as `request_size + response_size`.
## CEL Language Features
KFL2 supports the full CEL specification:
- **Short-circuit evaluation**: `&&` stops on first false, `||` stops on first true
- **Ternary**: `condition ? value_if_true : value_if_false`
- **Regex**: `str.matches("pattern")` uses RE2 syntax
- **Type coercion**: Timestamps require `timestamp()`, durations require `duration()`
- **Null safety**: Use `in` operator or `map_get()` before accessing map keys
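A few of these features in filter form (field names come from this reference; the literal values are illustrative):

```
http && url.matches(".*/api/v[0-9]+/.*")                              // RE2 regex
timestamp > now() - duration("15m")                                   // duration coercion
(method == "POST" ? request_body_size : response_body_size) > 100000  // ternary
```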
For the full CEL specification, see the
[CEL Language Definition](https://github.com/google/cel-spec/blob/master/doc/langdef.md).

skills/network-rca/SKILL.md
---
name: network-rca
description: >
Kubernetes network root cause analysis skill powered by Kubeshark MCP. Use this skill
whenever the user wants to investigate past incidents, perform retrospective traffic
analysis, take or manage traffic snapshots, extract PCAPs, dissect L7 API calls from
historical captures, compare traffic patterns over time, detect drift or anomalies
between snapshots, or do any kind of forensic network analysis in Kubernetes.
Also trigger when the user mentions snapshots, raw capture, PCAP extraction,
traffic replay, postmortem analysis, "what happened yesterday/last week",
root cause analysis, RCA, cloud snapshot storage, snapshot dissection, or KFL filters
for historical traffic. Even if the user just says "figure out what went wrong"
or "compare today's traffic to yesterday" in a Kubernetes context, use this skill.
---
# Network Root Cause Analysis with Kubeshark MCP
You are a Kubernetes network forensics specialist. Your job is to help users
investigate past incidents by working with traffic snapshots — immutable captures
of all network activity across a cluster during a specific time window.
Kubeshark is a search engine for network traffic. Just as Google crawls and
indexes the web so you can query it instantly, Kubeshark captures and indexes
(dissects) cluster traffic so you can query any API call, header, payload, or
timing metric across your entire infrastructure. Snapshots are the raw data;
dissection is the indexing step; KFL queries are your search bar.
Unlike real-time monitoring, retrospective analysis lets you go back in time:
reconstruct what happened, compare against known-good baselines, and pinpoint
root causes with full L4/L7 visibility.
## Prerequisites
Before starting any analysis, verify the environment is ready.
### Kubeshark MCP Health Check
Confirm the Kubeshark MCP is accessible and tools are available. Look for tools
like `list_api_calls`, `list_l4_flows`, `create_snapshot`, etc.
**Tool**: `check_kubeshark_status`
If tools like `list_api_calls` or `list_l4_flows` are missing from the response,
something is wrong with the MCP connection. Guide the user through setup
(see Setup Reference at the bottom).
### Raw Capture Must Be Enabled
Retrospective analysis depends on raw capture — Kubeshark's kernel-level (eBPF)
packet recording that stores traffic at the node level. Without it, snapshots
have nothing to work with.
Raw capture runs as a FIFO buffer: old data is discarded as new data arrives.
The buffer size determines how far back you can go. Larger buffer = wider
snapshot window.
```yaml
tap:
  capture:
    raw:
      enabled: true
      storageSize: 10Gi  # Per-node FIFO buffer
```
If raw capture isn't enabled, inform the user that retrospective analysis
requires it and share the configuration above.
### Snapshot Storage
Snapshots are assembled on the Hub's storage, which is ephemeral by default.
For serious forensic work, persistent storage is recommended:
```yaml
tap:
snapshots:
local:
storageClass: gp2
storageSize: 1000Gi
```
## Core Workflow
Every investigation starts with a snapshot. After that, you choose one of two
investigation routes depending on your goal:
1. **Determine time window** — When did the issue occur? Use `get_data_boundaries`
to see what raw capture data is available.
2. **Create or locate a snapshot** — Either take a new snapshot covering the
incident window, or find an existing one with `list_snapshots`.
3. **Choose your investigation route** — PCAP or Dissection (see below).
### Choosing the Right Route
| | PCAP Route | Dissection Route |
|---|---|---|
| **Speed** | Immediate — no indexing needed | Takes time to index |
| **Filtering** | Nodes, time window, BPF filters | Kubernetes & API-level (pods, labels, paths, status codes) |
| **Output** | Cluster-wide PCAP files | Structured query results |
| **Investigation by** | Human (Wireshark) | AI agent or human (queryable database) |
| **Best for** | Compliance, sharing with network teams, Wireshark deep-dives | Root cause analysis, API-level debugging, automated investigation |
Both routes are valid and complementary. Use PCAP when you need raw packets
for human analysis or compliance. Use Dissection when you want an AI agent
to search and analyze traffic programmatically.
## Snapshot Operations
Both routes start here. A snapshot is an immutable freeze of all cluster traffic
in a time window.
### Check Data Boundaries
**Tool**: `get_data_boundaries`
Check what raw capture data exists across the cluster. You can only create
snapshots within these boundaries — data outside the window has been rotated
out of the FIFO buffer.
**Example response**:
```
Cluster-wide:
Oldest: 2026-03-14 16:12:34 UTC
Newest: 2026-03-14 18:05:20 UTC
Per node:
┌─────────────────────────────┬──────────┬──────────┐
│ Node │ Oldest │ Newest │
├─────────────────────────────┼──────────┼──────────┤
│ ip-10-0-25-170.ec2.internal │ 16:12:34 │ 18:03:39 │
│ ip-10-0-32-115.ec2.internal │ 16:13:45 │ 18:05:20 │
└─────────────────────────────┴──────────┴──────────┘
```
If the incident falls outside the available window, the data has been rotated
out. Suggest increasing `storageSize` for future coverage.
### Create a Snapshot
**Tool**: `create_snapshot`
Specify nodes (or cluster-wide) and a time window within the data boundaries.
Snapshots include raw capture files, Kubernetes pod events, and eBPF cgroup events.
Snapshots take time to build. Check status with `get_snapshot` — wait until
`completed` before proceeding with either route.
### List Existing Snapshots
**Tool**: `list_snapshots`
Shows all snapshots on the local Hub, with name, size, status, and node count.
### Cloud Storage
Snapshots on the Hub are ephemeral. Cloud storage (S3, GCS, Azure Blob)
provides long-term retention. Snapshots can be downloaded to any cluster
with Kubeshark — not necessarily the original one.
**Check cloud status**: `get_cloud_storage_status`
**Upload to cloud**: `upload_snapshot_to_cloud`
**Download from cloud**: `download_snapshot_from_cloud`
---
## Route 1: PCAP
The PCAP route does **not** require dissection. It works directly with the raw
snapshot data to produce filtered, cluster-wide PCAP files. Use this route when:
- You need raw packets for Wireshark analysis
- You're sharing captures with network teams
- You need evidence for compliance or audit
- A human will perform the investigation (not an AI agent)
### Filtering a PCAP
**Tool**: `export_snapshot_pcap`
Filter the snapshot down to what matters using:
- **Nodes** — specific cluster nodes only
- **Time** — sub-window within the snapshot
- **BPF filter** — standard Berkeley Packet Filter syntax (e.g., `host 10.0.53.101`,
`port 8080`, `net 10.0.0.0/16`)
These filters are combinable — select specific nodes, narrow the time range,
and apply a BPF expression all at once.
### Workload-to-BPF Workflow
When you know the workload names but not their IPs, resolve them from the
snapshot's metadata. Snapshots preserve pod-to-IP mappings from capture time,
so resolution is accurate even if pods have been rescheduled since.
**Tool**: `resolve_workload`
**Example workflow** — extract PCAP for specific workloads:
1. Resolve IPs: `resolve_workload` for `orders-594487879c-7ddxf` → `10.0.53.101`
2. Resolve IPs: `resolve_workload` for `payment-service-6b8f9d-x2k4p` → `10.0.53.205`
3. Build BPF: `host 10.0.53.101 or host 10.0.53.205`
4. Export: `export_snapshot_pcap` with that BPF filter
This gives you a cluster-wide PCAP filtered to exactly the workloads involved
in the incident — ready for Wireshark or long-term storage.
---
## Route 2: Dissection
The Dissection route indexes raw packets into structured L7 API calls, building
a queryable database from the snapshot. Use this route when:
- An AI agent is performing the investigation
- You need to search by Kubernetes context (pods, namespaces, labels, services)
- You need to search by API elements (paths, status codes, headers, payloads)
- You want structured responses you can analyze programmatically
- You need to drill into the payload of a specific API call
**KFL requirement**: The Dissection route uses KFL filters for all queries
(`list_api_calls`, `get_api_stats`, etc.). Before constructing any KFL filter,
load the KFL skill (`skills/kfl/`). KFL is statically typed — incorrect field
names or syntax will fail silently or error. If the KFL skill is not available,
suggest the user install it:
```bash
ln -s /path/to/kubeshark/skills/kfl ~/.claude/skills/kfl
```
**If the KFL skill cannot be loaded**, only use the exact filter examples shown
in this skill. Do not improvise or guess at field names, operators, or syntax.
KFL field names differ from what you might expect (e.g., `status_code` not
`response.status`, `src.pod.namespace` not `src.namespace`). Using incorrect
fields produces wrong results without warning.
### Activate Dissection
**Tool**: `start_snapshot_dissection`
Dissection takes time proportional to snapshot size — it parses every packet,
reassembles streams, and builds the index. After completion, these tools
become available:
- `list_api_calls` — Search API transactions with KFL filters
- `get_api_call` — Drill into a specific call (headers, body, timing, payload)
- `get_api_stats` — Aggregated statistics (throughput, error rates, latency)
### Investigation Strategy
Start broad, then narrow:
1. `get_api_stats` — Get the overall picture: error rates, latency percentiles,
throughput. Look for spikes or anomalies.
2. `list_api_calls` filtered by error codes (4xx, 5xx) or high latency — find
the problematic transactions.
3. `get_api_call` on specific calls — inspect headers, bodies, timing, and
full payload to understand what went wrong.
4. Use KFL filters to slice by namespace, service, protocol, or any combination.
**Example `list_api_calls` response** (filtered to `http && status_code >= 500`):
```
┌─────────────────────┬────────┬──────────────────────────┬────────┬───────────┐
│ Timestamp           │ Method │ URL                      │ Status │ Elapsed   │
├─────────────────────┼────────┼──────────────────────────┼────────┼───────────┤
│ 2026-03-14 17:23:45 │ POST   │ /api/v1/orders/charge    │ 503    │ 12,340 ms │
│ 2026-03-14 17:23:46 │ POST   │ /api/v1/orders/charge    │ 503    │ 11,890 ms │
│ 2026-03-14 17:23:48 │ GET    │ /api/v1/inventory/check  │ 500    │  8,210 ms │
│ 2026-03-14 17:24:01 │ POST   │ /api/v1/payments/process │ 502    │ 30,000 ms │
└─────────────────────┴────────┴──────────────────────────┴────────┴───────────┘
Src: api-gateway (prod) → Dst: payment-service (prod)
```
Use the pattern of repeated failures and high latency to identify the failing
service chain, then drill into individual calls with `get_api_call`.
### KFL Filters for Dissected Traffic
Layer filters progressively when investigating:
```
// Step 1: Protocol + namespace
http && dst.pod.namespace == "production"
// Step 2: Add error condition
http && dst.pod.namespace == "production" && status_code >= 500
// Step 3: Narrow to service
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service"
// Step 4: Narrow to endpoint
http && dst.pod.namespace == "production" && status_code >= 500 && dst.service.name == "payment-service" && path.contains("/charge")
```
Other common RCA filters:
```
dns && dns_response && status_code != 0 // Failed DNS lookups
src.service.namespace != dst.service.namespace // Cross-namespace traffic
http && elapsed_time > 5000000 // Slow transactions (> 5s)
conn && conn_state == "open" && conn_local_bytes > 1000000 // High-volume connections
```
---
## Combining Both Routes
The two routes are complementary. A common pattern:
1. Start with **Dissection** — let the AI agent search and identify the root cause
2. Once you've pinpointed the problematic workloads, use `resolve_workload`
to get their IPs
3. Switch to **PCAP** — export a filtered PCAP of just those workloads for
Wireshark deep-dive, sharing with the network team, or compliance archival
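For example, if `resolve_workload` returned pod IPs for the two workloads (the IPs and port below are hypothetical placeholders), the BPF filter passed to the PCAP export might restrict the capture to just their traffic:

```
# Hypothetical IPs from resolve_workload; keeps only traffic between
# the two implicated workloads on their service port
(host 10.0.3.17 and host 10.0.3.24) and tcp port 8080
```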
## Use Cases
### Post-Incident RCA
1. Identify the incident time window from alerts, logs, or user reports
2. Check `get_data_boundaries` — is the window still in raw capture?
3. `create_snapshot` covering the incident window (add a 15-minute buffer)
4. **Dissection route**: `start_snapshot_dissection``get_api_stats`
`list_api_calls``get_api_call` → follow the dependency chain
5. **PCAP route**: `resolve_workload``export_snapshot_pcap` with BPF →
hand off to Wireshark or archive
### Other Use Cases
- **Trend analysis** — Take snapshots at regular intervals and compare
`get_api_stats` across them to detect latency drift, error rate changes,
or new service-to-service connections.
- **Forensic preservation** — `create_snapshot` + `upload_snapshot_to_cloud`
  for immutable, long-term evidence; the snapshot can be downloaded to any
  cluster months later.
- **Production-to-local replay** — Upload a production snapshot to cloud,
download it on a local KinD cluster, and investigate safely.
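To illustrate the trend-analysis idea, a small script could diff two `get_api_stats` results taken at different times. The response shape used here (service → error rate and p99 latency) is an assumption for the sketch, not the actual tool schema — adapt the keys to the real output:

```python
# Diff two hypothetical get_api_stats results to surface regressions.
# The dict shape (service -> {error_rate, p99_ms}) is assumed, not the
# real Kubeshark schema.

def stats_drift(before, after, err_delta=0.01, p99_delta_ms=100):
    """Return services whose error rate or p99 latency regressed."""
    regressions = {}
    for svc, new in after.items():
        old = before.get(svc, {"error_rate": 0.0, "p99_ms": 0.0})
        err_up = new["error_rate"] - old["error_rate"]
        p99_up = new["p99_ms"] - old["p99_ms"]
        if err_up >= err_delta or p99_up >= p99_delta_ms:
            regressions[svc] = {"error_rate_change": round(err_up, 4),
                                "p99_ms_change": round(p99_up, 1)}
    return regressions

before = {"payment-service": {"error_rate": 0.002, "p99_ms": 840.0}}
after = {"payment-service": {"error_rate": 0.031, "p99_ms": 12040.0},
         "inventory":       {"error_rate": 0.000, "p99_ms": 95.0}}
print(stats_drift(before, after))
```

New services that appear only in the later snapshot are compared against zero baselines, so a brand-new noisy service surfaces as a regression too.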
## Setup Reference
For CLI installation, MCP configuration, verification, and troubleshooting,
see `references/setup.md`.
---
# Kubeshark MCP Setup Reference
## Installing the CLI
**Homebrew (macOS)**:
```bash
brew install kubeshark
```
**Linux**:
```bash
sh <(curl -Ls https://kubeshark.com/install)
```
**From source**:
```bash
git clone https://github.com/kubeshark/kubeshark
cd kubeshark && make
```
## MCP Configuration
**Claude Desktop / Cowork** (`claude_desktop_config.json`):
```json
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp"]
}
}
}
```
**Claude Code (CLI)**:
```bash
claude mcp add kubeshark -- kubeshark mcp
```
**Without kubectl access** (direct URL mode):
```json
{
"mcpServers": {
"kubeshark": {
"command": "kubeshark",
"args": ["mcp", "--url", "https://kubeshark.example.com"]
}
}
}
```
```bash
# Claude Code equivalent:
claude mcp add kubeshark -- kubeshark mcp --url https://kubeshark.example.com
```
## Verification
- Claude Code: `/mcp` to check connection status
- Terminal: `kubeshark mcp --list-tools`
- Cluster: `kubectl get pods -l app=kubeshark-hub`
## Troubleshooting
- **Binary not found** → Install via Homebrew or the install script above
- **Connection refused** → Deploy Kubeshark first: `kubeshark tap`
- **No L7 data** → Check `get_dissection_status` and `enable_dissection`
- **Snapshot creation fails** → Verify raw capture is enabled in Kubeshark config
- **Empty snapshot** → Check `get_data_boundaries` — the requested window may
fall outside available data