Telemetry

This document describes the Prometheus-based telemetry system in sase. Telemetry tracks metrics across all major subsystems (agent lifecycle, LLM providers, axe daemon, hooks, beads, VCS/workspace, and notifications). It also provides CLI tools for monitoring and health checks, plus a bundled Docker Compose monitoring stack.

Overview

The telemetry system uses Prometheus metrics to instrument sase internals. Key design principles:

Zero-cost when disabled: All metrics are lightweight no-op stubs when telemetry.enabled is false (the default), so instrumentation adds negligible runtime overhead unless telemetry is explicitly enabled.
Dual data collection: Short-lived processes (agents) push metrics to a Push Gateway. Long-lived processes (axe daemon) expose metrics via an HTTP endpoint for Prometheus to scrape.
34 metrics across 7 subsystems: Comprehensive coverage of the full sase lifecycle.
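The no-op stub idea can be sketched as follows. This is a minimal illustration, not the code in `_stubs.py`: the class name `NoOpMetric` and the helper `make_metric` are hypothetical, and the method names simply mirror the metric interface of the `prometheus_client` library (`labels`, `inc`, `dec`, `set`, `observe`).

```python
from typing import Any, Callable


class NoOpMetric:
    """Hypothetical stub that accepts any metric call and does nothing."""

    def labels(self, *args: Any, **kwargs: Any) -> "NoOpMetric":
        # Returning self keeps chained calls like .labels(...).inc() working.
        return self

    def inc(self, amount: float = 1.0) -> None:
        pass

    def dec(self, amount: float = 1.0) -> None:
        pass

    def set(self, value: float) -> None:
        pass

    def observe(self, amount: float) -> None:
        pass


def make_metric(enabled: bool, real_factory: Callable[[], Any]) -> Any:
    """Build the real metric only when telemetry is enabled (assumed switch)."""
    return real_factory() if enabled else NoOpMetric()


# With telemetry disabled, the real factory is never invoked and every
# instrumentation call is a cheap no-op.
agent_runs_total = make_metric(enabled=False, real_factory=lambda: object())
agent_runs_total.labels(llm_provider="example", status="ok", workflow="demo").inc()
```

Because call sites look identical in both modes, instrumented code never needs to check whether telemetry is enabled.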

Configuration

Telemetry is configured under the telemetry key in sase.yml:

telemetry:
  enabled: false # Toggle telemetry on/off globally
  prometheus:
    exposition_port: 9464 # HTTP server port for axe and agents
    pushgateway_url: "localhost:9091" # Push Gateway address
  health_thresholds:
    error_rate_warn: 10.0 # % threshold for warn status
    error_rate_critical: 25.0 # % threshold for critical status
    retry_rate_warn: 10.0
    retry_rate_critical: 25.0
    p95_latency_warn: 300.0 # seconds
    p95_latency_critical: 600.0

Field	Type	Default	Description
`telemetry.enabled`	bool	`false`	Enable or disable telemetry globally.
`telemetry.prometheus.exposition_port`	int	`9464`	HTTP server port for metric exposition.
`telemetry.prometheus.pushgateway_url`	str	`localhost:9091`	Prometheus Push Gateway address.
`telemetry.health_thresholds.*_rate_warn`	float	`10.0`	Error/retry rate (%) at which health status becomes WARN.
`telemetry.health_thresholds.*_rate_critical`	float	`25.0`	Error/retry rate (%) at which health status becomes CRITICAL.
`telemetry.health_thresholds.p95_latency_*`	float	`300.0` / `600.0`	P95 latency thresholds in seconds (warn / critical).

Source: src/sase/default_config.yml, src/sase/telemetry/_config.py
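As an illustration of how these defaults might be merged with user settings, here is a minimal sketch. It is not the code in `_config.py`: the names `TelemetryConfig`, `HealthThresholds`, and `load_telemetry_config` are hypothetical, and the input is assumed to be the already-parsed contents of sase.yml.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass(frozen=True)
class HealthThresholds:
    # Defaults taken from the configuration listing above.
    error_rate_warn: float = 10.0
    error_rate_critical: float = 25.0
    retry_rate_warn: float = 10.0
    retry_rate_critical: float = 25.0
    p95_latency_warn: float = 300.0
    p95_latency_critical: float = 600.0


@dataclass(frozen=True)
class TelemetryConfig:
    enabled: bool = False
    exposition_port: int = 9464
    pushgateway_url: str = "localhost:9091"
    health_thresholds: HealthThresholds = field(default_factory=HealthThresholds)


def load_telemetry_config(raw: Mapping[str, Any]) -> TelemetryConfig:
    """Overlay the `telemetry` section of a parsed config onto the defaults."""
    section = raw.get("telemetry") or {}
    prom = section.get("prometheus") or {}
    thresholds = section.get("health_thresholds") or {}
    return TelemetryConfig(
        enabled=bool(section.get("enabled", False)),
        exposition_port=int(prom.get("exposition_port", 9464)),
        pushgateway_url=str(prom.get("pushgateway_url", "localhost:9091")),
        # Unknown threshold keys raise TypeError, which surfaces typos early.
        health_thresholds=HealthThresholds(
            **{key: float(value) for key, value in thresholds.items()}
        ),
    )


config = load_telemetry_config(
    {"telemetry": {"enabled": True, "health_thresholds": {"error_rate_warn": 5}}}
)
```

Only the keys present in the user's file override the defaults; everything else keeps its documented value.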

CLI Commands

`sase telemetry status`

Quick health check and configuration display. Shows telemetry enabled/disabled status, metric counts by type, and reachability of the Push Gateway and HTTP exposition server.

sase telemetry status

`sase telemetry list`

Display the metric catalog from internal definitions, grouped by subsystem.

sase telemetry list                      # All metrics
sase telemetry list -s "Agent Lifecycle" # Filter by subsystem
sase telemetry list -t counter           # Filter by type (counter, gauge, histogram)

Flag	Values	Default	Description
`-s, --subsystem`	subsystem name	-	Filter to a subsystem
`-t, --type`	`counter`, `gauge`, `histogram`	-	Filter by metric type
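The filtering behind `-s` and `-t` can be pictured with a small sketch. The `MetricDef` record and `filter_catalog` function are hypothetical stand-ins for whatever `catalog.py` defines, and exact (case-sensitive) matching is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class MetricDef:
    name: str
    type: str  # "counter", "gauge", or "histogram"
    subsystem: str
    labels: Tuple[str, ...] = ()


# A few entries copied from the metric catalog in this document.
CATALOG = [
    MetricDef("sase_agent_runs_total", "counter", "Agent Lifecycle",
              ("llm_provider", "status", "workflow")),
    MetricDef("sase_agent_active", "gauge", "Agent Lifecycle",
              ("llm_provider", "project")),
    MetricDef("sase_llm_invocation_duration_seconds", "histogram",
              "LLM Provider", ("provider",)),
]


def filter_catalog(
    catalog: Iterable[MetricDef],
    subsystem: Optional[str] = None,
    metric_type: Optional[str] = None,
) -> List[MetricDef]:
    """Keep metrics matching every filter that was supplied."""
    return [
        metric
        for metric in catalog
        if (subsystem is None or metric.subsystem == subsystem)
        and (metric_type is None or metric.type == metric_type)
    ]


agent_counters = filter_catalog(CATALOG, subsystem="Agent Lifecycle",
                                metric_type="counter")
```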

`sase telemetry snapshot`

Fetch and display current metric values from the Push Gateway or exposition endpoint.

sase telemetry snapshot                          # Rich table (default)
sase telemetry snapshot -f json                  # JSON output
sase telemetry snapshot -f prometheus            # Raw Prometheus text format
sase telemetry snapshot -S pushgateway           # Force pushgateway source

Flag	Values	Default	Description
`-S, --source`	`auto`, `pushgateway`, `exposition`	`auto`	Data source
`-f, --format`	`rich`, `json`, `prometheus`	`rich`	Output format
`-s, --subsystem`	subsystem name	-	Filter by subsystem
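Both the Push Gateway and the exposition endpoint serve the Prometheus text format, so a snapshot ultimately comes down to parsing lines of the form `name{label="value"} number`. The following is a simplified sketch of such a parser, not the one in `scrape.py`; `Sample` and `parse_prometheus_text` are hypothetical names, and escape sequences inside label values are left as-is.

```python
import re
from typing import Dict, List, NamedTuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


# metric_name, optional {labels}, value, optional integer timestamp
_SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text: str) -> List[Sample]:
    """Parse sample lines, skipping comments, blanks, and malformed lines."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)  # also accepts NaN, +Inf, -Inf
        except ValueError:
            continue
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append(Sample(name, labels, value))
    return samples


EXAMPLE = """\
# HELP sase_agent_active Currently active agents
# TYPE sase_agent_active gauge
sase_agent_active{llm_provider="example",project="demo"} 2
sase_zombie_detections_total 5
"""
parsed = parse_prometheus_text(EXAMPLE)
```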

`sase telemetry dashboard`

Live auto-refreshing TUI dashboard with two modes: summary (default) and charts.

sase telemetry dashboard                  # Summary mode, 5s refresh
sase telemetry dashboard -c               # Charts mode with historical data
sase telemetry dashboard -c -r 24h        # Charts over last 24 hours
sase telemetry dashboard -c -s "LLM Provider"  # Focus on one subsystem
sase telemetry dashboard -i 10            # 10-second refresh interval

Flag	Values	Default	Description
`-i, --interval`	int (seconds)	`5`	Refresh interval
`-S, --source`	`auto`, `pushgateway`, `exposition`	`auto`	Data source
`-c, --charts`	flag	-	Enable charts mode with historical data
`-r, --range`	`1h`, `6h`, `24h`, `7d`	`1h`	Time range for charts
`-s, --subsystem`	subsystem name	-	Focus on a single subsystem (larger charts)

Summary mode shows styled stat panels with color-coded gauges (Active Agents, Active Workspaces, Active Beads) and compact subsystem metric tables.

Charts mode (-c) renders historical line and bar charts via Prometheus range queries, showing: Agent Run Duration, Active Agents, Agent Throughput, LLM Token Usage, LLM Latency, and Error Rate. Falls back to summary mode when Prometheus is unreachable.
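A range query against the Prometheus HTTP API is a GET request to `/api/v1/query_range` with `query`, `start`, `end`, and `step` parameters. The sketch below shows how a `-r` value could be turned into such a request URL. The function name, the choice of roughly 60 points per chart, and the example PromQL expression are assumptions for illustration, not details taken from `prom_query.py`.

```python
from urllib.parse import urlencode

# Seconds covered by each supported -r/--range value.
RANGE_SECONDS = {"1h": 3600, "6h": 21600, "24h": 86400, "7d": 604800}


def build_range_query_url(
    base_url: str,
    promql: str,
    range_key: str,
    end_ts: int,
    points: int = 60,
) -> str:
    """Build a query_range URL ending at end_ts (Unix seconds)."""
    span = RANGE_SECONDS[range_key]
    # Pick a step so the chart gets about `points` samples (assumed policy).
    step = max(span // points, 1)
    params = {
        "query": promql,
        "start": end_ts - span,
        "end": end_ts,
        "step": step,
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


url = build_range_query_url(
    "http://localhost:9090",  # Prometheus port from the bundled stack
    "sum(sase_agent_active)",
    "1h",
    end_ts=1700000000,
)
```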

`sase telemetry health`

Traffic-light health assessment with OK/WARN/CRITICAL status for each subsystem based on configured thresholds.

sase telemetry health                    # Rich output
sase telemetry health -j                 # Machine-readable JSON

Flag	Values	Default	Description
`-j, --json`	flag	-	JSON output
`-S, --source`	`auto`, `pushgateway`, `exposition`	`auto`	Data source

Exit codes: 0 (healthy), 1 (degraded/warn), 2 (critical).
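The relationship between thresholds, statuses, and exit codes can be sketched as below. This is an illustration under two assumptions that the source does not confirm: thresholds are treated as inclusive (`>=`), and the process exit code is taken to be the worst status across subsystems. The names `Health`, `classify`, and `overall_exit_code` are hypothetical.

```python
from enum import IntEnum
from typing import Iterable


class Health(IntEnum):
    # Values match the documented exit codes.
    OK = 0
    WARN = 1
    CRITICAL = 2


def classify(value: float, warn: float, critical: float) -> Health:
    """Map a measured value onto a traffic-light status."""
    if value >= critical:
        return Health.CRITICAL
    if value >= warn:
        return Health.WARN
    return Health.OK


def overall_exit_code(statuses: Iterable[Health]) -> int:
    """Use the worst subsystem status as the exit code (assumed policy)."""
    return int(max(statuses, default=Health.OK))


# Example with the default thresholds from the Configuration section:
# a 12% error rate and a 120 s p95 latency.
statuses = [
    classify(12.0, warn=10.0, critical=25.0),
    classify(120.0, warn=300.0, critical=600.0),
]
exit_code = overall_exit_code(statuses)
```

A non-zero exit code makes the command easy to use as a gate in scripts or CI jobs.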

`sase telemetry export-config`

Export the bundled monitoring stack (Docker Compose, Prometheus config, Grafana dashboard) to a local directory.

sase telemetry export-config                     # Default: ./sase-monitoring/
sase telemetry export-config -o /tmp/monitoring  # Custom output directory
sase telemetry export-config -f                  # Overwrite existing

Flag	Values	Default	Description
`-o, --output-dir`	path	`./sase-monitoring/`	Target directory
`-f, --force`	flag	-	Overwrite target if it exists

After exporting, start the stack with:

cd sase-monitoring && docker compose up -d

Architecture

┌──────────────────────────────────────────────────────────┐
│                    sase telemetry CLI                    │
│  status · list · snapshot · dashboard · health · export  │
├──────────────────────────────────────────────────────────┤
│                       Scrape Client                      │
│          (fetches from Push Gateway or exposition)       │
├─────────────────────┬────────────────────────────────────┤
│   Push Gateway      │   HTTP Exposition Server           │
│ (short-lived procs) │   (long-lived: axe daemon)         │
├─────────────────────┴────────────────────────────────────┤
│                 Prometheus Metrics Layer                 │
│       metrics.py (34 metric singletons, stub/real switch)│
├──────────────────────────────────────────────────────────┤
│                  Instrumentation Points                  │
│  Agent · LLM · Axe · Hooks · Beads · VCS · Notifications │
└──────────────────────────────────────────────────────────┘

Source Layout

File / Directory	Purpose
`src/sase/telemetry/__init__.py`	Public API exports
`src/sase/telemetry/metrics.py`	Module-level metric singletons (34 attrs)
`src/sase/telemetry/_registry.py`	Init, Push Gateway integration, atexit
`src/sase/telemetry/_stubs.py`	No-op stub classes (zero overhead)
`src/sase/telemetry/_config.py`	Configuration loading from sase.yml
`src/sase/telemetry/catalog.py`	Structured metric catalog and grouping
`src/sase/telemetry/scrape.py`	HTTP client and Prometheus text parser
`src/sase/telemetry/prom_query.py`	Prometheus HTTP API client (range queries)
`src/sase/telemetry/charts.py`	Terminal chart rendering via plotext
`src/sase/telemetry/cli_*.py`	CLI subcommand handlers
`src/sase/telemetry/monitoring/`	Bundled Docker Compose + Prometheus + Grafana

Metric Catalog

34 metrics organized into 7 subsystems:

Agent Lifecycle

Prometheus Name	Type	Labels	Description
`sase_agent_runs_total`	counter	llm_provider, status, workflow	Total agent runs
`sase_agent_run_duration_seconds`	histogram	llm_provider, workflow	Agent run duration
`sase_agent_active`	gauge	llm_provider, project	Currently active agents
`sase_agent_spawns_total`	counter	llm_provider, project	Total agent spawns
`sase_agent_kills_total`	counter	reason	Total agent kills

LLM Provider

Prometheus Name	Type	Labels	Description
`sase_llm_invocations_total`	counter	provider, status	Total LLM invocations
`sase_llm_invocation_duration_seconds`	histogram	provider	Invocation duration
`sase_llm_errors_total`	counter	provider, error_type	LLM errors
`sase_llm_retries_total`	counter	provider	LLM retries
`sase_llm_retry_spawns_total`	counter	outcome	Cross-process retries
`sase_llm_input_tokens_total`	counter	provider	Input tokens consumed
`sase_llm_output_tokens_total`	counter	provider	Output tokens generated
`sase_llm_cache_read_tokens_total`	counter	provider	Cache-read tokens

Axe Orchestrator

Prometheus Name	Type	Labels	Description
`sase_axe_cycles_total`	counter	cycle_type	Total axe cycles
`sase_axe_cycle_duration_seconds`	histogram	cycle_type	Cycle duration
`sase_axe_lumberjacks_active`	gauge	-	Active lumberjacks
`sase_axe_lumberjack_restarts_total`	counter	-	Lumberjack restarts
`sase_axe_errors_total`	counter	error_type	Axe errors

Hooks / Mentors / Workflows

Prometheus Name	Type	Labels	Description
`sase_hook_executions_total`	counter	hook_type, status	Hook executions
`sase_hook_duration_seconds`	histogram	hook_type	Hook duration
`sase_hook_retries_total`	counter	hook_type	Hook retries
`sase_mentor_executions_total`	counter	status	Mentor executions
`sase_workflow_executions_total`	counter	workflow, status	Workflow executions
`sase_workflow_duration_seconds`	histogram	workflow	Workflow duration
`sase_zombie_detections_total`	counter	-	Zombie process detections

Beads

Prometheus Name	Type	Labels	Description
`sase_bead_operations_total`	counter	operation	Bead CRUD operations
`sase_bead_status_transitions_total`	counter	from_status, to_status	Bead status transitions
`sase_bead_active`	gauge	project, status	Active beads

VCS / Workspace

Prometheus Name	Type	Labels	Description
`sase_vcs_commits_total`	counter	provider, type	VCS commits
`sase_vcs_operations_total`	counter	provider, operation, status	VCS operations
`sase_workspace_acquisitions_total`	counter	project	Workspace acquisitions
`sase_workspace_releases_total`	counter	project	Workspace releases
`sase_workspace_active`	gauge	project	Active workspaces

Notifications

Prometheus Name	Type	Labels	Description
`sase_notifications_sent_total`	counter	type, status	Notifications sent

Monitoring Stack

The sase telemetry export-config command exports a ready-to-use Docker Compose stack with the services, scrape configuration, alerting rules, and Grafana dashboard described below.

Services

Service	Port	Description
Push Gateway	9091	Receives metrics from short-lived agent processes
Prometheus	9090	Scrapes Push Gateway and axe exposition server; stores metrics
Grafana	3000	Pre-configured dashboard with panels for all subsystems

Prometheus Scrape Configuration

Prometheus is configured with two scrape jobs:

sase_axe: Scrapes the axe daemon's HTTP exposition server (port 9464)
pushgateway: Scrapes the Push Gateway (port 9091)

Global scrape interval: 15 seconds.

Alerting Rules

The bundled stack ships more than 20 alert rules covering:

Agent error rate thresholds (10%, 25%)
Agent p95 latency thresholds (300s, 600s)
LLM error rate and retry rate
Axe cycle errors
Hook retry rates
Workspace release failures

Alert labels use severity: warning and severity: critical.

Grafana Dashboard

A pre-built Grafana dashboard with panels for all telemetry subsystems is provisioned automatically. After starting the stack, it is available at http://localhost:3000.

Integration Points

Instrumentation is spread across the codebase:

Subsystem	Location	What is tracked
Agent	`src/sase/agent/launcher.py`	Runs, duration, spawns, kills
LLM Provider	`src/sase/llm_provider/_invoke.py`	Invocations, tokens, errors, retries
Axe	`src/sase/axe/orchestrator.py`	Cycles, errors, lumberjack activity
Hooks/Mentors	Hook execution modules	Execution counts, durations, retries
Beads	`src/sase/bead/project.py`	CRUD operations, status transitions
VCS	VCS operation modules	Commits, operations
Notifications	`src/sase/notifications/senders.py`	Notifications sent

The axe orchestrator calls init_telemetry(start_http_server=True) on startup to begin exposing metrics. Agent processes use push_metrics() to send data to the Push Gateway on exit via an atexit handler.
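The push-on-exit pattern for short-lived processes can be sketched with the standard library alone. This is not the code in `_registry.py`: `register_push_on_exit` is a hypothetical helper, and swallowing push errors is an assumed policy so that a telemetry failure cannot turn a clean exit into a crash.

```python
import atexit
from typing import Callable, List


def register_push_on_exit(push: Callable[[], None]) -> Callable[[], None]:
    """Register `push` to run at most once when the interpreter exits."""
    done = False

    def _handler() -> None:
        nonlocal done
        if done:  # guard against running twice (manual call plus atexit)
            return
        done = True
        try:
            push()
        except Exception:
            # Assumed policy: never let telemetry break process shutdown.
            pass

    atexit.register(_handler)
    return _handler


# Stand-in for a real push; a list records that it happened.
pushed: List[str] = []
handler = register_push_on_exit(lambda: pushed.append("metrics"))
handler()  # simulate process exit
handler()  # second call is ignored
```

In a real agent process, the callable passed in would wrap the actual push, for example `prometheus_client.push_to_gateway("localhost:9091", job="sase_agent", registry=registry)`, where the job name shown here is only a placeholder.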