Telemetry
This document describes the Prometheus-based telemetry system in sase. Telemetry tracks metrics across all major subsystems (agent lifecycle, LLM providers, axe daemon, hooks, beads, VCS/workspace, and notifications). It also provides CLI tools for monitoring and health checks, plus a bundled Docker Compose monitoring stack.
Table of Contents
- Overview
- Configuration
- CLI Commands
  - status
  - list
  - snapshot
  - dashboard
  - health
  - export-config
- Architecture
- Metric Catalog
- Monitoring Stack
- Integration Points
Overview
The telemetry system uses Prometheus metrics to instrument sase internals. Key design principles:
- Zero-cost when disabled: All metrics are lightweight no-op stubs when `telemetry.enabled` is `false` (the default). There is no runtime overhead unless telemetry is explicitly enabled.
- Dual data collection: Short-lived processes (agents) push metrics to a Push Gateway. Long-lived processes (axe daemon) expose metrics via an HTTP endpoint for Prometheus to scrape.
- 34 metrics across 7 subsystems: Comprehensive coverage of the full sase lifecycle.
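
The zero-cost principle above is commonly achieved with a null-object pattern: call sites always talk to a metric object, and a no-op stand-in is swapped in when telemetry is off. The following is a minimal sketch of that idea, not the actual contents of src/sase/telemetry/_stubs.py. The names `NoopMetric`, `InMemoryCounter`, and `make_counter` are hypothetical, and the in-memory counter merely stands in for a real Prometheus counter.

```python
from collections import defaultdict


class NoopMetric:
    """Hypothetical stub: accepts every metric call and does nothing."""

    def labels(self, **_labels):
        # A real labelled metric returns a child object; the stub returns itself.
        return self

    def inc(self, amount=1.0):
        return None

    def set(self, value):
        return None

    def observe(self, value):
        return None


class InMemoryCounter:
    """Hypothetical stand-in for a real counter, used only for illustration."""

    def __init__(self):
        self.totals = defaultdict(float)
        self._key = ()

    def labels(self, **labels):
        child = InMemoryCounter()
        child.totals = self.totals  # children share storage with the parent
        child._key = tuple(sorted(labels.items()))
        return child

    def inc(self, amount=1.0):
        self.totals[self._key] += amount


def make_counter(enabled):
    """Pick the real or the no-op implementation once, at definition time."""
    return InMemoryCounter() if enabled else NoopMetric()


# Instrumented code looks identical whether telemetry is on or off.
agent_runs = make_counter(enabled=False)
agent_runs.labels(llm_provider="example", status="ok", workflow="demo").inc()
```

Because the choice is made once when the metric is created, instrumented code needs no `if enabled` checks of its own.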
Configuration
Telemetry is configured under the telemetry key in sase.yml:

    telemetry:
      enabled: false                       # Toggle telemetry on/off globally
      prometheus:
        exposition_port: 9464              # HTTP server port for axe and agents
        pushgateway_url: "localhost:9091"  # Push Gateway address
      health_thresholds:
        error_rate_warn: 10.0              # % threshold for warn status
        error_rate_critical: 25.0          # % threshold for critical status
        retry_rate_warn: 10.0
        retry_rate_critical: 25.0
        p95_latency_warn: 300.0            # seconds
        p95_latency_critical: 600.0

| Field | Type | Default | Description |
|---|---|---|---|
| `telemetry.enabled` | bool | `false` | Enable or disable telemetry globally. |
| `telemetry.prometheus.exposition_port` | int | `9464` | HTTP server port for metric exposition. |
| `telemetry.prometheus.pushgateway_url` | str | `localhost:9091` | Prometheus Push Gateway address. |
| `telemetry.health_thresholds.*_warn` | float | varies | Percentage threshold for WARN health status. |
| `telemetry.health_thresholds.*_critical` | float | varies | Percentage threshold for CRITICAL status. |
| `telemetry.health_thresholds.p95_latency_*` | float | 300/600 | P95 latency thresholds in seconds. |
Source: src/sase/default_config.yml, src/sase/telemetry/_config.py
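
To illustrate how the defaults in the table combine with partial user overrides, here is a minimal sketch that starts from an already-parsed `sase.yml` mapping. It is not the code in src/sase/telemetry/_config.py: the class names `HealthThresholds` and `TelemetryConfig` and the function `load_telemetry_config` are hypothetical, and only the field names and default values come from the configuration shown above.

```python
from dataclasses import dataclass, field, fields


@dataclass(frozen=True)
class HealthThresholds:
    # Defaults copied from the documented sase.yml example.
    error_rate_warn: float = 10.0
    error_rate_critical: float = 25.0
    retry_rate_warn: float = 10.0
    retry_rate_critical: float = 25.0
    p95_latency_warn: float = 300.0      # seconds
    p95_latency_critical: float = 600.0  # seconds


@dataclass(frozen=True)
class TelemetryConfig:
    enabled: bool = False
    exposition_port: int = 9464
    pushgateway_url: str = "localhost:9091"
    health_thresholds: HealthThresholds = field(default_factory=HealthThresholds)


def load_telemetry_config(raw: dict) -> TelemetryConfig:
    """Build a config from a parsed YAML mapping, falling back to defaults."""
    section = raw.get("telemetry") or {}
    prom = section.get("prometheus") or {}
    known = {f.name for f in fields(HealthThresholds)}
    # Assumption: unknown threshold keys are ignored rather than rejected.
    overrides = {
        key: float(value)
        for key, value in (section.get("health_thresholds") or {}).items()
        if key in known
    }
    return TelemetryConfig(
        enabled=bool(section.get("enabled", False)),
        exposition_port=int(prom.get("exposition_port", 9464)),
        pushgateway_url=str(prom.get("pushgateway_url", "localhost:9091")),
        health_thresholds=HealthThresholds(**overrides),
    )


config = load_telemetry_config({"telemetry": {"enabled": True}})
```

An empty mapping yields a disabled configuration with every default intact, which matches the "disabled by default" behavior described earlier.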
CLI Commands
sase telemetry status
Quick health check and configuration display. Shows telemetry enabled/disabled status, metric counts by type, and reachability of the Push Gateway and HTTP exposition server.

    sase telemetry status

sase telemetry list
Display the metric catalog from internal definitions, grouped by subsystem.

    sase telemetry list                       # All metrics
    sase telemetry list -s "Agent Lifecycle"  # Filter by subsystem
    sase telemetry list -t counter            # Filter by type (counter, gauge, histogram)

| Flag | Values | Default | Description |
|---|---|---|---|
| `-s`, `--subsystem` | subsystem name | - | Filter to a subsystem |
| `-t`, `--type` | `counter`, `gauge`, `histogram` | - | Filter by metric type |
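
The two filters can be thought of as simple predicates over a list of metric definitions. The sketch below assumes a hypothetical `MetricDef` record and `filter_catalog` helper, and it assumes exact, case-sensitive matching on the subsystem name. The real catalog structure in src/sase/telemetry/catalog.py may look different. The four sample entries are taken from the metric catalog in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDef:
    name: str
    type: str        # "counter", "gauge", or "histogram"
    subsystem: str
    labels: tuple = ()


# A small excerpt of the documented catalog, for illustration only.
CATALOG = [
    MetricDef("sase_agent_runs_total", "counter", "Agent Lifecycle",
              ("llm_provider", "status", "workflow")),
    MetricDef("sase_agent_active", "gauge", "Agent Lifecycle",
              ("llm_provider", "project")),
    MetricDef("sase_llm_invocation_duration_seconds", "histogram", "LLM Provider",
              ("provider",)),
    MetricDef("sase_llm_retries_total", "counter", "LLM Provider", ("provider",)),
]


def filter_catalog(catalog, subsystem=None, metric_type=None):
    """Return the entries matching every filter that was supplied."""
    return [
        metric
        for metric in catalog
        if (subsystem is None or metric.subsystem == subsystem)
        and (metric_type is None or metric.type == metric_type)
    ]


for metric in filter_catalog(CATALOG, subsystem="Agent Lifecycle"):
    print(f"{metric.name} ({metric.type})")
```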
sase telemetry snapshot
Fetch and display current metric values from the Push Gateway or exposition endpoint.

    sase telemetry snapshot                 # Rich table (default)
    sase telemetry snapshot -f json         # JSON output
    sase telemetry snapshot -f prometheus   # Raw Prometheus text format
    sase telemetry snapshot -S pushgateway  # Force pushgateway source

| Flag | Values | Default | Description |
|---|---|---|---|
| `-S`, `--source` | `auto`, `pushgateway`, `exposition` | `auto` | Data source |
| `-f`, `--format` | `rich`, `json`, `prometheus` | `rich` | Output format |
| `-s`, `--subsystem` | subsystem name | - | Filter by subsystem |
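
Both the Push Gateway and the exposition endpoint return the Prometheus text format, so a snapshot boils down to fetching that text and parsing sample lines. The sketch below shows the parsing half only. `parse_prometheus_text` is a hypothetical helper rather than the parser in src/sase/telemetry/scrape.py, and it deliberately ignores timestamps and does not unescape label values.

```python
import re

# metric_name{label="value",...} value   (the label block is optional)
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_prometheus_text(text):
    """Return (name, labels, value) tuples for every sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


EXAMPLE = """\
# HELP sase_agent_active Currently active agents
# TYPE sase_agent_active gauge
sase_agent_active{llm_provider="example",project="demo"} 2
sase_axe_lumberjacks_active 3
"""

for name, labels, value in parse_prometheus_text(EXAMPLE):
    print(name, labels, value)
```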
sase telemetry dashboard
Live auto-refreshing TUI dashboard with two modes: summary (default) and charts.

    sase telemetry dashboard                       # Summary mode, 5s refresh
    sase telemetry dashboard -c                    # Charts mode with historical data
    sase telemetry dashboard -c -r 24h             # Charts over last 24 hours
    sase telemetry dashboard -c -s "LLM Provider"  # Focus on one subsystem
    sase telemetry dashboard -i 10                 # 10-second refresh interval

| Flag | Values | Default | Description |
|---|---|---|---|
| `-i`, `--interval` | int (seconds) | `5` | Refresh interval |
| `-S`, `--source` | `auto`, `pushgateway`, `exposition` | `auto` | Data source |
| `-c`, `--charts` | flag | - | Enable charts mode with historical data |
| `-r`, `--range` | `1h`, `6h`, `24h`, `7d` | `1h` | Time range for charts |
| `-s`, `--subsystem` | subsystem name | - | Focus on a single subsystem (larger charts) |
Summary mode shows styled stat panels with color-coded gauges (Active Agents, Active Workspaces, Active Beads) and compact subsystem metric tables.
Charts mode (-c) renders historical line and bar charts via Prometheus range queries, showing: Agent Run Duration,
Active Agents, Agent Throughput, LLM Token Usage, LLM Latency, and Error Rate. Falls back to summary mode when
Prometheus is unreachable.
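
Charts mode relies on Prometheus range queries, which go through the standard `/api/v1/query_range` HTTP endpoint with `query`, `start`, `end`, and `step` parameters. The sketch below only builds such a request URL for the documented `--range` values. `build_range_query_url`, the `points` parameter, and the example PromQL expression are hypothetical and not taken from src/sase/telemetry/prom_query.py.

```python
import time
from urllib.parse import urlencode

# The documented --range choices, expressed in seconds.
RANGE_SECONDS = {"1h": 3600, "6h": 6 * 3600, "24h": 24 * 3600, "7d": 7 * 24 * 3600}


def build_range_query_url(base_url, promql, time_range="1h", points=60, now=None):
    """Build a query_range URL covering the last `time_range` up to `now`."""
    span = RANGE_SECONDS[time_range]
    end = time.time() if now is None else now
    # Assumption: aim for roughly `points` samples per chart.
    step = max(span // points, 1)
    params = {
        "query": promql,
        "start": f"{end - span:.0f}",
        "end": f"{end:.0f}",
        "step": f"{step}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


url = build_range_query_url(
    "http://localhost:9090",     # Prometheus port used by the bundled stack
    "sum(sase_agent_active)",    # hypothetical query for the Active Agents chart
    time_range="24h",
)
print(url)
```

A caller would fetch this URL and, if the request fails, fall back to summary mode as described above.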
sase telemetry health
Traffic-light health assessment with OK/WARN/CRITICAL status for each subsystem based on configured thresholds.

    sase telemetry health     # Rich output
    sase telemetry health -j  # Machine-readable JSON

| Flag | Values | Default | Description |
|---|---|---|---|
| `-j`, `--json` | flag | - | JSON output |
| `-S`, `--source` | `auto`, `pushgateway`, `exposition` | `auto` | Data source |
Exit codes: 0 (healthy), 1 (degraded/warn), 2 (critical).
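
The mapping from thresholds to statuses and exit codes can be sketched as follows. This is an illustration under stated assumptions, not the shipped logic: the helper names are hypothetical, thresholds are treated as inclusive (a value equal to the threshold trips it), and the overall exit code is assumed to reflect the worst subsystem status.

```python
OK, WARN, CRITICAL = "OK", "WARN", "CRITICAL"
EXIT_CODES = {OK: 0, WARN: 1, CRITICAL: 2}  # documented exit codes


def error_rate_percent(errors, total):
    """Error rate as a percentage; an idle subsystem counts as 0%."""
    return 0.0 if total == 0 else 100.0 * errors / total


def classify(value, warn, critical):
    """Map a measured value onto a traffic-light status."""
    if value >= critical:
        return CRITICAL
    if value >= warn:
        return WARN
    return OK


def overall_exit_code(statuses):
    """Assumption: the process exits with the worst status seen."""
    return max((EXIT_CODES[status] for status in statuses), default=0)


statuses = [
    classify(error_rate_percent(12, 100), warn=10.0, critical=25.0),  # error rate
    classify(120.0, warn=300.0, critical=600.0),                      # p95 latency, s
]
print(statuses, overall_exit_code(statuses))
```

The non-zero exit codes make the command usable as a gate in shell scripts or CI jobs.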
sase telemetry export-config
Export the bundled monitoring stack (Docker Compose, Prometheus config, Grafana dashboard) to a local directory.

    sase telemetry export-config                     # Default: ./sase-monitoring/
    sase telemetry export-config -o /tmp/monitoring  # Custom output directory
    sase telemetry export-config -f                  # Overwrite existing

| Flag | Values | Default | Description |
|---|---|---|---|
| `-o`, `--output-dir` | path | `./sase-monitoring/` | Target directory |
| `-f`, `--force` | flag | - | Overwrite target if it exists |

After exporting, start the stack with:

    cd sase-monitoring && docker compose up -d

Architecture

    ┌──────────────────────────────────────────────────────────┐
    │                    sase telemetry CLI                    │
    │  status · list · snapshot · dashboard · health · export  │
    ├──────────────────────────────────────────────────────────┤
    │                      Scrape Client                       │
    │        (fetches from Push Gateway or exposition)         │
    ├─────────────────────┬────────────────────────────────────┤
    │    Push Gateway     │       HTTP Exposition Server       │
    │ (short-lived procs) │      (long-lived: axe daemon)      │
    ├─────────────────────┴────────────────────────────────────┤
    │                 Prometheus Metrics Layer                 │
    │   metrics.py (34 metric singletons, stub/real switch)    │
    ├──────────────────────────────────────────────────────────┤
    │                  Instrumentation Points                  │
    │ Agent · LLM · Axe · Hooks · Beads · VCS · Notifications  │
    └──────────────────────────────────────────────────────────┘

Source Layout
| File / Directory | Purpose |
|---|---|
| `src/sase/telemetry/__init__.py` | Public API exports |
| `src/sase/telemetry/metrics.py` | Module-level metric singletons (34 attrs) |
| `src/sase/telemetry/_registry.py` | Init, Push Gateway integration, atexit |
| `src/sase/telemetry/_stubs.py` | No-op stub classes (zero overhead) |
| `src/sase/telemetry/_config.py` | Configuration loading from sase.yml |
| `src/sase/telemetry/catalog.py` | Structured metric catalog and grouping |
| `src/sase/telemetry/scrape.py` | HTTP client and Prometheus text parser |
| `src/sase/telemetry/prom_query.py` | Prometheus HTTP API client (range queries) |
| `src/sase/telemetry/charts.py` | Terminal chart rendering via plotext |
| `src/sase/telemetry/cli_*.py` | CLI subcommand handlers |
| `src/sase/telemetry/monitoring/` | Bundled Docker Compose + Prometheus + Grafana |
Metric Catalog
34 metrics organized into 7 subsystems:
Agent Lifecycle
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_agent_runs_total` | counter | llm_provider, status, workflow | Total agent runs |
| `sase_agent_run_duration_seconds` | histogram | llm_provider, workflow | Agent run duration |
| `sase_agent_active` | gauge | llm_provider, project | Currently active agents |
| `sase_agent_spawns_total` | counter | llm_provider, project | Total agent spawns |
| `sase_agent_kills_total` | counter | reason | Total agent kills |
LLM Provider
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_llm_invocations_total` | counter | provider, status | Total LLM invocations |
| `sase_llm_invocation_duration_seconds` | histogram | provider | Invocation duration |
| `sase_llm_errors_total` | counter | provider, error_type | LLM errors |
| `sase_llm_retries_total` | counter | provider | LLM retries |
| `sase_llm_retry_spawns_total` | counter | outcome | Cross-process retries |
| `sase_llm_input_tokens_total` | counter | provider | Input tokens consumed |
| `sase_llm_output_tokens_total` | counter | provider | Output tokens generated |
| `sase_llm_cache_read_tokens_total` | counter | provider | Cache-read tokens |
Axe Orchestrator
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_axe_cycles_total` | counter | cycle_type | Total axe cycles |
| `sase_axe_cycle_duration_seconds` | histogram | cycle_type | Cycle duration |
| `sase_axe_lumberjacks_active` | gauge | - | Active lumberjacks |
| `sase_axe_lumberjack_restarts_total` | counter | - | Lumberjack restarts |
| `sase_axe_errors_total` | counter | error_type | Axe errors |
Hooks / Mentors / Workflows
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_hook_executions_total` | counter | hook_type, status | Hook executions |
| `sase_hook_duration_seconds` | histogram | hook_type | Hook duration |
| `sase_hook_retries_total` | counter | hook_type | Hook retries |
| `sase_mentor_executions_total` | counter | status | Mentor executions |
| `sase_workflow_executions_total` | counter | workflow, status | Workflow executions |
| `sase_workflow_duration_seconds` | histogram | workflow | Workflow duration |
| `sase_zombie_detections_total` | counter | - | Zombie process detections |
Beads
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_bead_operations_total` | counter | operation | Bead CRUD operations |
| `sase_bead_status_transitions_total` | counter | from_status, to_status | Bead status transitions |
| `sase_bead_active` | gauge | project, status | Active beads |
VCS / Workspace
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_vcs_commits_total` | counter | provider, type | VCS commits |
| `sase_vcs_operations_total` | counter | provider, operation, status | VCS operations |
| `sase_workspace_acquisitions_total` | counter | project | Workspace acquisitions |
| `sase_workspace_releases_total` | counter | project | Workspace releases |
| `sase_workspace_active` | gauge | project | Active workspaces |
Notifications
| Prometheus Name | Type | Labels | Description |
|---|---|---|---|
| `sase_notifications_sent_total` | counter | type, status | Notifications sent |
Monitoring Stack
The `sase telemetry export-config` command exports a ready-to-use Docker Compose stack:
Services
| Service | Port | Description |
|---|---|---|
| Push Gateway | 9091 | Receives metrics from short-lived agent processes |
| Prometheus | 9090 | Scrapes Push Gateway and axe exposition server; stores metrics |
| Grafana | 3000 | Pre-configured dashboard with panels for all subsystems |
Prometheus Scrape Configuration
Prometheus is configured with two scrape jobs:
- `sase_axe`: Scrapes the axe daemon's HTTP exposition server (port 9464)
- `pushgateway`: Scrapes the Push Gateway (port 9091)
Global scrape interval: 15 seconds.
Alerting Rules
20+ alert rules covering:
- Agent error rate thresholds (10%, 25%)
- Agent p95 latency thresholds (300s, 600s)
- LLM error rate and retry rate
- Axe cycle errors
- Hook retry rates
- Workspace release failures
Alert labels use severity: warning and severity: critical.
Grafana Dashboard
A pre-built Grafana dashboard is automatically provisioned with panels for all telemetry subsystems, accessible at
http://localhost:3000 after starting the stack.
Integration Points
Telemetry is instrumented across the codebase:
| Subsystem | Location | What is tracked |
|---|---|---|
| Agent | `src/sase/agent/launcher.py` | Runs, duration, spawns, kills |
| LLM Provider | `src/sase/llm_provider/_invoke.py` | Invocations, tokens, errors, retries |
| Axe | `src/sase/axe/orchestrator.py` | Cycles, errors, lumberjack activity |
| Hooks/Mentors | Hook execution modules | Execution counts, durations, retries |
| Beads | `src/sase/bead/project.py` | CRUD operations, status transitions |
| VCS | VCS operation modules | Commits, operations |
| Notifications | `src/sase/notifications/senders.py` | Notifications sent |
The axe orchestrator calls init_telemetry(start_http_server=True) on startup to begin exposing metrics. Agent
processes use push_metrics() to send data to the Push Gateway on exit via an atexit handler.
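
The push-on-exit behavior for short-lived processes can be sketched with the standard `atexit` module. This is an illustration only: `make_exit_pusher` and `register_push_on_exit` are hypothetical names, the `push` callable stands in for whatever `push_metrics()` does internally, and swallowing push failures so that a missing gateway never breaks agent shutdown is an assumption rather than documented behavior.

```python
import atexit
import logging

logger = logging.getLogger("telemetry_sketch")


def make_exit_pusher(push, enabled):
    """Wrap `push` so it is safe to run during interpreter shutdown."""

    def _push_once():
        if not enabled:
            return False  # telemetry disabled: nothing to send
        try:
            push()
        except Exception:
            # Assumption: a failed push is logged and never re-raised.
            logger.warning("metric push failed", exc_info=True)
            return False
        return True

    return _push_once


def register_push_on_exit(push, enabled):
    """Register the wrapped push to run once when the process exits."""
    handler = make_exit_pusher(push, enabled)
    atexit.register(handler)
    return handler


# Placeholder push function; a real one would send metrics to the Push Gateway.
def fake_push():
    print("pushing metrics to localhost:9091")


register_push_on_exit(fake_push, enabled=False)
```

Keeping the handler separate from its registration makes the failure handling easy to exercise without actually exiting the process.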