Grafana Alerting¶
These provisioning files add four stall-detection alerts under grafana/provisioning/alerting/:
Zero Activity: fires wheneventshas no rows in the last 30 minutes.Dispatch Stall: fires when the latestpipeline_starvation_detectedevent reportstodo_tasks > 0and notask_assignedevent has happened for 20 minutes.Agent Crash Spike: fires when more than 3member_crashedorpane_deathevents land within 10 minutes.Review Queue Backup: fires when a task has been completed for 30+ minutes without a latertask_auto_merged,task_manual_merged, ortask_reworkedevent.
Thresholds¶
Default windows and thresholds:
| Alert | Lookback | Trigger | for |
|---|---|---|---|
| Zero Activity | 30m | no events | 5m |
| Dispatch Stall | 20m | todo work exists and no dispatches | 2m |
| Agent Crash Spike | 10m | count > 3 |
2m |
| Review Queue Backup | 30m | count > 0 |
5m |
Customization¶
To tune an alert, edit both of these sections in stall-detection.yaml:
relativeTimeRange.from: lookback window in seconds.model.conditions[0].evaluator.params[0]: numeric threshold.for: how long Grafana keeps the condition firing before notifying.
Notification routing lives in notification-channels.yaml. By default, alerts go to the local orchestrator log webhook at http://127.0.0.1:8787/grafana-alerts.
Telegram is intentionally left as an opt-in example. Uncomment the batty-telegram contact point and the routes block, then set:
BATTY_GRAFANA_TELEGRAM_BOT_TOKENBATTY_GRAFANA_TELEGRAM_CHAT_ID
Notes¶
Two alerts use telemetry-backed proxies rather than live board state:
Dispatch Stalldepends on recentpipeline_starvation_detectedevents because the telemetry database does not persist current board status directly.Review Queue Backuptreatstask_completedas the start of review and clears the queue when merge or rework events arrive.