Skip to content

Grafana Alerting

These provisioning files add four stall-detection alerts under grafana/provisioning/alerting/:

  • Zero Activity: fires when events has no rows in the last 30 minutes.
  • Dispatch Stall: fires when the latest pipeline_starvation_detected event reports todo_tasks > 0 and no task_assigned event has happened for 20 minutes.
  • Agent Crash Spike: fires when more than 3 member_crashed or pane_death events land within 10 minutes.
  • Review Queue Backup: fires when a task has been completed for 30+ minutes without a later task_auto_merged, task_manual_merged, or task_reworked event.

Thresholds

Default windows and thresholds:

Alert Lookback Trigger for
Zero Activity 30m no events 5m
Dispatch Stall 20m todo work exists and no dispatches 2m
Agent Crash Spike 10m count > 3 2m
Review Queue Backup 30m count > 0 5m

Customization

To tune an alert, edit both of these sections in stall-detection.yaml:

  • relativeTimeRange.from: lookback window in seconds.
  • model.conditions[0].evaluator.params[0]: numeric threshold.
  • for: how long Grafana keeps the condition firing before notifying.

Notification routing lives in notification-channels.yaml. By default, alerts go to the local orchestrator log webhook at http://127.0.0.1:8787/grafana-alerts.

Telegram is intentionally left as an opt-in example. Uncomment the batty-telegram contact point and the routes block, then set:

  • BATTY_GRAFANA_TELEGRAM_BOT_TOKEN
  • BATTY_GRAFANA_TELEGRAM_CHAT_ID

Notes

Two alerts use telemetry-backed proxies rather than live board state:

  • Dispatch Stall depends on recent pipeline_starvation_detected events because the telemetry database does not persist current board status directly.
  • Review Queue Backup treats task_completed as the start of review and clears the queue when merge or rework events arrive.