Testing¶
Batty has three test surfaces. Pick the one that matches what you're trying to protect:
| Surface | When to use | Runs in CI? |
|---|---|---|
| cargo test --lib | Unit tests for any module. Fast, no tmux, no subprocess. | Yes, on every PR |
| cargo test --features integration | Tmux-dependent integration tests. | No (needs a tmux server) |
| cargo test --test scenarios --features scenario-test | End-to-end scenarios driving the real TeamDaemon with in-process fake shims on a per-test tempdir. | Yes, on every PR |
This document covers the scenario framework: the end-to-end
harness introduced in phases 1–3 under tickets #636–#646. For tmux
integration tests see the existing src/team/harness.rs docs; for
unit testing conventions see CLAUDE.md.
What the scenario framework gives you¶
A scenario test is a single #[test] in tests/scenarios/prescribed/
that:
- Builds a ScenarioFixture (tempdir + git repo + kanban board + TeamDaemon + optional fake shims).
- Drives the daemon through a deterministic sequence of ticks and asserts against TickReport + fixture state between calls.
- Runs in under 300 ms and cleans up on drop: no tmux, no subprocess, no shared state.
Phase 1 ships 22 prescriptive scenarios (one happy path + 7
regressions + 14 cross-feature) plus a randomized proptest-driven
fuzz harness with 10 invariants (ticket #645). Every recent release
bug has a dedicated regression scenario, and the fuzz targets
generate randomized workflow sequences that shrink to minimal
reproducers when an invariant fails.
Running the suite¶
# Full scenario suite + fuzz smoke (~60s)
cargo test --test scenarios --features scenario-test
# Just the prescriptive catalog
cargo test --test scenarios --features scenario-test prescribed::
# Just the regression catalog
cargo test --test scenarios --features scenario-test regressions::
# Just the fuzz targets (each case spawns a real TeamDaemon)
cargo test --test scenarios --features scenario-test fuzz_workflow
# Fuzz with more cases (nightly-style)
PROPTEST_CASES=512 cargo test --test scenarios --features scenario-test \
--release fuzz_workflow_happy
CI runs the default suite on every PR and a nightly fuzz job with
PROPTEST_CASES=2048 scheduled at 02:15 UTC. Both live in
.github/workflows/ci.yml.
Feeding external verification back into Batty¶
The daemon can ingest GitHub or CI check results from
.batty/github_verification.jsonl. Each line is one JSON object:
{"task_id":42,"branch":"eng-1/42","commit":"abcdef1","check_name":"ci/test","status":"failure","next_action":"fix the failing test"}
Use status: "failure" (or failed/error) to surface a task-scoped
blocker in batty status and review/dispatch intervention messages.
Use status: "success" (or passed) for a later record on the same
task/commit to clear the active blocker while retaining the previous
failed line as audit history. Records for unknown tasks or stale
branches/commits are ignored for blocking and emitted as daemon
warnings instead.
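The clearing rule above can be sketched as a replay of the file in order. This is a minimal model, not the daemon's implementation: the `Record` struct, `classify`, and `blocked_tasks` are hypothetical names, the key is assumed to be the (task, commit) pair, and statuses outside the two groups named above are assumed to leave the blocker state untouched. Filtering of unknown tasks and stale branches is not modeled.

```rust
use std::collections::HashMap;

/// One parsed line of the JSONL file. Hypothetical struct for this sketch;
/// it carries only the fields the blocker logic needs.
#[derive(Debug, Clone)]
struct Record {
    task_id: u32,
    commit: String,
    status: String,
}

/// How a status string affects blocking (assumed three-way split).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Outcome {
    Failing,
    Passing,
    Other,
}

fn classify(status: &str) -> Outcome {
    match status {
        "failure" | "failed" | "error" => Outcome::Failing,
        "success" | "passed" => Outcome::Passing,
        _ => Outcome::Other,
    }
}

/// Task ids that still have an active blocker after replaying every record
/// in file order. Assumption: the latest failing or passing record for a
/// (task, commit) pair wins; earlier lines remain in the file as history.
fn blocked_tasks(records: &[Record]) -> Vec<u32> {
    let mut latest: HashMap<(u32, &str), Outcome> = HashMap::new();
    for record in records {
        let outcome = classify(&record.status);
        if outcome != Outcome::Other {
            latest.insert((record.task_id, record.commit.as_str()), outcome);
        }
    }
    let mut blocked = Vec::new();
    for ((task_id, _commit), outcome) in latest {
        if outcome == Outcome::Failing {
            blocked.push(task_id);
        }
    }
    blocked.sort_unstable();
    blocked.dedup();
    blocked
}
```

With a failure followed by a success for task 42 on the same commit, and an uncleared error for task 7, this model reports only task 7 as blocked.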
batty release --readiness also reads the same file and renders a
release-level GitHub feedback section. Failing records for the current HEAD
block readiness, warning-only statuses such as warning or neutral are
reported without blocking, and records for non-HEAD commits are listed as stale
with their recorded commit and age. When no current failing or warning feedback
exists, the readiness artifact still prints an explicit clean section.
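As a rough illustration of the readiness rules, the sketch below sorts a single record into a report section. The `Section` enum and both functions are hypothetical, commits are compared by exact string equality (an assumption; the real comparison may accept abbreviated hashes), and the sketch classifies records one at a time without modeling a later success clearing an earlier failure.

```rust
/// Report section a record lands in. Hypothetical names for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum Section {
    Blocking,
    Warning,
    Stale,
    Passing,
}

/// Classify one record against the current HEAD commit.
fn readiness_section(record_commit: &str, status: &str, head: &str) -> Section {
    // Records for any other commit are listed as stale, whatever their status.
    if record_commit != head {
        return Section::Stale;
    }
    match status {
        "failure" | "failed" | "error" => Section::Blocking,
        "warning" | "neutral" => Section::Warning,
        // Assumption: everything else on HEAD is treated as non-blocking.
        _ => Section::Passing,
    }
}

/// Readiness is blocked if any (commit, status) record on HEAD is failing.
fn readiness_blocked(records: &[(&str, &str)], head: &str) -> bool {
    records
        .iter()
        .any(|(commit, status)| readiness_section(commit, status, head) == Section::Blocking)
}
```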
Writing a new prescriptive scenario¶
Every prescriptive scenario lives in tests/scenarios/prescribed/.
Keep each file under 80 lines (regressions) or 150 lines (cross-
feature). Start from the following template:
//! My scenario: <what this test proves>.
//!
//! Failure mode being protected: <what would break if the fix
//! regressed>.
use super::super::scenarios_common::ScenarioFixture;
#[test]
fn my_scenario_does_the_thing() {
    // 1. Build the fixture with the team shape this scenario needs.
    let mut fixture = ScenarioFixture::builder()
        .with_manager("manager")
        .with_engineers(1)
        .with_task(1, "my task", "todo", None)
        .build();
    // 2. Set up the initial fault state (optional).
    //    Examples: write_raw_task_file, append_raw_event_line,
    //    insert_fake_shim, set_active_task, etc.
    // 3. Drive the daemon.
    let report = fixture.tick();
    // 4. Assert against `report` and/or fixture state.
    assert!(report.subsystem_errors.is_empty());
    // 5. Always end with the consistency check.
    fixture.assert_state_consistent();
}
Then add a pub mod my_scenario; line to the appropriate
mod.rs (either prescribed/mod.rs or a subdirectory's
mod.rs). Run cargo test --test scenarios --features scenario-test my_scenario to make sure it passes, then commit.
If your scenario needs internal daemon state that the public
ScenarioFixture API doesn't expose, add a method on
ScenarioHooks in src/team/daemon/scenario_api.rs. That's the one
documented seam for reaching into the daemon from integration tests;
do NOT widen visibility on daemon fields.
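The shape of that seam can be illustrated with a self-contained stand-in. Everything below is hypothetical: the real TeamDaemon and ScenarioHooks have different fields and methods, and `scenario_hooks`, `set_active_task`, and `active_task` are invented here only to show how a hooks type in the same module can reach private state while the fields themselves stay private.

```rust
mod daemon {
    /// Stand-in for the real daemon. The field is private, so code outside
    /// this module cannot read or write it directly.
    pub struct TeamDaemon {
        active_tasks: Vec<(String, u32)>,
    }

    impl TeamDaemon {
        pub fn new() -> Self {
            TeamDaemon { active_tasks: Vec::new() }
        }

        /// The single seam that tests go through (hypothetical accessor).
        pub fn scenario_hooks(&mut self) -> ScenarioHooks<'_> {
            ScenarioHooks { daemon: self }
        }
    }

    /// Lives next to the daemon, so it can touch private fields on behalf
    /// of tests without any field becoming `pub`.
    pub struct ScenarioHooks<'a> {
        daemon: &'a mut TeamDaemon,
    }

    impl ScenarioHooks<'_> {
        /// Replace whatever task the engineer had with `task_id`.
        pub fn set_active_task(&mut self, engineer: &str, task_id: u32) {
            self.daemon.active_tasks.retain(|(name, _)| name != engineer);
            self.daemon.active_tasks.push((engineer.to_string(), task_id));
        }

        pub fn active_task(&self, engineer: &str) -> Option<u32> {
            self.daemon
                .active_tasks
                .iter()
                .find(|(name, _)| name == engineer)
                .map(|(_, id)| *id)
        }
    }
}
```

A test would call `daemon.scenario_hooks().set_active_task("eng-1", 7)` rather than poking at `active_tasks`, so each new piece of reachable state shows up as one reviewed method instead of a widened field.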
Fake shims¶
An in-process fake shim replaces a real agent subprocess with scripted behavior. Typical usage:
use batty_cli::shim::fake::ShimBehavior;
use std::path::PathBuf;
fixture.insert_fake_shim("eng-1");
fixture.shim("eng-1").queue(ShimBehavior::CompleteWith {
    response: "implemented the thing".to_string(),
    files_touched: vec![(
        PathBuf::from("src/lib.rs"),
        "// new code\n".to_string(),
    )],
});
// Simulate a dispatch from the manager and drain the fake's response.
fixture.send_to_shim("eng-1", "manager", "implement src/lib.rs");
let _events = fixture.process_shim("eng-1");
// Tick the daemon so it observes the completion.
let report = fixture.tick();
ShimBehavior variants:
- CompleteWith: happy-path completion; commits the listed files.
- NarrationOnly: completion with zero files.
- NarrationFirstThenClean: first call is narration, second is clean.
- ErrorOut: emits Event::Error.
- ContextExhausted: emits Event::ContextExhausted.
- Silent: swallows the command with no response.
- Script(Vec<Event>): verbatim event sequence.
- Once(Box<ShimBehavior>): apply inner once, then revert.
Fuzz targets¶
The fuzz harness lives under tests/scenarios/fuzz/:
- model.rs: pure ModelBoard + Transition data types.
- reference_sm.rs: ReferenceStateMachine impl + pure apply oracle.
- sut.rs: StateMachineTest impl mapping each transition to concrete ScenarioFixture operations.
- invariants.rs: ten cross-subsystem invariants checked after every transition.
- fuzz_workflow.rs: three prop_state_machine! targets.
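To make the split between model, oracle, and invariant concrete, here is a toy version using only the standard library. It is a sketch under stated assumptions, not the contents of those files: the real ModelBoard and Transition carry more state, `CompleteTask` and the first-claim-wins rule are invented for illustration, and the real harness wires these pieces through proptest rather than calling them by hand.

```rust
use std::collections::BTreeMap;

/// Toy model of the board: which engineer has claimed which task.
#[derive(Debug, Clone, Default, PartialEq)]
struct ModelBoard {
    claims: BTreeMap<u32, String>,
}

/// Toy transition set (the variants here are illustrative only).
#[derive(Debug, Clone)]
enum Transition {
    DispatchTask { task_id: u32, engineer: String },
    CompleteTask { task_id: u32 },
}

/// Pure oracle: what the board should look like after `t`.
/// Assumption: the first claim wins and a second dispatch is a no-op.
fn apply(mut board: ModelBoard, t: &Transition) -> ModelBoard {
    match t {
        Transition::DispatchTask { task_id, engineer } => {
            board.claims.entry(*task_id).or_insert_with(|| engineer.clone());
        }
        Transition::CompleteTask { task_id } => {
            board.claims.remove(task_id);
        }
    }
    board
}

/// Invariant checked against claims observed in the system under test,
/// given as (task_id, engineer) pairs.
fn claim_exclusivity(observed: &[(u32, &str)]) -> Result<(), String> {
    let mut seen: BTreeMap<u32, &str> = BTreeMap::new();
    for &(task_id, engineer) in observed {
        if let Some(previous) = seen.insert(task_id, engineer) {
            if previous != engineer {
                return Err(format!(
                    "claim_exclusivity: two engineers claim task #{task_id}"
                ));
            }
        }
    }
    Ok(())
}
```

The oracle never touches a fixture and the invariant never touches the model, which is what lets a failing sequence shrink without re-deriving expected state by hand.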
Reading a fuzz failure¶
When a fuzz case fails, proptest prints a seed and the full transition sequence that triggered the failure. Example output:
thread 'fuzz::fuzz_workflow::fuzz_workflow_happy' panicked at ...
assertion failed: claim_exclusivity: two engineers claim task #5
...
Minimal failing input: (
    initial_state: ModelBoard { ... },
    transitions: [
        DispatchTask { task_id: 5, engineer: "eng-1" },
        DispatchTask { task_id: 5, engineer: "eng-2" },
    ]
)
The transitions list is already shrunk to the minimal reproducer.
Copy it verbatim into a new file under
tests/scenarios/prescribed/regressions/ following the template
above. The shrunk sequence becomes a permanent regression guard that
runs on every PR.
Re-running a specific fuzz case¶
Proptest writes failing seeds to
tests/scenarios/fuzz/proptest-regressions/. To re-run a specific
seed:
PROPTEST_CASES=1 \
PROPTEST_REPLAY=<seed-from-output> \
cargo test --test scenarios --features scenario-test fuzz_workflow_happy
Or raise PROPTEST_CASES to probe the same shape with many more random cases:
PROPTEST_CASES=1024 cargo test --test scenarios --features \
scenario-test fuzz_workflow_happy
Debugging a flaky scenario¶
Run a single scenario in a tight loop:
for i in $(seq 1 20); do
  cargo test --test scenarios --features scenario-test my_scenario \
    || { echo "FAILED on run $i"; break; }
done
If the failure is intermittent, the most likely causes are:
- Wall-clock time dependency: some timer in the daemon fires on real-time bounds. Use ScenarioHooks::backdate_* to set timestamps explicitly instead of waiting.
- Parallel test ordering: two tests mutating the same global. All scenario tests should be hermetic (their own TempDir); if you see this, check for std::env or PATH mutation.
- Socketpair drain ordering: process_shim must be called between send_to_shim and the next tick so the fake can respond before the daemon polls.
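The first cause is the easiest to show in isolation. The sketch below is a hypothetical stand-in, not the daemon's code: `Heartbeat`, `is_stale`, and `backdate` are invented names that mimic what a ScenarioHooks::backdate_* method is described as doing, namely moving a stored timestamp into the past so a timeout trips immediately instead of after a real wait.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical piece of daemon state guarded by a wall-clock timeout.
struct Heartbeat {
    last_seen: SystemTime,
}

impl Heartbeat {
    /// True once at least `timeout` has elapsed since the last heartbeat.
    /// A `last_seen` in the future counts as not stale.
    fn is_stale(&self, now: SystemTime, timeout: Duration) -> bool {
        now.duration_since(self.last_seen)
            .map(|age| age >= timeout)
            .unwrap_or(false)
    }

    /// Test-only hook: pretend the last heartbeat happened `by` earlier.
    fn backdate(&mut self, by: Duration) {
        self.last_seen -= by;
    }
}

#[test]
fn timeout_trips_without_sleeping() {
    let now = SystemTime::now();
    let timeout = Duration::from_secs(30);
    let mut heartbeat = Heartbeat { last_seen: now };
    assert!(!heartbeat.is_stale(now, timeout));

    // No thread::sleep: move the timestamp instead of waiting 30 seconds.
    heartbeat.backdate(Duration::from_secs(31));
    assert!(heartbeat.is_stale(now, timeout));
}
```

Passing `now` in explicitly, rather than reading the clock inside `is_stale`, is what keeps the test independent of scheduler jitter.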
Non-goals¶
The scenario framework deliberately does NOT:
- Spawn real claude/codex/kiro subprocesses. Those live in src/shim/tests_sdk.rs, tests_codex.rs, tests_kiro.rs.
- Touch tmux. That's the integration feature.
- Mutate std::env or PATH. Every test must be hermetic.
- Use fail-rs failpoints. Phase 4 (future) adds source-level fault injection; phase 1 relies on ShimBehavior and direct filesystem manipulation.
Related¶
- Design plan: ~/.claude/plans/serene-pondering-snowglobe.md
- Execution order: planning/scenario-framework-execution.md
- Tickets #636–#646 on the batty board cover every phase-1 deliverable.