Autonomous coding agents¶

How this repo uses Charlie Labs and a Claude Agent SDK dispatcher to pick up issues, ship PRs, and maintain the codebase under human review.

The shape¶

Issues land in GitHub (typed by the operator, or auto-created from the in-app feedback widget — see LIS-301). The operator triages each with a complexity label. Agents read the labels and decide whether to act.

feedback widget        ┐
manual operator typing ┤── GitHub issue ──┬── ready-for-agent + complexity:s ──→ Charlie
issue migrated from FB ┘                  ├── ready-for-agent + complexity:m ──→ Charlie
                                          ├── ready-for-agent + complexity:l ──→ human (operator pairs with Claude in chat)
                                          └── human-only (auto-applied)      ──→ human (no agent)
                                                          │
                                          ┌───────────────┴───────────────┐
                                          │  Agent opens draft PR         │
                                          │  CI runs (test + docker smoke)│
                                          │  Operator reviews + merges    │
                                          │  GH Actions auto-deploys      │
                                          └───────────────────────────────┘

Three active tiers, in order of increasing trust required of the actor (T2 is deferred; see below):

Tier	Who	Scope	Identity
T1	Charlie Labs	`complexity:s` AND `complexity:m` (≤3 files, ≤200 lines). Lint, dep upgrades, bug fixes, scoped features with tests, small refactors. Expanded from `complexity:s`-only after a 2026-05-16 trial.	GitHub App: `charlie[bot]`
T2	Claude Agent SDK dispatcher	Deferred indefinitely after Charlie's trial. See "Trial outcome" below.	GH Actions: `github-actions[bot]` + `Co-Authored-By: Claude` trailer
T3	Operator + Claude in chat	Open-ended design or execution work. Architectural decisions stay here.	Operator commits, Claude as co-author
T4	Human only	CODEOWNERS-protected paths, anything labeled `human-only`	Operator

Trial outcome (2026-05-16)¶

Five issues across two batches went through Charlie's coding-agent tier, and Charlie also reviewed one docs PR (#57) unprompted. Summary table first; per-issue notes below.

Issue	Type	Outcome	Grade	Iteration?
#48	`complexity:s`, Dockerfile-touching	Refused (correct)	n/a	—
#49	`complexity:s`, Dockerfile-touching	Refused → triage → workaround → PR #52	9/9	Operator-approved alt path
#50	`complexity:s`, template change	Shipped PR #51 first-pass	8/9	—
#55	`complexity:m`, admin feature	Shipped PR #56	7/9 → 8/9	Description fix on operator comment
#57	docs PR (Claude-authored)	Charlie autonomously reviewed: "no actionable feedback"	LGTM	—
#58	`complexity:m`, investigation	Shipped PR #59	7/9 → 9/9	Analytical follow-through on operator comment

Per-issue detail¶

#48 venv PATH (Dockerfile) — refused with citation of CLAUDE.md, .github/CODEOWNERS, and the labeler guardrail. Correct refusal.
#49 CPU-only torch (Dockerfile) — refused, but then triaged the spec for technical correctness, identified a non-Dockerfile workaround (pyproject.toml direct-URL wheel pin), and asked for explicit operator approval before acting. After approval, shipped PR #52 with a 9/9 grade. The workaround was cleaner than the operator's original Dockerfile spec — Charlie surfaced a better solution rather than executing the lower-quality one literally.
#50 access-denied template — shipped PR #51 at 8/9 (the missed point was no "Test plan" section in the PR body — acceptable given the issue's own "no tests required" acceptance criteria).
#55 admin access-requests page — shipped PR #56 at 7/9 initially. Description's first Summary bullet claimed credit for routes that existed from PR #39. Operator left a soft-feedback comment; Charlie corrected the description cleanly (single comment, no extra code push), elevating the grade to 8/9. Proved Charlie can take iterative review feedback.
#57 docs PR (CLAUDE.md amendments) — first observation of Charlie's autonomous review behavior. Charlie reviewed a Claude-authored PR without being asked and left a single "no actionable feedback" comment. Useful signal: Charlie acts as a low-noise extra pair of eyes on every PR.
#58 slot done-tracker investigation — investigation-shaped (not spec-shaped). Charlie wrote a failing test and proposed a fix in compute_next (not _slot_done, which would have been broader than scoped). The initial PR was 7/9 because the description didn't include the investigation findings. The operator commented asking Charlie to verify whether the original user-reported symptom (strip-tick logic) was independently correct. Charlie responded with a full analytical trail: it traced through app/main.py:747, ran a targeted repro with three test cases, and surfaced an unrelated caveat ("based on marked completed today, not opened") as a potential follow-up. Final grade 9/9. Proved Charlie can do analytical investigation, not just spec execution.

Verdict¶

Clear pass across three different work shapes:

1. Spec-shaped (#50, #55) — translate acceptance criteria into a working PR
2. Operator-approved alternative path (#49 → #52) — Charlie surfaces a better solution and gets it sanctioned
3. Investigation-shaped (#58) — diagnose root cause, write a failing test, fix in the right layer

The CODEOWNERS / CLAUDE.md boundary is enforced at the document-reading layer (Charlie reads these files at session start), not just by the post-PR auto-labeler. Stronger guarantee than originally designed for.

T2 SDK dispatcher was originally scoped as insurance: if Charlie disappointed on complexity:m work, the SDK dispatcher gave us an in-house path. Charlie didn't disappoint. Building T2 now would be carrying weight for no current benefit. We retain the option to ship it if (a) Charlie regresses in quality, or (b) a future need arises for an agent path with no vendor dependency.

Iteration patterns that work¶

Three soft-feedback patterns observed working cleanly on Charlie's PRs:

1. Description correction — comment pointing at a specific bullet that's wrong → Charlie edits the body, no code push. (PR #56)
2. Analytical follow-through — comment asking for verification of a related-but-untested concern → Charlie traces through code, runs targeted repros, posts findings as a comment. (PR #59)
3. Better-solution negotiation — Charlie refuses CODEOWNERS-protected work, proposes alternative path → operator approves with "proceed with X on #N" reply → Charlie executes on the approved path. (#49 → PR #52)

Patterns 1 and 2 are documented in CLAUDE.md so future Charlie runs know what shape of comment maps to what shape of action.

Identity model — no second GitHub account¶

Two design constraints made this clean:

Charlie installs as a GitHub App (charlie[bot]). It's a service identity by construction; no user, no PAT, no 2FA setup. One-click install and one-click revoke from the repo's Installed GitHub Apps settings.
T2 runs in GitHub Actions, where the auto-issued GITHUB_TOKEN is per-workflow-run, scoped via the workflow's permissions: block, and cannot trigger downstream workflows on push (anti-recursion). The actor on commits and PRs is github-actions[bot], with Co-Authored-By: Claude <noreply@anthropic.com> on the commit message so the LLM side is visible in git log.

The previous "create a lis-bench-bot GitHub account with a fine-grained PAT" approach was rejected during design because it required managing a second identity and standing-credential rotation.

Routing by label¶

A few label combinations are meaningful:

ready-for-agent + (complexity:s OR complexity:m) → Charlie may act.
human-only → nobody but the operator. Auto-applied by .github/workflows/labeler.yml to PRs touching CODEOWNERS paths.
agent-blocked → applied after repeated CI failures by an agent. Requires operator unblock before the agent retries.
vendor:charlie / vendor:claude → pin a specific tier. vendor:claude currently has no consumer (T2 deferred); kept in the label set so we can route to a future dispatcher without renaming.
agent:run → reserved for a future SDK-dispatcher trigger; no consumer today.

Label definitions in .github/labels.yml; synced to GitHub by .github/workflows/sync-labels.yml.
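The routing rules above can be sketched as a small pure function. This is a minimal illustration, not code from the repo: the function name `route`, the return strings, and the precedence (blocking labels win over `ready-for-agent`) are assumptions. The `vendor:*` and `agent:run` labels are left out because they have no consumer today.

```python
from typing import Iterable

# Complexity labels Charlie (T1) may pick up, per the routing rules above.
CHARLIE_COMPLEXITIES = frozenset({"complexity:s", "complexity:m"})


def route(labels: Iterable[str]) -> str:
    """Map an issue's labels to the actor allowed to work on it.

    The return values ("operator", "charlie", "operator+claude",
    "untriaged") are illustrative names, not identifiers from the repo.
    """
    label_set = set(labels)

    # Assumption: blocking labels take precedence over everything else.
    if "human-only" in label_set or "agent-blocked" in label_set:
        return "operator"

    # Without ready-for-agent, no agent should act.
    if "ready-for-agent" not in label_set:
        return "untriaged"

    if label_set & CHARLIE_COMPLEXITIES:
        return "charlie"
    if "complexity:l" in label_set:
        # T3: the operator pairs with Claude in chat.
        return "operator+claude"
    return "untriaged"
```

For example, `route(["ready-for-agent", "complexity:m"])` yields `"charlie"`, while adding `human-only` to the same issue hands it back to the operator.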

T2 — Claude SDK dispatcher pipeline (deferred design)¶

This section is kept as the design we'd ship if Charlie became inadequate or we needed a self-hosted alternative. It is not built. See "Trial outcome" above for why.

The shape was: when an issue gets agent:run, a workflow fires three subagents in sequence. Splitting reduces cost (Haiku for plan + review), reduces risk (the guardrail step sits between the plan and any code change), and makes the work explainable (the plan is visible in CI logs before any commit lands).

issue labeled `agent:run`
  ↓
PLAN subagent  ── Haiku, read-only
   inputs:  CLAUDE.md, relevant skills, issue body
   output:  structured plan: { file list, approach, tests, est. LoC }
  ↓
GUARDRAIL CHECK ── pure Python, no LLM
   - any plan file in CODEOWNERS? → abort, comment "human-only path"
   - file count > 3 or estimated LoC > 200? → abort, recommend label upgrade
   - plan references files that don't exist? → abort, comment
  ↓
BUILD subagent ── Sonnet, write+edit+bash, scoped to a git worktree
   inputs:  plan, CLAUDE.md, skills
   output:  diff applied + pytest tests/unit/ run + green
   on red:  comment with the test output, no PR opened
  ↓
REVIEW subagent ── Haiku, read-only
   inputs:  diff, plan, test output
   output:  PR title, PR body (why + test plan section)
  ↓
OPEN draft PR via gh CLI, identity = github-actions[bot]

Failure modes are explicit: the dispatcher comments on the issue and exits without opening a PR if the guardrail rejects the plan, or if pytest fails after the build step. The operator then either rewrites the issue spec or escalates the complexity label.
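Since T2 is not built, no guardrail code exists; the following is a sketch of how the three guardrail checks in the diagram might look in pure Python. The `Plan` dataclass, the `check_plan` name, and the message strings are hypothetical. Matching protected paths with `fnmatch` is a simplification: real CODEOWNERS patterns follow gitignore-like rules that `fnmatch` does not fully reproduce.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Collection, Optional, Sequence

MAX_FILES = 3   # guardrail: file count > 3 aborts
MAX_LOC = 200   # guardrail: estimated LoC > 200 aborts


@dataclass(frozen=True)
class Plan:
    """Hypothetical shape of the PLAN subagent's structured output."""
    files: tuple[str, ...]
    estimated_loc: int


def check_plan(
    plan: Plan,
    protected_patterns: Sequence[str],
    repo_files: Collection[str],
) -> Optional[str]:
    """Return None if the plan may proceed, else the abort comment text."""
    # 1. Any planned file on a CODEOWNERS-protected path aborts the run.
    for path in plan.files:
        if any(fnmatch(path, pattern) for pattern in protected_patterns):
            return f"human-only path: {path}"

    # 2. Plans over the size budget need a larger complexity label.
    if len(plan.files) > MAX_FILES or plan.estimated_loc > MAX_LOC:
        return "plan exceeds 3 files / 200 lines; recommend label upgrade"

    # 3. Plans that reference nonexistent files are rejected.
    missing = [path for path in plan.files if path not in repo_files]
    if missing:
        return "plan references files that do not exist: " + ", ".join(missing)

    return None
```

Returning a message instead of raising keeps the dispatcher's failure path simple: a non-`None` result is posted as the issue comment, and the run exits before the BUILD step.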

Agent knowledge lives in the repo¶

Two artifacts give agents durable context:

CLAUDE.md at repo root — the agent's reading list. Kept short on purpose (per the Claude Code agent guide). Stack, key commands, CODEOWNERS list, conventions, and the subtle things that bite. Both Charlie and the SDK dispatcher read this at session start.
.claude/skills/<name>/SKILL.md — versioned, reusable instructions for common tasks. Discoverable by the SDK's skill tool; loaded on-demand. Examples planned: add-route, add-migration, run-tests. Each new pattern that emerges gets a skill so the agent gets better at this codebase over time.

Threat model¶

What we're defending against:

Threat	Mitigation
Agent edits auth/security/prompt code subtly wrong	CODEOWNERS + auto-labeler `human-only` blocks the path entirely
Runaway loop burns Anthropic credits	Per-service spend caps wired to every Sonnet callsite (see `app/services/security.py`) — circuit-breaks to offline fallback
Agent introduces regression that breaks production	212 unit tests + Docker image import smoke test + CI green required for merge
Container-only regression (passes CI, breaks in prod)	Docker import smoke test inside CI catches the PR #37 class of bugs
Recursive workflow trigger (deploy.yml fires from agent PR)	`GITHUB_TOKEN` cannot trigger downstream workflows on push (GitHub's anti-recursion guarantee)
Agent credentials leaked or compromised	Charlie's App is installation-scoped + one-click revocable; `GITHUB_TOKEN` is per-run ephemeral, no standing PAT to leak
Agent PR auto-merges before review	All agent PRs open as draft by default; merge requires explicit "ready for review" from the operator
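The runaway-loop mitigation in the table relies on a spend cap that circuit-breaks to an offline fallback. The helpers in `app/services/security.py` are not reproduced here; the sketch below only illustrates the general pattern. The `SpendCap` class, its method names, and the use of integer cents are all assumptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class SpendCap:
    """Illustrative per-service spend cap (not the repo's implementation)."""

    def __init__(self, cap_cents: int) -> None:
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def call(
        self,
        est_cost_cents: int,
        model_call: Callable[[], T],
        fallback: Callable[[], T],
    ) -> T:
        """Run model_call if budget remains, otherwise run fallback."""
        # Circuit-break before spending: skip the paid call when it
        # would push the running total over the cap.
        if self.spent_cents + est_cost_cents > self.cap_cents:
            return fallback()
        # Charge the estimate up front; a real implementation would
        # likely reconcile against actual usage afterwards.
        self.spent_cents += est_cost_cents
        return model_call()
```

With a cap of 100 cents and calls estimated at 40 cents each, the first two calls reach the model and the third falls back, so a looping agent cannot spend past the cap.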

Phase history¶

Phase 1 — Foundations ✓ shipped:

1. CODEOWNERS + labels + auto-labeler + CLAUDE.md + this doc + skills/README (PR #42)
2. Per-service spend cap wiring (PR #43, wired the helpers from PR #32)
3. Docker image import smoke test in CI (PR #44)

Phase 2 — Trial ✓ complete (2026-05-16): 3 complexity:s issues opened: #48 (Dockerfile, CODEOWNERS-protected), #49 (Dockerfile, CODEOWNERS-protected), #50 (template, unprotected).

Outcome: 2 correct refusals + 1 shipped PR (#51, 8/9 grade). After operator approval, Charlie also shipped PR #52 for #49 via a cleaner pyproject.toml workaround (9/9 grade). See "Trial outcome" section above.

Phase 3 — Scope expansion ✓ (this PR): T1 (Charlie) expanded to handle complexity:m based on the trial data. T2 SDK dispatcher deferred. Docs updated to reflect the shipped reality.

Phase 4 — Steady state. Feedback flows through the system mostly without the operator's involvement on small/medium work. Operator's time is on T3/T4: prompts, content pipeline, architectural calls. The bench becomes self-maintaining. We're entering this phase now.

Known quirks¶

Charlie has predictable rough edges. These aren't bugs and don't disqualify the autonomous loop — they're operational realities to expect and triage cheaply.

Duplicate parallel PRs on a single issue¶

Charlie often opens 2–4 PRs against the same ready-for-agent issue, each from a different head branch, each a separate attempt at the same spec. Observed during the 2026-05-16 → 2026-05-17 trial:

Issue #5 (settings reset confirmation) → PRs #69 + #72
Issue #8 (feedback widget metadata) → PRs #70 + #71
Issue #4 (slot done-tracker) → PRs #73 + #75
Issue #78 (closed before Charlie noticed) → PRs #81/#82/#83/#84
Issue #80 (voice lifecycle) → already had two attempts before #86 landed
Issue #92 (one-bullet doc edit) → PRs #93 + #94 + #95

The duplicates are usually different implementations of the same spec, not literal copies — Charlie genuinely re-attempts. Sometimes one approach is cleaner than the others, so the pick has real value; on a one-bullet doc edit, the duplication is just noise.

Operator triage: diff the candidates briefly, pick a winner, close the rest as duplicates with a one-line rationale (so the audit trail records why this approach beat that one). The 2026-05-17 trial settled into a repeatable pattern of doing this in 5–10 minutes per duplicate set. Don't try to merge them all — the diffs won't compose.
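Finding the duplicate sets is easy to script. The sketch below groups open PRs by the issue they reference and keeps only issues with more than one candidate. It assumes each PR body links its issue with a GitHub closing keyword such as "Fixes #92", which may not match how Charlie actually writes PR bodies. The `duplicate_sets` name and the input shape (dicts with `number` and `body` keys) are illustrative.

```python
import re
from collections import defaultdict
from typing import Iterable, Mapping

# GitHub closing keywords: close/closes/closed, fix/fixes/fixed,
# resolve/resolves/resolved, followed by an issue reference.
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)",
    re.IGNORECASE,
)


def duplicate_sets(prs: Iterable[Mapping]) -> dict[int, list[int]]:
    """Map issue number -> sorted PR numbers, for issues with 2+ PRs."""
    by_issue: dict[int, list[int]] = defaultdict(list)
    for pr in prs:
        body = pr.get("body") or ""  # a PR body can be empty
        # De-duplicate so a PR that mentions the issue twice counts once.
        for issue in {int(n) for n in CLOSING_REF.findall(body)}:
            by_issue[issue].append(pr["number"])
    return {
        issue: sorted(numbers)
        for issue, numbers in by_issue.items()
        if len(numbers) > 1
    }
```

Fed the open PR list (for instance from the output of `gh pr list --json number,body`), this returns one entry per duplicate set, which is the unit the operator triages.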

Mitigation worth trying: the Charlie Labs dashboard may expose a per-issue concurrency setting (max_concurrent_attempts_per_issue or similar; check the current dashboard). Reducing it to 1 would remove the noise but might also remove the occasional benefit when Charlie's first attempt is weaker than its second.

Issue-state staleness¶

Charlie does not always re-check an issue's open/closed state between starting work and opening the PR. If the operator closes an issue mid-Charlie-run, the PR still opens. PRs #81/#82/#83/#84 landed against a closed #78 because the close happened after Charlie had started.

Operator triage: close such PRs with a redirect comment pointing at the real target issue. CLAUDE.md (commit 207264d) now tells agents to re-check state, so this should diminish — but the backstop is the operator review.

Label-gate awareness gap¶

Charlie respects CODEOWNERS (it reads CLAUDE.md and skips protected paths) but didn't initially honor blocked:device or other blocked:* labels: PR #85 opened against blocked:device issue #76 anyway. CLAUDE.md was updated (same commit `207264d`) to require skipping blocked:* and human-only labels. Watch for whether this remains a gap in practice.

Forgetting cache-bust on static-asset changes (resolved)¶

Before #96 landed: every change to static/js/voice-note.js had to include a manual bump of the ?v=N query string in templates/base.html. Charlie forgot in PR #89, which meant the fix never reached the operator's browser and produced a phantom follow-up bug (issue #88 was filed against #89's pre-deploy behavior). The static_hash() Jinja helper (PR #96) automates this; the rule is no longer needed.
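The actual helper from PR #96 is not shown in this doc, so the following is only a sketch of how a content-hash cache-buster of this kind typically works. The signature, the `static_root` default, the 8-character digest length, and the `asset_url` wrapper with its `/static/` prefix are assumptions.

```python
import hashlib
from pathlib import Path


def static_hash(path: str, static_root: Path = Path("static"), length: int = 8) -> str:
    """Return a short content hash for a static asset.

    The hash changes whenever the file's bytes change, so no manual
    ?v=N bump is needed.
    """
    digest = hashlib.sha256((static_root / path).read_bytes()).hexdigest()
    return digest[:length]


def asset_url(path: str, static_root: Path = Path("static")) -> str:
    """Build a cache-busted URL (hypothetical /static/ prefix)."""
    return f"/static/{path}?v={static_hash(path, static_root)}"
```

One way to expose such a function to templates is to register it as a Jinja global (for example `env.globals["static_hash"] = static_hash`), so templates/base.html can compute the `?v=` value from the file itself instead of a hand-maintained number.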

When you (the operator) need to intervene¶

Agent PR is wrong: leave a review comment and add the agent-blocked label; the agent backs off. Fix manually or rewrite the issue spec.
Agent picked the wrong scope: change the complexity:* label; Charlie will refuse on the next loop.
Agent is looping: revoke Charlie's App temporarily (one click) from the repo's Installed GitHub Apps settings. Investigate before re-enabling.
An issue needs human attention: add human-only manually; nobody acts.
Agent surfaces a better solution than the spec (as in #49 → #52): reply on the issue with explicit approval ("proceed with X on #N"); Charlie then acts on the approved path.