Autonomous coding agents¶
How this repo uses Charlie Labs and a Claude Agent SDK dispatcher to pick up issues, ship PRs, and maintain the codebase under human review.
The shape¶
Issues land in GitHub (typed by the operator, or auto-created from the in-app feedback widget — see LIS-301). The operator triages each with a complexity label. Agents read the labels and decide whether to act.
feedback widget        ┐
manual operator typing ┤── GitHub issue ──┬── ready-for-agent + complexity:s ──→ Charlie
issue migrated from FB ┘                  ├── ready-for-agent + complexity:m ──→ Charlie
                                          ├── ready-for-agent + complexity:l ──→ human (operator pairs with Claude in chat)
                                          └── human-only (auto-applied) ──→ human (no agent)
                                                          │
                                          ┌───────────────┴───────────────┐
                                          │  Agent opens draft PR         │
                                          │  CI runs (test + docker smoke)│
                                          │  Operator reviews + merges    │
                                          │  GH Actions auto-deploys      │
                                          └───────────────────────────────┘
Three active tiers, in increasing trust required of the actor (T2 is deferred — see below):
| Tier | Who | Scope | Identity |
|---|---|---|---|
| T1 | Charlie Labs | complexity:s and complexity:m (≤3 files, ≤200 lines). Lint, dep upgrades, bug fixes, scoped features with tests, small refactors. Expanded from complexity:s-only after a 2026-05-16 trial. | GitHub App: charlie[bot] |
| T2 | Claude Agent SDK dispatcher | Deferred indefinitely after Charlie's trial. See "Trial outcome" below. | GH Actions: github-actions[bot] + Co-Authored-By: Claude trailer |
| T3 | Operator + Claude in chat | Open-ended design or execution work. Architectural decisions stay here. | Operator commits, Claude as co-author |
| T4 | Human only | CODEOWNERS-protected paths, anything labeled human-only | Operator |
Trial outcome (2026-05-16)¶
Five issues across two batches went through Charlie's coding-agent tier. The table also lists #57, a Claude-authored docs PR that Charlie reviewed unprompted. Summary table first; per-issue notes below.
| Issue | Type | Outcome | Grade | Iteration? |
|---|---|---|---|---|
| #48 | complexity:s, Dockerfile-touching | Refused (correct) | n/a | No |
| #49 | complexity:s, Dockerfile-touching | Refused → triage → workaround → PR #52 | 9/9 | Operator-approved alt path |
| #50 | complexity:s, template change | Shipped PR #51 first-pass | 8/9 | No |
| #55 | complexity:m, admin feature | Shipped PR #56 | 7/9 → 8/9 | Description fix on operator comment |
| #57 | docs PR (Claude-authored) | Charlie autonomously reviewed: "no actionable feedback" | LGTM | No |
| #58 | complexity:m, investigation | Shipped PR #59 | 7/9 → 9/9 | Analytical follow-through on operator comment |
Per-issue detail¶
- #48 venv PATH (Dockerfile): refused, citing CLAUDE.md, .github/CODEOWNERS, and the labeler guardrail. Correct refusal.
- #49 CPU-only torch (Dockerfile): refused, but then triaged the spec for technical correctness, identified a non-Dockerfile workaround (a pyproject.toml direct-URL wheel pin), and asked for explicit operator approval before acting. After approval, shipped PR #52 with a 9/9 grade. The workaround was cleaner than the operator's original Dockerfile spec: Charlie surfaced a better solution rather than executing the lower-quality one literally.
- #50 access-denied template: shipped PR #51 at 8/9. The missed point was the lack of a "Test plan" section in the PR body, which is acceptable given the issue's own "no tests required" acceptance criteria.
- #55 admin access-requests page: shipped PR #56 at 7/9 initially. The description's first Summary bullet claimed credit for routes that already existed from PR #39. The operator left a soft-feedback comment; Charlie corrected the description cleanly (single comment, no extra code push), raising the grade to 8/9. Proved Charlie can take iterative review feedback.
- #57 docs PR (CLAUDE.md amendments): first observation of Charlie's autonomous review behavior. Charlie reviewed a Claude-authored PR without being asked and left a single "no actionable feedback" comment. Useful signal: a low-noise second pair of eyes on every PR.
- #58 slot done-tracker investigation: investigation-shaped (not spec-shaped). Charlie wrote a failing test and proposed a fix in compute_next (not _slot_done; the latter would have been broader than scoped). The initial PR was 7/9 because the description didn't include investigation findings. The operator commented asking Charlie to verify whether the original user-reported symptom (strip-tick logic) was independently correct. Charlie responded with a full analytical trail: traced through app/main.py:747, ran a targeted repro with three test cases, and surfaced an unrelated caveat ("based on marked completed today, not opened") as a potential follow-up. Final grade 9/9. Proved Charlie can do analytical investigation, not just spec execution.
Verdict¶
Clear pass across three different work shapes:
1. Spec-shaped (#50, #55): translate acceptance criteria into a working PR.
2. Operator-approved alternative path (#49 → #52): Charlie surfaces a better solution and gets it sanctioned.
3. Investigation-shaped (#58): diagnose root cause, write a failing test, fix in the right layer.
The CODEOWNERS / CLAUDE.md boundary is enforced at the document-reading layer (Charlie reads these files at session start), not just by the post-PR auto-labeler. Stronger guarantee than originally designed for.
T2 SDK dispatcher was originally scoped as insurance: if Charlie disappointed on complexity:m work, the SDK dispatcher gave us an in-house path. Charlie didn't disappoint. Building T2 now would be carrying weight for no current benefit. We retain the option to ship it if (a) Charlie regresses in quality, or (b) a future need arises for an agent path with no vendor dependency.
Iteration patterns that work¶
Three soft-feedback patterns observed working cleanly on Charlie's PRs:
1. Description correction: a comment pointing at a specific bullet that's wrong → Charlie edits the body, no code push. (PR #56)
2. Analytical follow-through: a comment asking for verification of a related-but-untested concern → Charlie traces through code, runs targeted repros, posts findings as a comment. (PR #59)
3. Better-solution negotiation: Charlie refuses CODEOWNERS-protected work and proposes an alternative path → operator approves with a "proceed with X on #N" reply → Charlie executes on the approved path. (#49 → PR #52)
Patterns 1 and 2 are documented in CLAUDE.md so future Charlie runs know what shape of comment maps to what shape of action.
Identity model — no second GitHub account¶
Two design constraints made this clean:
- Charlie installs as a GitHub App (charlie[bot]). It's a service identity by construction; no user, no PAT, no 2FA setup. One-click install and one-click revoke from the repo's Installed GitHub Apps settings.
- T2 (deferred) would run in GitHub Actions, where the auto-issued GITHUB_TOKEN is per-workflow-run, scoped via the workflow's permissions: block, and cannot trigger downstream workflows on push (anti-recursion). The actor on commits and PRs is github-actions[bot], with Co-Authored-By: Claude <noreply@anthropic.com> on the commit message so the LLM side is visible in git log.
The previous "create a lis-bench-bot GitHub account with a fine-grained
PAT" approach was rejected during design because it required managing a
second identity and standing-credential rotation.
Routing by label¶
A few label combinations are meaningful:
- ready-for-agent + (complexity:s OR complexity:m) → Charlie may act.
- human-only → nobody but the operator. Auto-applied by .github/workflows/labeler.yml to PRs touching CODEOWNERS paths.
- agent-blocked → applied after repeated CI failures by an agent. Requires operator unblock before the agent retries.
- vendor:charlie / vendor:claude → pin a specific tier. vendor:claude currently has no consumer (T2 deferred); kept in the label set so we can route to a future dispatcher without renaming.
- agent:run → reserved for a future SDK-dispatcher trigger; no consumer today.
Label definitions in .github/labels.yml;
synced to GitHub by .github/workflows/sync-labels.yml.
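The routing rules above can be read as a small pure function. The following is a minimal sketch, not code from this repo: the function name `route`, the return strings, and the precedence order (human-only and agent-blocked checked before ready-for-agent) are all assumptions made for illustration.

```python
from typing import Iterable

# Complexity labels that qualify for Charlie when paired with ready-for-agent.
CHARLIE_COMPLEXITY = {"complexity:s", "complexity:m"}


def route(labels: Iterable[str]) -> str:
    """Return "charlie" if Charlie may act on the issue, else "human".

    Hypothetical helper; the precedence order is an assumption.
    """
    label_set = set(labels)

    # human-only wins over everything: nobody but the operator acts.
    if "human-only" in label_set:
        return "human"

    # agent-blocked needs an operator unblock before any agent retries.
    if "agent-blocked" in label_set:
        return "human"

    # Charlie needs ready-for-agent plus a small or medium complexity label.
    if "ready-for-agent" in label_set and label_set & CHARLIE_COMPLEXITY:
        return "charlie"

    # complexity:l, or anything not marked ready-for-agent, stays with a human.
    return "human"
```

For example, `route(["ready-for-agent", "complexity:m"])` routes to Charlie, while adding `human-only` to the same label set routes to the operator.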
T2 — Claude SDK dispatcher pipeline (deferred design)¶
This section is kept as the design we'd ship if Charlie became inadequate or we needed a self-hosted alternative. It is not built. See "Trial outcome" above for why.
The shape was: when an issue gets agent:run, a workflow fires three
subagents in sequence. Splitting reduces cost (Haiku for plan + review),
reduces risk (the guardrail step sits between the plan and any code
change), and makes the work explainable (the plan is visible in CI
logs before any commit lands).
issue labeled `agent:run`
        ↓
PLAN subagent ── Haiku, read-only
    inputs: CLAUDE.md, relevant skills, issue body
    output: structured plan: { file list, approach, tests, est. LoC }
        ↓
GUARDRAIL CHECK ── pure Python, no LLM
    - any plan file in CODEOWNERS? → abort, comment "human-only path"
    - file count > 3 or estimated LoC > 200? → abort, recommend label upgrade
    - plan references files that don't exist? → abort, comment
        ↓
BUILD subagent ── Sonnet, write+edit+bash, scoped to a git worktree
    inputs: plan, CLAUDE.md, skills
    output: diff applied + pytest tests/unit/ run + green
    on red: comment with the test output, no PR opened
        ↓
REVIEW subagent ── Haiku, read-only
    inputs: diff, plan, test output
    output: PR title, PR body (why + test plan section)
        ↓
OPEN draft PR via gh CLI, identity = github-actions[bot]
Failure modes are explicit: the dispatcher comments on the issue and
exits without opening a PR if the guardrail rejects the plan, or if
pytest fails after the build step. The operator then either rewrites
the issue spec or escalates the complexity label.
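Since the guardrail step is described as pure Python with no LLM, its three checks can be sketched directly. This is an illustration of the deferred design, not shipped code: the `Plan` dataclass, the `check_plan` name, and the injected `file_exists` callable are assumptions, and `fnmatch` is a simplification of real CODEOWNERS pattern matching, which follows gitignore-style rules.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Optional, Sequence

# Limits taken from the guardrail description above.
MAX_FILES = 3
MAX_LOC = 200


@dataclass
class Plan:
    """Hypothetical shape of the PLAN subagent's structured output."""
    files: Sequence[str]
    estimated_loc: int


def check_plan(
    plan: Plan,
    protected_patterns: Sequence[str],
    file_exists: Callable[[str], bool],
) -> Optional[str]:
    """Return an abort reason, or None if the plan may proceed to BUILD."""
    # 1. Any plan file on a protected path aborts immediately.
    for path in plan.files:
        if any(fnmatch(path, pattern) for pattern in protected_patterns):
            return f"human-only path: {path}"

    # 2. Size limits: too many files or too many estimated lines.
    if len(plan.files) > MAX_FILES or plan.estimated_loc > MAX_LOC:
        return "plan exceeds size limits; recommend a complexity label upgrade"

    # 3. Every file the plan references must exist.
    missing = [p for p in plan.files if not file_exists(p)]
    if missing:
        return f"plan references missing files: {', '.join(missing)}"

    return None
```

Passing `file_exists` in as a callable keeps the check free of filesystem access, so it can be unit-tested without a checkout; a dispatcher could pass `os.path.exists` bound to the worktree.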
Agent knowledge lives in the repo¶
Two artifacts give agents durable context:
- CLAUDE.md at repo root: the agent's reading list. Kept short on purpose (per the Claude Code agent guide). Stack, key commands, CODEOWNERS list, conventions, and the subtle things that bite. Both Charlie and the SDK dispatcher read this at session start.
- .claude/skills/<name>/SKILL.md: versioned, reusable instructions for common tasks. Discoverable by the SDK's skill tool; loaded on-demand. Examples planned: add-route, add-migration, run-tests. Each new pattern that emerges gets a skill so the agent gets better at this codebase over time.
Threat model¶
What we're defending against:
| Threat | Mitigation |
|---|---|
| Agent edits auth/security/prompt code subtly wrong | CODEOWNERS + auto-labeler human-only blocks the path entirely |
| Runaway loop burns Anthropic credits | Per-service spend caps wired to every Sonnet callsite (see app/services/security.py) — circuit-breaks to offline fallback |
| Agent introduces regression that breaks production | 212 unit tests + Docker image import smoke test + CI green required for merge |
| Container-only regression (passes CI, breaks in prod) | Docker import smoke test inside CI catches the PR #37 class of bugs |
| Recursive workflow trigger (deploy.yml fires from agent PR) | GITHUB_TOKEN cannot trigger downstream workflows on push (GitHub's anti-recursion guarantee) |
| Agent credentials leaked or compromised | Charlie's App is installation-scoped + one-click revocable; GITHUB_TOKEN is per-run ephemeral, no standing PAT to leak |
| Agent PR auto-merges before review | All agent PRs open as draft by default; merge requires explicit "ready for review" from the operator |
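The spend-cap row describes a circuit breaker: once a service's spend reaches its cap, calls stop going to the paid model and use an offline fallback instead. The sketch below shows that idea in isolation. It is not the API of app/services/security.py; the `SpendCap` class, its method names, and the convention that the paid call returns its own cost are all assumptions.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class SpendCap:
    """Tracks spend for one service and trips once the cap is reached.

    Hypothetical illustration; not the repo's actual helper.
    """

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    @property
    def tripped(self) -> bool:
        # The breaker is open once recorded spend reaches the cap.
        return self.spent_usd >= self.cap_usd

    def call(
        self,
        paid: Callable[[], Tuple[T, float]],
        fallback: Callable[[], T],
    ) -> T:
        """Run the paid call unless the cap is tripped; else use the fallback.

        `paid` returns (result, cost_usd) so the spend can be recorded.
        """
        if self.tripped:
            return fallback()
        result, cost_usd = paid()
        self.spent_usd += cost_usd
        return result
```

With a cap of 1.0 and calls costing 0.5 each, the first two calls go to the paid path and every later call gets the fallback, which bounds what a runaway loop can spend.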
Phase history¶
Phase 1 — Foundations ✓ shipped:
1. CODEOWNERS + labels + auto-labeler + CLAUDE.md + this doc + skills/README (PR #42)
2. Per-service spend cap wiring (PR #43, wired the helpers from PR #32)
3. Docker image import smoke test in CI (PR #44)
Phase 2 — Trial ✓ complete (2026-05-16):
3 complexity:s issues opened: #48 (Dockerfile, CODEOWNERS-protected), #49
(Dockerfile, CODEOWNERS-protected), #50 (template, unprotected).
Outcome: 2 correct refusals + 1 shipped PR (#51, 8/9 grade). After
operator approval, Charlie also shipped PR #52 for #49 via a cleaner
pyproject.toml workaround (9/9 grade). See "Trial outcome" section
above.
Phase 3 — Scope expansion ✓ (this PR):
T1 (Charlie) expanded to handle complexity:m based on the trial
data. T2 SDK dispatcher deferred. Docs updated to reflect the
shipped reality.
Phase 4 — Steady state. Feedback flows through the system mostly without the operator's involvement on small/medium work. Operator's time is on T3/T4: prompts, content pipeline, architectural calls. The bench becomes self-maintaining. We're entering this phase now.
Known quirks¶
Charlie has predictable rough edges. These aren't bugs and don't disqualify the autonomous loop — they're operational realities to expect and triage cheaply.
Duplicate parallel PRs on a single issue¶
Charlie often opens 2–4 PRs against the same ready-for-agent issue,
each from a different head branch, each a separate attempt at the
same spec. Observed during the 2026-05-16 → 2026-05-17 trial:
- Issue #5 (settings reset confirmation) → PRs #69 + #72
- Issue #8 (feedback widget metadata) → PRs #70 + #71
- Issue #4 (slot done-tracker) → PRs #73 + #75
- Issue #78 (closed before Charlie noticed) → PRs #81/#82/#83/#84
- Issue #80 (voice lifecycle) → already had two attempts before #86 landed
- Issue #92 (one-bullet doc edit) → PRs #93 + #94 + #95
The duplicates are usually different implementations of the same spec, not literal copies — Charlie genuinely re-attempts. Sometimes one approach is cleaner than the others, so the pick has real value; on a one-bullet doc edit, the duplication is just noise.
Operator triage: diff the candidates briefly, pick a winner, close the rest as duplicates with a one-line rationale (so the audit trail records why this approach beat that one). The 2026-05-17 trial settled into a repeatable pattern of doing this in 5–10 minutes per duplicate set. Don't try to merge them all — the diffs won't compose.
Mitigation worth trying: the Charlie Labs dashboard has settings
for max_concurrent_attempts_per_issue (or similar — check the
current dashboard). Reducing to 1 would remove the noise but might
also remove the occasional benefit when Charlie's first attempt is
weaker than its second.
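Finding the duplicate sets to triage is a simple grouping problem. A minimal sketch, assuming the operator already has (PR number, issue number) pairs from some source such as the PR list; the `duplicate_sets` name and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def duplicate_sets(prs: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group (pr_number, issue_number) pairs; keep issues with 2+ PRs."""
    by_issue: Dict[int, List[int]] = defaultdict(list)
    for pr_number, issue_number in prs:
        by_issue[issue_number].append(pr_number)

    # Only issues with more than one PR need a pick-a-winner pass.
    return {
        issue: sorted(numbers)
        for issue, numbers in by_issue.items()
        if len(numbers) > 1
    }
```

Fed the pairs for PRs #69 and #72 (issue #5) and #70 and #71 (issue #8) from the list above, this returns one entry per issue, each holding the PR numbers to diff against each other.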
Issue-state staleness¶
Charlie does not always re-check an issue's open/closed state between starting work and opening the PR. If the operator closes an issue mid-Charlie-run, the PR still opens. PRs #81/#82/#83/#84 landed against a closed #78 because the close happened after Charlie had started.
Operator triage: close such PRs with a redirect comment pointing
at the real target issue. CLAUDE.md (commit 207264d) now tells
agents to re-check state, so this should diminish — but the
backstop is the operator review.
Label-gate awareness gap¶
Charlie respects CODEOWNERS (it reads CLAUDE.md and skips protected
paths) but didn't initially honor blocked:device or other
blocked:* labels: PR #85 opened against blocked:device issue #76
anyway. CLAUDE.md was updated (same commit 207264d) to
require skipping blocked:* and human-only labels. Watch for
whether this remains a gap in practice.
Forgetting cache-bust on static-asset changes (resolved)¶
Before #96 landed: every change to static/js/voice-note.js had to
include a manual bump of the ?v=N query string in
templates/base.html. Charlie forgot in PR #89, which meant the
fix never reached the operator's browser and produced a phantom
follow-up bug (issue #88 was filed against #89's pre-deploy
behavior). The static_hash() Jinja helper (PR #96) automates
this; the rule is no longer needed.
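The idea behind a content-hash cache-buster is that the version string is derived from the file's bytes, so it changes exactly when the file does. The sketch below shows one common way to do that. It is not the implementation from PR #96: the signature, the `static_root` default, and the 8-character SHA-256 prefix are assumptions.

```python
import hashlib
from pathlib import Path


def static_hash(path: str, static_root: str = "static", length: int = 8) -> str:
    """Return a short content hash for a file under static_root.

    Hypothetical sketch. The hash changes whenever the file's bytes change,
    so a URL ending in ?v=<hash> is re-fetched by the browser after an edit.
    """
    data = (Path(static_root) / path).read_bytes()
    return hashlib.sha256(data).hexdigest()[:length]
```

With Jinja2, a function like this could be exposed to templates through the environment's `globals` mapping, so that templates/base.html computes the `?v=` value itself instead of relying on a manual bump.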
When you (the operator) need to intervene¶
- Agent PR is wrong: leave a review comment and apply the agent-blocked label; the agent backs off. Fix manually or rewrite the issue spec.
- Agent picked the wrong scope: change the complexity:* label; Charlie will refuse on the next loop.
- Agent is looping: revoke Charlie's App temporarily (one click) from the repo's Installed GitHub Apps settings. Investigate before re-enabling.
- An issue needs human attention: add human-only manually; nobody acts.
- Agent surfaces a better solution than the spec (as in #49 → #52): reply on the issue with explicit approval ("proceed with X on #N"); Charlie then acts on the approved path.
See also¶
- .github/CODEOWNERS
- .github/labels.yml
- CLAUDE.md
- .claude/skills/
- Deploy guide (the auto-deploy that fires from agent PRs)
- Common dev tasks