Roadmap and decisions

What's been built, what's queued, what's deliberately deferred.

Phases

The platform's been built in phases tracked under LIS-281.

Phase	State	What it included
1: Supabase + Postgres schema	✅	20-table consolidated migration, UUID FKs to auth.users, JSONB everywhere
2: Feedback import	✅	Historical SQLite rows migrated; live widget persists to Postgres
3: GitHub OAuth (PKCE)	✅	Server-managed code_verifier cookie, auth-gate middleware, /auth/me
4: Postgres data layer	✅	13 repos, ~50 callsites refactored, UUID throughout
5: Test-user separation	✅	Real test users via Supabase Admin API; 14 integration tests
6: Onboarding polish	✅	Skip path, easy-first ramp, surface ramp indicator
7: Deploy bench.libearden.dev	📋 deferred	Singapore region + in-process cache delivered enough perf to keep local

Features shipped (recent)

Daily progress strip + Up-next hero (LIS-274 chain)
Hash-based startup skip — 70s → 1s restart when content unchanged
In-process cache with read-through + invalidate-on-write
External reading capture (/library/add with Claude-powered URL parse) — LIS-294
JD + CV IAC profile + alignment view — LIS-295
Run_tests resubmit guard — LIS-291
Reflect skeleton loader — LIS-290
Readiness timeline evaluation — LIS-296
Mentor voices registry — LIS-297
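
Two of the items above, the hash-based startup skip and the in-process cache, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the bench's actual code: `content_fingerprint`, `needs_reindex`, `ReadThroughCache`, the stamp file, and the `load` / `store` callables are all hypothetical names introduced here.

```python
import hashlib
from pathlib import Path
from typing import Any, Callable, Dict, Hashable


def content_fingerprint(root: Path) -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_reindex(root: Path, stamp_file: Path) -> bool:
    """Return True when content changed since the last recorded fingerprint.

    Assumption: stamp_file lives outside root, so writing it does not
    change the fingerprint. A real version would probably write the stamp
    only after the expensive startup work has succeeded.
    """
    current = content_fingerprint(root)
    if stamp_file.exists() and stamp_file.read_text() == current:
        return False  # unchanged: skip the expensive startup work
    stamp_file.write_text(current)
    return True


class ReadThroughCache:
    """Read-through on get, invalidate-on-write on put (hypothetical shape)."""

    def __init__(
        self,
        load: Callable[[Hashable], Any],
        store: Callable[[Hashable, Any], None],
    ) -> None:
        self._load = load
        self._store = store
        self._data: Dict[Hashable, Any] = {}

    def get(self, key: Hashable) -> Any:
        # Miss: read through to the backing store and remember the result.
        if key not in self._data:
            self._data[key] = self._load(key)
        return self._data[key]

    def put(self, key: Hashable, value: Any) -> None:
        # Write to the backing store first, then drop the cached entry so
        # the next get() reloads the fresh value.
        self._store(key, value)
        self._data.pop(key, None)
```

Invalidating on write (rather than updating the cached value in place) keeps the backing store as the single source of truth, at the cost of one extra load after each write.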

Queued

Issue	Scope	Priority
LIS-285	Slot done-tracker over-marks completion (audit-trail bug)	Med — needs repro data
LIS-284	Unify tag taxonomy between problems and content_items	Med — Library discovery
LIS-292	`/library/add` submit failure investigation	Low — possibly browser-side
Reflect-voice routing	Route hamming.py through the voice registry	Low
Reflect-IAC integration	Pull JD/CV IAC into reflect prompt context	Low
Multi-JD readiness aggregate	"What to focus on this week given all your JDs"	Low
Acceleration mode	Research-replication content + extra daily slots	Big
Calibrated hours-per-axis-point	Replace heuristic with attempt-derived value	Big, needs ≥20 attempts/user

Security posture

Concern	Status
GitHub OAuth via Supabase (PKCE, ES256 JWT)	✅ shipped
CSP headers + form-action lockdown	✅ shipped
Access allowlist (`LIS_ALLOWED_USER_IDS`)	✅ shipped — single-tenant gate
Bearer-token MCP server auth	✅ shipped (PR #243 / #246)
Row-Level Security on every table	✅ shipped (migration 010 / PR #254-ish). Defense-in-depth against direct PostgREST or anon-key paths; the bench's own `postgres`-role connection bypasses RLS
OAuth 2.1 shim for claude.ai web	parked (issue #244)
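
The access-allowlist row above can be illustrated with a small fail-closed check. This is a sketch only: the table names the `LIS_ALLOWED_USER_IDS` variable but not its format, so the comma-separated encoding, the case-insensitive comparison, and the helper names `load_allowlist` and `is_allowed` are all assumptions.

```python
import os
from typing import FrozenSet, Mapping, Optional


def load_allowlist(env: Mapping[str, str] = os.environ) -> FrozenSet[str]:
    """Parse LIS_ALLOWED_USER_IDS (assumed comma-separated) into a set."""
    raw = env.get("LIS_ALLOWED_USER_IDS", "")
    return frozenset(
        part.strip().lower() for part in raw.split(",") if part.strip()
    )


def is_allowed(user_id: Optional[str], allowlist: FrozenSet[str]) -> bool:
    """Fail closed: a missing user or an empty allowlist denies access."""
    if not user_id:
        return False
    return user_id.strip().lower() in allowlist
```

Treating an unset variable as "nobody is allowed" is a design choice in this sketch; it suits a single-tenant gate, where a misconfiguration should lock the door rather than open it.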

Parking lot: explicit deferrals

Ideas captured here so they don't sit in the operator's head. Each entry has a revisit trigger — the condition that should pull it back onto the active queue.

Expand `/writings` into a long-form drafting pipeline

The current /writings route is a draft list. The expansion adds a kanban-style pipeline (idea → outline → draft → ready → published), a split-pane Markdown editor with live GFM preview, clean export for Substack / LW / AF / personal, and audit-trailed Sonnet structure/clarity passes that never write substance.

Full design in docs/proposals/writing-pipeline.md; spec lives at GH #258. Operator logged it Low / phased on purpose.

Why parked. Phase 1 alone is a 1500-2000 LOC build with a real UI surface (the kanban board). That's 1-2 weeks competing directly with the capstone arXiv deadline (LIS-186, 2026-07-15). The writing habit itself can start now in any Markdown editor — the bench's pipeline is an accelerant for a habit, not a prerequisite to it.

Revisit when. Capstone is off the critical path AND the operator notices that "where's my draft, what stage was it in, how do I export it" is actually slowing the weekly writing cadence.

Hard constraint: the Sonnet guardrail (never writes substance) is non-negotiable. Build it into the API surface (no "generate draft" endpoint exists), the prompt (clarity-pass refuses to fill [TODO]s), and the UI (sidebar suggestions, never inline edits, explicit Apply/Discard).

`/connecting`: networking tracker with warmth monitor

Bi-directional networking surface. Tracks outreach + replies per contact, classifies contacts by type (peer / mentor / field / initiative), and surfaces a "warmth" signal so connections don't go cold even when the operator isn't watching every one of them. Default view is an action queue (only cooling/cold/new, most-overdue-first); anti-obsession is a hard requirement.
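
The action-queue behaviour described above (only cooling/cold/new, most-overdue-first) might look roughly like the following. The actual design lives in the proposal, so everything here is a hypothetical stand-in: the `Contact` fields, the per-contact `cadence_days`, the warmth thresholds, and the choice to treat new contacts as zero days overdue.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Contact:
    name: str
    last_touch: Optional[date]  # None means no outreach or reply yet
    cadence_days: int = 30      # hypothetical target gap between touches


def overdue_days(contact: Contact, today: date) -> int:
    """Days past the target cadence; new contacts count as 0 (assumption)."""
    if contact.last_touch is None:
        return 0
    return (today - contact.last_touch).days - contact.cadence_days


def warmth(contact: Contact, today: date) -> str:
    """Classify as new / warm / cooling / cold using assumed thresholds."""
    if contact.last_touch is None:
        return "new"
    overdue = overdue_days(contact, today)
    if overdue <= 0:
        return "warm"
    if overdue <= contact.cadence_days:
        return "cooling"
    return "cold"


def action_queue(contacts: List[Contact], today: date) -> List[Contact]:
    """Hide warm contacts; list the rest most-overdue-first."""
    needs_action = [c for c in contacts if warmth(c, today) != "warm"]
    return sorted(
        needs_action, key=lambda c: overdue_days(c, today), reverse=True
    )
```

Hiding warm contacts entirely is what makes the default view an action queue rather than a dashboard: there is nothing to look at unless something needs doing.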

Full design in docs/proposals/connecting-route.md; spec lives at GH #256. Operator logged it Low on purpose.

Why parked. Capstone arXiv submission deadline (LIS-186) is 2026-07-15. Building /connecting before that trades the capstone for a tool, and the tool itself is meant to be low-cognitive-overhead, not a build project that consumes cognition.

Revisit when. Capstone is off the critical path (post-arXiv, after 2026-07-15). Pick up MVP scope (per the proposal's recommendation, without Linear mirror initially) when there's spare bandwidth.

`/lab`: eval-design playground

A /lab route where the operator specs a small alignment-shaped eval (e.g., "does Sonnet detect deceptive Llama-2-7B outputs given prompt template X?"), the bench runs it through the existing Anthropic API integration (plus optionally a small open-source model via Modal), and produces a notebook entry — hypothesis / method / raw results / interpretation. Each notebook is operator-flagged public-or-private; public ones become shareable portfolio artifacts at /lab/notebook/{slug}.

Why interesting. Designing small evals is the JD task for METR / Apollo / Anthropic Interp / Redwood. A /lab notebook that records hypothesis → method → results → takeaway is a directly shareable artifact of that skill. Three or four good notebooks IS an application portfolio.

Why parked. Bench-building is at risk of swallowing the time meant for interview prep itself. The current make-it-legible wave (BEN-74 through BEN-77, plus the BEN-79 → BEN-82 focus/journal/MCP stack) is finishing the platform as a portfolio artifact. Adding /lab re-opens it as a project for another 3–4 weeks. Distinguishing parked from "no" matters: this idea is the right shape, just not the right week.

Revisit when. Either (a) the operator finds themselves wanting to design an eval and reaches for the wrong tool, OR (b) the make-it-legible wave fully lands and the operator wants to commit a multi-week build before applications open. Revisit no later than 2 weeks from filing.

Rejected variants (not parked). "Lab as Jupyter-in-the-bench" (notebook hosting) and "lab as content-authoring playground" were both rejected on first pass. The eval-design framing is the only one worth doing.

"Open in Cursor" buttons

Skipped. Reviewers don't have Cursor installed; the operator is already in Cursor. Revisit only if the public-corpus repo (#196) ships and a public-facing "try it" flow becomes desirable — in which case the right answer is GitHub Codespaces ("Open in Codespaces" button), not Cursor.

Email-forward inbound for news / readings

A real inbox (content@bench.libearden.dev via Resend/Postmark inbound webhook) that turns forwarded newsletter emails into draft content/readings/ items. Defer until the RSS news-feed engine (Phase 1 of the content-generation engine, separate sub-issue) proves the news feed is actually used.

Revisit when. The RSS news feed has shipped and the operator has manually created content/readings/*.md files from emails at least 3 times.

Design decisions worth knowing

Anti-sycophancy is structural, not stylistic. Reviews and reflect prompts are required to identify specific gaps in JSON. The system prompts call out generic "keep up the great work" as the explicit failure mode.
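
One way to enforce "structural, not stylistic" is to reject any review whose JSON names no concrete gap. The sketch below assumes a response shape of `{"summary": ..., "gaps": [{"area": ..., "evidence": ...}]}`; that schema, the `GENERIC_PRAISE` phrase list, and the `review_problems` helper are illustrative assumptions, not the bench's real contract.

```python
import json
from typing import Any, List

# Hypothetical phrase list; the real prompts name the failure mode in prose.
GENERIC_PRAISE = ("keep up the great work", "great job", "nice work")


def _has_text(value: Any) -> bool:
    return isinstance(value, str) and bool(value.strip())


def review_problems(raw: str) -> List[str]:
    """Return reasons to reject a review; an empty list means it passes."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError:
        return ["review is not valid JSON"]
    if not isinstance(review, dict):
        return ["review must be a JSON object"]

    problems: List[str] = []
    gaps = review.get("gaps")
    if not isinstance(gaps, list) or not gaps:
        problems.append("review names no gaps")
    else:
        for index, gap in enumerate(gaps):
            specific = (
                isinstance(gap, dict)
                and _has_text(gap.get("area"))
                and _has_text(gap.get("evidence"))
            )
            if not specific:
                problems.append(f"gap {index} lacks an area or evidence")

    summary = review.get("summary")
    lowered = summary.lower() if isinstance(summary, str) else ""
    if any(phrase in lowered for phrase in GENERIC_PRAISE):
        problems.append("summary contains generic praise")
    return problems
```

A caller could retry the model or fall back to an offline review whenever the returned list is non-empty.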

Single-tab UX. The Today page has ONE next thing. The platform refuses to fan out into 12 "what should I do" widgets. If you want to break out, the library is one click away.

Hints cap at structural scaffolding. Tier 3 is the end of the ladder. After that, you look at the reference solution and trace why — and that lookup is logged, affecting scheduler scoring.
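
The capped ladder can be pictured as a small per-attempt record. This is a sketch under assumptions: the `AttemptHints` class and its methods are hypothetical, the `hints_used` list borrows its name from the stored field mentioned in this document, and how the scheduler weighs a logged solution lookup is not modeled.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_HINT_TIER = 3  # tier 3 (structural scaffolding) is the end of the ladder


@dataclass
class AttemptHints:
    hints_used: List[int] = field(default_factory=list)
    solution_viewed: bool = False  # logged so the scheduler can see it

    def next_hint_tier(self) -> Optional[int]:
        """Hand out tiers 1..3 in order; return None once the ladder ends."""
        tier = len(self.hints_used) + 1
        if tier > MAX_HINT_TIER:
            return None
        self.hints_used.append(tier)
        return tier

    def view_reference_solution(self) -> None:
        """Record the lookup; scoring effects live elsewhere (not shown)."""
        self.solution_viewed = True
```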

JSONB over per-field columns. Settings, hints_used, axis_weights, parsed_json, and iac_profile are all JSONB. The rationale is pragmatic: the shapes are small and user-controlled, and we can add an index when it matters.

No SPA. Server-rendered Jinja2 + HTMX for the small handful of partial swaps. Faster to develop, faster to debug, lower client-side complexity.

LLM-or-fallback, never LLM-only. Every Claude-flavored service has an offline path. Routes never branch on ANTHROPIC_API_KEY — that's the service's responsibility.
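
The LLM-or-fallback rule can be sketched as a service that owns the key check and the offline path, so the route stays oblivious to both. This is a minimal illustration, not the bench's code: `SummaryService`, `offline_summary`, `summary_route`, and the injected `make_llm_call` factory are hypothetical names, and no real Anthropic SDK call is shown.

```python
import os
from typing import Callable, Mapping, Optional

LlmCall = Callable[[str], str]


def offline_summary(text: str, max_words: int = 12) -> str:
    """Deterministic fallback: keep the first few words."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + " ..."


class SummaryService:
    def __init__(self, llm_call: Optional[LlmCall] = None) -> None:
        self._llm_call = llm_call

    @classmethod
    def from_env(
        cls,
        make_llm_call: Callable[[str], LlmCall],
        env: Mapping[str, str] = os.environ,
    ) -> "SummaryService":
        # The only place that looks at ANTHROPIC_API_KEY is the service.
        api_key = env.get("ANTHROPIC_API_KEY")
        return cls(make_llm_call(api_key) if api_key else None)

    def summarize(self, text: str) -> str:
        if self._llm_call is not None:
            try:
                return self._llm_call(text)
            except Exception:
                pass  # any LLM failure degrades to the offline path
        return offline_summary(text)


def summary_route(service: SummaryService, text: str) -> dict:
    # The route never checks for a key; it just asks the service.
    return {"summary": service.summarize(text)}
```

Injecting the LLM call as a plain callable also makes the offline path trivial to exercise in tests: construct the service with no callable, or with one that raises.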