Roadmap and decisions¶
What's been built, what's queued, what's deliberately deferred.
Phases¶
The platform's been built in phases tracked under LIS-281.
| Phase | State | What it included |
|---|---|---|
| 1: Supabase + Postgres schema | ✅ | 20-table consolidated migration, UUID FKs to auth.users, JSONB everywhere |
| 2: Feedback import | ✅ | Historical SQLite rows migrated; live widget persists to Postgres |
| 3: GitHub OAuth (PKCE) | ✅ | Server-managed code_verifier cookie, auth-gate middleware, /auth/me |
| 4: Postgres data layer | ✅ | 13 repos, ~50 callsites refactored, UUID throughout |
| 5: Test-user separation | ✅ | Real test users via Supabase Admin API; 14 integration tests |
| 6: Onboarding polish | ✅ | Skip path, easy-first ramp, surface ramp indicator |
| 7: Deploy bench.libearden.dev | 📋 deferred | Singapore region + in-process cache delivered enough perf to keep local |
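Phase 3's PKCE flow depends on a verifier/challenge pair. The sketch below shows the standard S256 derivation from RFC 7636 using only the standard library. The function names are illustrative, not the bench's actual helpers, and how the verifier is stored in the server-managed cookie is left out.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Return a random URL-safe verifier (43 characters for 32 random bytes)."""
    raw = secrets.token_bytes(n_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """S256 challenge: BASE64URL(SHA256(verifier)) with padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

The server keeps the verifier, sends only the challenge with the authorization request, and presents the verifier again at token exchange so the provider can recompute and compare the hash.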
Features shipped (recent)¶
- Daily progress strip + Up-next hero (LIS-274 chain)
- Hash-based startup skip — 70s → 1s restart when content unchanged
- In-process cache with read-through + invalidate-on-write
- External reading capture (/library/add with Claude-powered URL parse) — LIS-294
- JD + CV IAC profile + alignment view — LIS-295
- Run_tests resubmit guard — LIS-291
- Reflect skeleton loader — LIS-290
- Readiness timeline evaluation — LIS-296
- Mentor voices registry — LIS-297
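Two of the items above, the hash-based startup skip and the in-process cache, follow well-known patterns. The sketch below is one way they could look, not the bench's actual code: `content_fingerprint`, `should_skip_startup`, `ReadThroughCache`, and the stamp-file approach are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Generic, Hashable, TypeVar

V = TypeVar("V")


def content_fingerprint(root: Path) -> str:
    """Hash every file's relative path and bytes under root, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(path.relative_to(root).as_posix().encode("utf-8"))
        h.update(b"\0")
        h.update(path.read_bytes())
        h.update(b"\0")
    return h.hexdigest()


def should_skip_startup(root: Path, stamp: Path) -> bool:
    """True when the stored fingerprint matches; otherwise record the new one.

    The stamp file must live outside root, or it would change the hash.
    """
    current = content_fingerprint(root)
    if stamp.exists() and stamp.read_text().strip() == current:
        return True
    stamp.write_text(current)
    return False


class ReadThroughCache(Generic[V]):
    """Read-through on miss; writers call invalidate() after changing the store."""

    def __init__(self, loader: Callable[[Hashable], V]) -> None:
        self._loader = loader
        self._store: Dict[Hashable, V] = {}

    def get(self, key: Hashable) -> V:
        if key not in self._store:
            self._store[key] = self._loader(key)  # miss: load once, then keep
        return self._store[key]

    def invalidate(self, key: Hashable) -> None:
        self._store.pop(key, None)  # next get() reloads from the source
```

Invalidating on write rather than updating in place keeps the cache from holding a value the database never committed; the cost is one extra read after each write.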
Queued¶
| Issue | Scope | Priority |
|---|---|---|
| LIS-285 | Slot done-tracker over-marks completion (audit-trail bug) | Med — needs repro data |
| LIS-284 | Unify tag taxonomy between problems and content_items | Med — Library discovery |
| LIS-292 | /library/add submit failure investigation | Low — possibly browser-side |
| Reflect-voice routing | Route hamming.py through the voice registry | Low |
| Reflect-IAC integration | Pull JD/CV IAC into reflect prompt context | Low |
| Multi-JD readiness aggregate | "What to focus on this week given all your JDs" | Low |
| Acceleration mode | Research-replication content + extra daily slots | Big |
| Calibrated hours-per-axis-point | Replace heuristic with attempt-derived value | Big, needs ≥20 attempts/user |
Security posture¶
| Concern | Status |
|---|---|
| GitHub OAuth via Supabase (PKCE, ES256 JWT) | ✅ shipped |
| CSP headers + form-action lockdown | ✅ shipped |
| Access allowlist (LIS_ALLOWED_USER_IDS) | ✅ shipped — single-tenant gate |
| Bearer-token MCP server auth | ✅ shipped (PR #243 / #246) |
| Row-Level Security on every table | ✅ shipped (migration 010 / PR #254-ish); defense-in-depth against direct PostgREST or anon-key paths. The bench's own postgres-role connection bypasses RLS |
| OAuth 2.1 shim for claude.ai web | parked (issue #244) |
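A single-tenant allowlist gate of the kind listed above can be very small. The sketch below assumes `LIS_ALLOWED_USER_IDS` holds a comma-separated list of user IDs compared case-insensitively; the actual format and the helper names `load_allowlist` and `is_allowed` are not confirmed by this page.

```python
import os
from typing import FrozenSet, Mapping, Optional


def load_allowlist(env: Optional[Mapping[str, str]] = None) -> FrozenSet[str]:
    """Parse LIS_ALLOWED_USER_IDS (assumed comma-separated) into a set of IDs."""
    source = os.environ if env is None else env
    raw = source.get("LIS_ALLOWED_USER_IDS", "")
    return frozenset(part.strip().lower() for part in raw.split(",") if part.strip())


def is_allowed(user_id: str, allowlist: FrozenSet[str]) -> bool:
    """Fail closed: an empty or missing allowlist admits nobody."""
    return user_id.strip().lower() in allowlist
```

Failing closed matters here: if the variable is unset, the gate should reject everyone, not silently open the instance.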
Parking lot — explicit deferrals¶
Ideas captured here so they don't sit in the operator's head. Each entry has a revisit trigger — the condition that should pull it back onto the active queue.
Expand /writings into a long-form drafting pipeline¶
The current /writings route is a draft list. The expansion adds: kanban-style pipeline (idea → outline → draft → ready → published), split-pane Markdown editor with live GFM preview, clean export for Substack / LW / AF / personal, audit-trailed Sonnet structure/clarity passes that never write substance.
Full design in docs/proposals/writing-pipeline.md; spec lives at GH #258. Operator logged it Low / phased on purpose.
Why parked. Phase 1 alone is a 1500-2000 LOC build with a real UI surface (the kanban board). That's 1-2 weeks competing directly with the capstone arXiv deadline (LIS-186, 2026-07-15). The writing habit itself can start now in any Markdown editor — the bench's pipeline is an accelerant for a habit, not a prerequisite to it.
Revisit when. Capstone is off the critical path AND the operator notices "where's my draft, what stage was it in, how do I export it" is actually slowing weekly writing cadence.
Hard constraint: the Sonnet guardrail (never writes substance) is non-negotiable. Build it into the API surface (no "generate draft" endpoint exists), the prompt (clarity-pass refuses to fill [TODO]s), and the UI (sidebar suggestions, never inline edits, explicit Apply/Discard).
/connecting — networking tracker with warmth monitor¶
Bi-directional networking surface. Tracks outreach + replies per contact, classifies by type (peer / mentor / field / initiative), surfaces a "warmth" signal so connections don't go cold without the operator watching them all. Default view is an action queue (only cooling/cold/new, most-overdue-first) — anti-obsession is a hard requirement.
Full design in docs/proposals/connecting-route.md; spec lives at GH #256. Operator logged it Low on purpose.
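The warmth signal and action queue described above might reduce to a classifier plus a filtered sort. This is a rough sketch only: the day thresholds, the `Contact` shape, and the choice to rank never-contacted entries first are all assumptions, and the proposal may define warmth differently.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical thresholds in days since the last outreach or reply.
COOLING_AFTER = 30
COLD_AFTER = 90


@dataclass
class Contact:
    name: str
    last_touch: Optional[date]  # None means nothing has been logged yet


def warmth(contact: Contact, today: date) -> str:
    """Classify a contact as new, warm, cooling, or cold."""
    if contact.last_touch is None:
        return "new"
    idle = (today - contact.last_touch).days
    if idle >= COLD_AFTER:
        return "cold"
    if idle >= COOLING_AFTER:
        return "cooling"
    return "warm"


def action_queue(contacts: List[Contact], today: date) -> List[Contact]:
    """Keep only new/cooling/cold contacts, most overdue first."""

    def idle_days(c: Contact) -> float:
        # Ranking never-contacted entries first is an arbitrary choice here.
        return float("inf") if c.last_touch is None else (today - c.last_touch).days

    pending = [c for c in contacts if warmth(c, today) != "warm"]
    return sorted(pending, key=idle_days, reverse=True)
```

Dropping warm contacts from the default view is what enforces the anti-obsession requirement: there is nothing to look at unless something needs action.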
Why parked. Capstone arXiv submission deadline (LIS-186) is 2026-07-15. Building /connecting before that trades the capstone for a tool, and the tool itself is meant to be low-cognitive-overhead, not a build project that consumes cognition.
Revisit when. Capstone is off the critical path (post-arXiv, after 2026-07-15). Pick up MVP scope (per the proposal's recommendation, without Linear mirror initially) when there's spare bandwidth.
/lab — eval-design playground¶
A /lab route where the operator specs a small alignment-shaped eval (e.g., "does Sonnet detect deceptive Llama-2-7B outputs given prompt template X?"), the bench runs it through the existing Anthropic API integration (plus optionally a small open-source model via Modal), and produces a notebook entry — hypothesis / method / raw results / interpretation. Each notebook is operator-flagged public-or-private; public ones become shareable portfolio artifacts at /lab/notebook/{slug}.
Why interesting. Designing small evals is the JD task for METR / Apollo / Anthropic Interp / Redwood. A /lab notebook that records hypothesis → method → results → takeaway is a directly shareable artifact of that skill. Three or four good notebooks IS an application portfolio.
Why parked. Bench-building is at risk of swallowing the time meant for interview prep itself. The current make-it-legible wave (BEN-74 through BEN-77, plus the BEN-79 → BEN-82 focus/journal/MCP stack) is finishing the platform as a portfolio artifact. Adding /lab re-opens it as a project for another 3–4 weeks. Distinguishing parked from "no" matters: this idea is the right shape, just not the right week.
Revisit when. Either (a) the operator finds themselves wanting to design an eval and reaches for the wrong tool, OR (b) the make-it-legible wave fully lands and the operator wants to commit a multi-week build before applications open. Revisit no later than 2 weeks from filing.
Not parked variants. "Lab as Jupyter-in-the-bench" (notebook hosting) and "lab as content-authoring playground" — both rejected on first pass. The eval-design framing is the only one worth doing.
"Open in Cursor" buttons¶
Skipped. Reviewers don't have Cursor installed; the operator is already in Cursor. Revisit only if the public-corpus repo (#196) ships and a public-facing "try it" flow becomes desirable — in which case the right answer is GitHub Codespaces ("Open in Codespaces" button), not Cursor.
Email-forward inbound for news / readings¶
A real inbox (content@bench.libearden.dev via Resend/Postmark inbound webhook) that turns forwarded newsletter emails into draft content/readings/ items. Defer until the RSS news-feed engine (Phase 1 of the content-generation engine, separate sub-issue) proves the news feed is actually used.
Revisit when. RSS news feed is shipped and operator finds themselves manually creating content/readings/*.md from emails ≥3 times.
Design decisions worth knowing¶
Anti-sycophancy is structural, not stylistic. Reviews and reflect prompts are required to identify specific gaps in JSON. The system prompts call out generic "keep up the great work" as the explicit failure mode.
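One way to make that requirement structural is to validate the model's JSON before it reaches the user. The sketch below assumes a hypothetical payload with a top-level `gaps` list of strings and a small hand-picked phrase list; the real schema and prompts are not shown on this page.

```python
import json
from typing import List

# Hypothetical examples of generic praise to reject.
GENERIC_PHRASES = ("keep up the great work", "great job", "well done")


def validate_review(raw: str) -> List[str]:
    """Return the review's gaps, or raise ValueError if none are specific."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("review must be a JSON object")
    gaps = data.get("gaps")
    if not isinstance(gaps, list) or not gaps:
        raise ValueError("review must name at least one specific gap")
    cleaned = []
    for gap in gaps:
        if not isinstance(gap, str) or not gap.strip():
            raise ValueError("each gap must be a non-empty string")
        if any(phrase in gap.lower() for phrase in GENERIC_PHRASES):
            raise ValueError("generic praise is not a gap")
        cleaned.append(gap.strip())
    return cleaned
```

A review that fails validation can be retried or replaced by the offline path, so a flattering but empty response never renders.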
Single-tab UX. The Today page has ONE next thing. The platform refuses to fan out into 12 "what should I do" widgets. If you want to break out, the library is one click.
Hints cap at structural scaffolding. Tier 3 is the end of the ladder. After that, you look at the reference solution and trace why — and that lookup is logged, affecting scheduler scoring.
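The ladder can be sketched as a capped counter plus a logged lookup flag. Everything below is illustrative: `AttemptLog`, `request_hint`, and especially the weights in `mastery_credit` are assumptions, since this page does not say how the scheduler scores hints or lookups.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_HINT_TIER = 3  # tier 3 is the end of the ladder


@dataclass
class AttemptLog:
    hints_used: List[int] = field(default_factory=list)
    viewed_solution: bool = False


def request_hint(log: AttemptLog) -> Optional[int]:
    """Unlock and record the next tier; return None once tier 3 has been used."""
    next_tier = max(log.hints_used, default=0) + 1
    if next_tier > MAX_HINT_TIER:
        return None
    log.hints_used.append(next_tier)
    return next_tier


def view_reference_solution(log: AttemptLog) -> None:
    """Record the lookup so the scheduler can account for it later."""
    log.viewed_solution = True


def mastery_credit(log: AttemptLog) -> float:
    """Hypothetical scoring: lose 0.25 per hint tier, halve on a solution lookup."""
    credit = 1.0 - 0.25 * len(log.hints_used)
    return credit * 0.5 if log.viewed_solution else credit
```

A lower credit would bring the problem back sooner, so the lookup costs scheduling priority without blocking the learner.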
JSONB over per-field columns. Settings, hints_used, axis_weights, parsed_json, and iac_profile are all JSONB. Rationale: each shape is small and user-controlled, and we can add an index when it matters.
No SPA. Server-rendered Jinja2 + HTMX for the small handful of partial swaps. Faster to develop, faster to debug, lower client-side complexity.
LLM-or-fallback, never LLM-only. Every Claude-flavored service has an offline path. Routes never branch on ANTHROPIC_API_KEY — that's the service's responsibility.
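The split of responsibility can be sketched as follows. `SummaryService`, `offline_summary`, and `summary_route` are hypothetical names, and the LLM call is injected as a plain callable so that no particular client API is assumed.

```python
import os
from typing import Callable, Dict, Optional


def offline_summary(text: str, max_words: int = 12) -> str:
    """Deterministic fallback: the first max_words words of the text."""
    words = text.split()
    suffix = "..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


class SummaryService:
    """Owns the choice between the LLM path and the offline path."""

    def __init__(
        self,
        llm_call: Optional[Callable[[str], str]] = None,
        api_key: Optional[str] = None,
    ) -> None:
        # Only the service reads the key; callers never see it.
        self._api_key = api_key if api_key is not None else os.environ.get("ANTHROPIC_API_KEY")
        self._llm_call = llm_call

    def summarize(self, text: str) -> str:
        if self._api_key and self._llm_call is not None:
            try:
                return self._llm_call(text)
            except Exception:
                pass  # any LLM failure falls through to the offline path
        return offline_summary(text)


def summary_route(service: SummaryService, text: str) -> Dict[str, str]:
    """A route calls the service and never inspects ANTHROPIC_API_KEY itself."""
    return {"summary": service.summarize(text)}
```

Because the route has no branch on the key, the same handler and the same tests run with or without network access.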
See also¶
- LIS-281 phase tracker on Linear (canonical)
- Architecture: code tour