Auto E2E Test Architecture

Why CLI is the primary surface

• Exercises the full real backend pipeline
• `one_shot --json` is scriptable
• No DOM flake (skip UI testing for most flows)
• External agent can drive prompts as if it were a user
• Privy session reused, no re-login per test
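
A minimal sketch of how an external agent might drive the CLI. Only `one_shot --json` and `runId` come from these notes; the argument order, the presence of `runId` in the JSON output, and the names `OneShotResult`, `CliRunner` and `runOneShot` are assumptions for illustration. The runner is injected so the sketch does not assume how the binary is actually invoked.

```typescript
// Hypothetical shape of the JSON printed by `one_shot --json`.
// Only `runId` is suggested by the notes; its presence in the output is assumed.
interface OneShotResult {
  runId: string;
  [key: string]: unknown;
}

// Injected runner: takes CLI arguments, resolves with stdout.
type CliRunner = (args: string[]) => Promise<string>;

async function runOneShot(run: CliRunner, prompt: string): Promise<OneShotResult> {
  // Argument order is an assumption; adjust to the real CLI.
  const stdout = await run(["one_shot", "--json", prompt]);
  const parsed: unknown = JSON.parse(stdout);
  if (
    typeof parsed !== "object" ||
    parsed === null ||
    typeof (parsed as { runId?: unknown }).runId !== "string"
  ) {
    throw new Error("one_shot output did not contain a string runId");
  }
  return parsed as OneShotResult;
}
```

Keeping the runner injectable also lets the harness itself be unit-tested without spending real funds.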

Trace = `logs` table

• All processing is captured per runId
• `scripts/reconstruct-chat.ts --json` is canonical
• Covers skill router selection, ACL loop, actions, results
• Plus `auto-chat-debugger` skill for analysis
• No new audit table needed

Poll before reading (critical)

• Logs are written async (`log-queue.ts` batches)
• Poll `messageProcessing.status` until completed
• Then run `scripts/reconstruct-chat.ts --json`
• Timeout = 30s default (LLM can be slow)
• `loopExitedByCap: true` = test should fail anyway
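
The poll-then-read rule above can be sketched as follows. The 30s default and the `completed` status come from these notes; the 500ms interval, the `getStatus` callback, the function names, and the assumption that `loopExitedByCap` sits at the top level of the reconstructed trace are all hypothetical.

```typescript
interface PollOptions {
  timeoutMs?: number;  // notes specify 30s as the default
  intervalMs?: number; // assumed; not specified in the notes
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until getStatus() resolves to "completed", or throws once the
// next wait would pass the deadline.
async function pollUntilCompleted(
  getStatus: () => Promise<string>,
  { timeoutMs = 30000, intervalMs = 500 }: PollOptions = {},
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const status = await getStatus();
    if (status === "completed") return;
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`messageProcessing.status still "${status}" after ${timeoutMs}ms`);
    }
    await sleep(intervalMs);
  }
}

// Fails fast when the trace reports that the loop hit its cap.
function failIfCapped(trace: { loopExitedByCap?: boolean }): void {
  if (trace.loopExitedByCap === true) {
    throw new Error("loopExitedByCap is true: run hit the loop cap");
  }
}
```

Only after `pollUntilCompleted` resolves is it safe to run the reconstruct script, since the log queue may still be flushing before that point.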

Judge — 4 dimensions, tolerant

• skillsMatched — registry vs orchestrator output
• actionsExecuted — log type === "action", success
• stateAssertions — Convex/on-chain queries pass
• synthesisQuality — optional LLM rubric per test
• Tolerance set per-test; not all tests use LLM judge
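
One way the four dimensions and per-test tolerance could fit together is sketched below. The dimension names and the `type === "action"` check come from these notes; the `Tolerance` fields, the overlap-fraction rule for skills, and the 0 to 1 rubric score are assumptions, not the actual judge.

```typescript
interface LogEntry { type: string; success?: boolean }

interface JudgeInput {
  expectedSkills: string[];              // from the skill registry
  selectedSkills: string[];              // from the orchestrator output
  logs: LogEntry[];
  stateAssertions: Array<() => boolean>; // Convex/on-chain checks, pre-resolved
  synthesisScore?: number;               // optional LLM rubric score in [0, 1]
}

// Hypothetical per-test tolerance knobs.
interface Tolerance {
  minSkillOverlap: number;    // fraction of expected skills that must be selected
  minActions: number;         // minimum number of action logs expected
  minSynthesisScore?: number; // omit to skip the LLM judge
}

interface Verdict {
  skillsMatched: boolean;
  actionsExecuted: boolean;
  stateAssertions: boolean;
  synthesisQuality: boolean;
  pass: boolean;
}

function judge(input: JudgeInput, tol: Tolerance): Verdict {
  const hits = input.expectedSkills.filter(
    (s) => input.selectedSkills.indexOf(s) !== -1,
  ).length;
  const overlap =
    input.expectedSkills.length === 0 ? 1 : hits / input.expectedSkills.length;
  const skillsMatched = overlap >= tol.minSkillOverlap;

  const actions = input.logs.filter((l) => l.type === "action");
  const actionsExecuted =
    actions.length >= tol.minActions && actions.every((a) => a.success === true);

  const stateAssertions = input.stateAssertions.every((check) => check());

  // When no threshold is set, the LLM judge is skipped and the dimension passes.
  const synthesisQuality =
    tol.minSynthesisScore === undefined ||
    (input.synthesisScore !== undefined &&
      input.synthesisScore >= tol.minSynthesisScore);

  const pass = skillsMatched && actionsExecuted && stateAssertions && synthesisQuality;
  return { skillsMatched, actionsExecuted, stateAssertions, synthesisQuality, pass };
}
```

Returning each dimension separately, rather than a single boolean, makes it easier to see which dimension caused a failure.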

Real money discipline

• Test wallet ~$300 multi-chain
• Per-test budget: $1-2 (small trades only)
• Daily CI cap: $20
• Batch afterAll cleanup per domain
• Polymarket + Hyperliquid = mainnet only (no testnet)
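
The spending limits above could be enforced by a small guard that each test calls before trading. The $2 per-test and $20 daily figures come from these notes; `createBudgetGuard`, `reserve`, and the in-memory daily counter are hypothetical (a real CI setup would need to persist the running total across jobs).

```typescript
interface BudgetGuard {
  reserve(estimatedUsd: number): void; // throws if a cap would be exceeded
  remainingUsd(): number;
}

function createBudgetGuard(perTestCapUsd = 2, dailyCapUsd = 20): BudgetGuard {
  let spentTodayUsd = 0; // in-memory only; persistence is out of scope here

  return {
    reserve(estimatedUsd: number): void {
      if (estimatedUsd > perTestCapUsd) {
        throw new Error(
          `test wants $${estimatedUsd}, per-test cap is $${perTestCapUsd}`,
        );
      }
      if (spentTodayUsd + estimatedUsd > dailyCapUsd) {
        throw new Error(
          `daily cap $${dailyCapUsd} would be exceeded (spent $${spentTodayUsd})`,
        );
      }
      spentTodayUsd += estimatedUsd;
    },
    remainingUsd(): number {
      return dailyCapUsd - spentTodayUsd;
    },
  };
}
```

Reserving before the trade, rather than recording after it, means a test that would breach a cap never touches mainnet.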

Env = develop

• Self-hosted Convex (identical to production)
• Internal team uses it daily; our test wallet won't get in their way
• If develop passes, prod should pass (same self-hosted setup)
• Staging = Convex Cloud (differs from the self-hosted setup, skip)
• Prod = real users, skip auto tests