- • Exercises the full real backend pipeline
- • `one_shot --json` is scriptable
- • No DOM flake (skip UI testing for most flows)
- • External agent can drive prompts as if it were a user
- • Privy session reused, no re-login per test
- • All processing is captured per runId
- • `scripts/reconstruct-chat.ts --json` is canonical
- • Skill router selection, ACL loop, actions, results
- • Plus `auto-chat-debugger` skill for analysis
- • No new audit table needed
- • Logs are written async (`log-queue.ts` batches)
- • Poll `messageProcessing.status` until completed
- • Then run reconstruct-chat.ts
- • Timeout = 30s default (LLM can be slow)
- • `loopExitedByCap: true` = test should fail anyway
- • skillsMatched — registry vs orchestrator output
- • actionsExecuted — log type === "action", success
- • stateAssertions — Convex/on-chain queries pass
- • synthesisQuality — optional LLM rubric per test
- • Tolerance set per-test; not all tests use LLM judge
- • Test wallet ~$300 multi-chain
- • Per-test budget: $1-2 (small trades only)
- • Daily CI cap: $20
- • Batch afterAll cleanup per domain
- • Polymarket + Hyperliquid = mainnet only (no testnet)
- • Self-hosted Convex (same setup as production)
- • Internal team uses it daily; our test wallet won't disturb them
- • If develop passes, prod will pass
- • Staging = Convex Cloud (not symmetric with the self-hosted setup, skip)
- • Prod = real users, skip auto tests
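The wait step in the notes above (poll `messageProcessing.status` until completed, 30s default timeout, then run `reconstruct-chat.ts`) could look roughly like the sketch below. Only the `status` field and the `"completed"` value come from the notes. `ProcessingSnapshot`, `FetchSnapshot`, `waitForCompletion`, and the 500 ms interval are hypothetical, and the actual Convex query is injected as a callback because its name is not given here.

```typescript
// Hypothetical shape: the notes only name `status` and the value "completed".
interface ProcessingSnapshot {
  status: string;
}

// The real query against messageProcessing is injected; its name is unknown here.
type FetchSnapshot = (runId: string) => Promise<ProcessingSnapshot>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls until the run reports "completed", or throws once the timeout elapses.
async function waitForCompletion(
  runId: string,
  fetchSnapshot: FetchSnapshot,
  timeoutMs = 30_000, // 30s default from the notes (LLM can be slow)
  intervalMs = 500, // assumed polling interval
): Promise<ProcessingSnapshot> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const snapshot = await fetchSnapshot(runId);
    if (snapshot.status === "completed") {
      return snapshot; // safe to run reconstruct-chat.ts for this runId now
    }
    if (Date.now() >= deadline) {
      throw new Error(
        `run ${runId}: not completed after ${timeoutMs} ms (last status: ${snapshot.status})`,
      );
    }
    await sleep(intervalMs);
  }
}
```

Injecting the fetcher keeps the helper testable without a live backend; a test can pass a stub that flips to `"completed"` after a few calls.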
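The four grading dimensions listed above (skillsMatched, actionsExecuted, stateAssertions, synthesisQuality) can be combined into a single verdict. This is a minimal sketch under stated assumptions: the shapes `LogEntry`, `ReconstructedRun`, `Expectation`, and `Verdict`, the `name` field on logs, and the function `gradeRun` are all hypothetical, and the real output of `reconstruct-chat.ts --json` may be structured differently. Only `type === "action"`, `success`, and `loopExitedByCap` are taken from the notes.

```typescript
// Hypothetical shapes; the real reconstruct-chat.ts --json output may differ.
interface LogEntry {
  type: string;
  name: string; // assumed field identifying the action
  success?: boolean;
}
interface ReconstructedRun {
  skills: string[]; // skills the orchestrator selected (assumed field)
  logs: LogEntry[];
  loopExitedByCap: boolean;
}
interface Expectation {
  skills: string[];
  actions: string[];
  stateAssertions: Array<() => Promise<boolean>>; // Convex / on-chain checks
  // Optional LLM rubric; the tolerance (minScore) is set per test.
  synthesis?: { score: (run: ReconstructedRun) => Promise<number>; minScore: number };
}
interface Verdict {
  skillsMatched: boolean;
  actionsExecuted: boolean;
  stateAssertions: boolean;
  synthesisQuality: boolean | null; // null when no rubric is configured
  passed: boolean;
}

async function gradeRun(run: ReconstructedRun, expected: Expectation): Promise<Verdict> {
  // skillsMatched: every expected skill appears in the orchestrator output.
  const selected = new Set(run.skills);
  const skillsMatched = expected.skills.every((s) => selected.has(s));

  // actionsExecuted: every expected action has a successful "action" log.
  const succeeded = new Set(
    run.logs.filter((l) => l.type === "action" && l.success === true).map((l) => l.name),
  );
  const actionsExecuted = expected.actions.every((a) => succeeded.has(a));

  // stateAssertions: all supplied checks must resolve to true.
  const results = await Promise.all(expected.stateAssertions.map((check) => check()));
  const stateAssertions = results.every(Boolean);

  // synthesisQuality: only evaluated when the test opts in to the LLM judge.
  let synthesisQuality: boolean | null = null;
  if (expected.synthesis) {
    const score = await expected.synthesis.score(run);
    synthesisQuality = score >= expected.synthesis.minScore;
  }

  // A run that hit the loop cap fails regardless of the other dimensions.
  const passed =
    !run.loopExitedByCap &&
    skillsMatched &&
    actionsExecuted &&
    stateAssertions &&
    synthesisQuality !== false;

  return { skillsMatched, actionsExecuted, stateAssertions, synthesisQuality, passed };
}
```

Keeping `synthesisQuality` as `null` when no rubric is supplied lets tests without an LLM judge pass on the three deterministic dimensions alone.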