Auto E2E Test Architecture

CLI-centric — real backend pipeline — trace via Convex `logs` table — skill-registry-driven judge

1 - TEST DRIVER (external agent scripts prompts)

  • External Agent: scripted LLM (Gemini)
      • simulates a user
      • generates prompts per test
  → prompt text
  • `auto-cli one_shot --json` (src/packages/cli)
      • single-shot mode
      • Privy session reused
  → REST + Convex
  • Test wallet (Privy): ~$300 multi-chain
      • Polygon · Base · Arb · OP
      • ETH · Solana · Hyperliquid
  • Env target: develop (develop.auto.fun)
      • Convex self-hosted (iso prod)
      • no real users impacted
  • Playwright UI: ~5 critical visual tests only
  • REST/Convex direct: isolated action contracts, secondary surfaces

2 - AUTO BACKEND (full real pipeline: the thing under test)

auto-server + Convex

  1. message bus (MessageBusService)
      • claims messageProcessing: pending → processing
      • runId assigned
  2. skill router (selectSkillsViaLlm)
      • picks contextual skills from the 75-skill registry
      • returns intent + routes
  3. ACL action loop (action-call-loop.ts)
      • LLM decides next action
      • executes action handler
      • loops until finish
  4. side effects (on-chain + Convex writes)
      • Privy app-signer broadcasts
      • domain tables updated (positions, snapshots, ...)
  5. response synthesis (finish-decision OBJECT_*)
      • agent text generated
      • inserted in centralMessages
      • processing → completed
  → CLI response (stdout JSON)
      • { text, messageId, channelId, runId }
      • exit code 0/1

3 - TRACE CAPTURE (everything is in Convex `logs`)

Convex source-of-truth tables

  • logs: scoped by runId
      • type: run_event
      • type: useModel:OBJECT_* (router + ACL loop)
      • type: action (debugBody with params, result)
  • messageProcessing: queue state
      • status: pending → processing → completed/failed
      • ⚠ poll status BEFORE querying logs
  • centralMessages: user + agent messages
      • source: agent_response
      • source: agent_action
      • rawMessage.actionResult
      • rawMessage.actionStatus (failed / completed)
  → writes
  • scripts/reconstruct-chat.ts: canonical trace API
      • `bun scripts/reconstruct-chat.ts <channelId> --env develop --json --out /tmp/trace.json`
      • reads logs + messages + memories
      • groups by runId chronologically
      • dereferences blob _bodyStorageId
      • → ready-to-feed structured JSON
  → test harness extracts
  • e2e/lib/run-trace.ts
      • skillOrchestrator.skills[]
      • skillOrchestrator.intent
      • actions[].{name, params, result}
      • aclLoops[].messages
      • finalResponse text
      • processingStatus
      • + Convex domain queries (balances, positions, etc.)

4 - JUDGE (3 deterministic checks + 1 LLM rubric)

  • Skill Registry: SKILL_REGISTRY (75)
      • actions per skill
      • messageExamples
      • antiExamples
      • synthesisFragmentFn
      • precondition(state)
      • → machine-readable contract
  • Test expectations: per-test config
      • expectedSkills[]
      • expectedActions[]
      • expectedIntent
      • responseRubric (optional)
      • stateAssertions[]
      • → tolerance per test (judge variation OK)
  • Judge LLM: Gemini Reasoning
      • DETERMINISTIC (cheap): skillsMatched, actionsExecuted + success, stateAssertions pass
      • LLM RUBRIC (optional): synthesisQuality, gaps[] for bug reports
  • Verdict: { pass, reason, gaps[] }
      • ✓ skillsMatched
      • ✓ actionsExecuted
      • ✓ stateAssertions
      • ~ synthesisQuality (LLM)
      • gaps[] = bug reports: features promised by the registry but not delivered
      • expect(verdict.pass).toBe(true)
  • Cleanup discipline: batch afterAll per domain
      • polymarket: sell positions
      • hyperliquid: close positions
      • swap: re-swap to USDC
      • limit orders: cancel all
      • send/transfer: refund
      • via API direct, not agent (faster + cheaper + deterministic)
      • bridge = manual only

KEY DECISIONS

  • CLI is the primary surface (not Playwright).
  • Trace via existing `logs` table + `reconstruct-chat.ts`; no backend modification needed.
  • Env = develop (iso prod, no user impact).
  • Judge tolerates variation per test.
  • Cleanup = batch afterAll per domain.
  • Test wallet ~$300 multi-chain, $20/day CI budget cap.
  • Real money, no testnet (Polymarket + Hyperliquid are mainnet-only).

Legend (diagram color key): data source · backend pipeline · test harness · configuration · judge / decision · trace tool

Why CLI is the primary surface

  • Exercises the full real backend pipeline
  • `one_shot --json` is scriptable
  • No DOM flake (skip UI testing for most flows)
  • External agent can drive prompts as if it were a user
  • Privy session reused, no re-login per test
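
Because the CLI returns machine-readable JSON plus an exit code, a test can validate the response before doing anything else. The sketch below is a minimal illustration of that step, not the harness's actual code: `parseOneShotOutput` is a hypothetical helper, and it assumes all four response fields are strings.

```typescript
// Shape of the JSON printed on stdout by `one_shot --json`, per this document.
// Assumption: all four fields are strings.
interface OneShotResponse {
  text: string;
  messageId: string;
  channelId: string;
  runId: string;
}

const REQUIRED_FIELDS = ["text", "messageId", "channelId", "runId"] as const;

// Hypothetical helper: rejects a non-zero exit code or a malformed payload,
// otherwise returns the typed response.
function parseOneShotOutput(stdout: string, exitCode: number): OneShotResponse {
  if (exitCode !== 0) {
    throw new Error(`CLI exited with code ${exitCode}`);
  }
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("CLI output is not a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  for (const field of REQUIRED_FIELDS) {
    if (typeof record[field] !== "string") {
      throw new Error(`CLI output is missing string field "${field}"`);
    }
  }
  return {
    text: record["text"] as string,
    messageId: record["messageId"] as string,
    channelId: record["channelId"] as string,
    runId: record["runId"] as string,
  };
}
```

The returned `channelId` and `runId` are the keys a test would then use to look up the trace for that run.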

Trace = `logs` table

  • All processing is captured per runId
  • `scripts/reconstruct-chat.ts --json` is canonical
  • Covers skill router selection, ACL loop, actions, results
  • Plus `auto-chat-debugger` skill for analysis
  • No new audit table needed
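
To make "captured per runId" concrete, here is a rough sketch of grouping log rows by run in chronological order. It is an illustration only and not the logic of `reconstruct-chat.ts`: the `LogEntry` shape, the `createdAt` field name, and `groupLogsByRun` are assumptions.

```typescript
// Assumed minimal shape of a row from the `logs` table.
// `createdAt` is a hypothetical timestamp field (milliseconds).
interface LogEntry {
  runId: string;
  type: string; // e.g. "run_event", "action"
  createdAt: number;
}

// Sorts entries by time, then buckets them by runId.
// Map preserves insertion order, so runs appear in order of their first entry.
function groupLogsByRun(logs: LogEntry[]): Map<string, LogEntry[]> {
  const sorted = [...logs].sort((a, b) => a.createdAt - b.createdAt);
  const byRun = new Map<string, LogEntry[]>();
  for (const entry of sorted) {
    const bucket = byRun.get(entry.runId) ?? [];
    bucket.push(entry);
    byRun.set(entry.runId, bucket);
  }
  return byRun;
}
```

Filtering one run's bucket on `type === "action"` would then give the ordered list of actions for that run.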

Poll before reading (critical)

  • Logs are written asynchronously (`log-queue.ts` batches them)
  • Poll `messageProcessing.status` until it is completed (or failed)
  • Only then run reconstruct-chat.ts
  • Default timeout: 30s (the LLM can be slow)
  • `loopExitedByCap: true` means the test should fail regardless
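
The poll-then-read rule can be sketched as a small helper. This is a minimal illustration under assumptions: `waitForCompletion` and the injected `getStatus` callback are hypothetical, and the 500 ms interval is an arbitrary choice; only the status values and the 30s default come from this document.

```typescript
type ProcessingStatus = "pending" | "processing" | "completed" | "failed";

// A run is finished once it reaches one of the two terminal states.
function isTerminal(status: ProcessingStatus): boolean {
  return status === "completed" || status === "failed";
}

// Polls `getStatus` until a terminal state or the deadline.
// `getStatus` is injected so the sketch stays independent of any Convex client.
async function waitForCompletion(
  getStatus: () => Promise<ProcessingStatus>,
  timeoutMs: number = 30000,
  intervalMs: number = 500,
): Promise<ProcessingStatus> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    const status = await getStatus();
    if (isTerminal(status)) {
      return status;
    }
    if (Date.now() >= deadline) {
      throw new Error(`Still "${status}" after ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A test would await this helper first and read the trace only when it resolves; a timeout or a `failed` status fails the test without touching the logs.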

Judge — 4 dimensions, tolerant

  • skillsMatched: registry vs orchestrator output
  • actionsExecuted: log type === "action", success
  • stateAssertions: Convex/on-chain queries pass
  • synthesisQuality: optional LLM rubric per test
  • Tolerance set per-test; not all tests use LLM judge
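
The three deterministic dimensions can be folded into a single verdict. The sketch below is one possible shape and not the project's judge: `judgeDeterministic` and the simplified `RunTrace` type are assumptions, while `expectedSkills`, `expectedActions`, and the `{ pass, reason, gaps[] }` verdict follow the names used in this document. The optional LLM rubric is left out.

```typescript
interface TestExpectations {
  expectedSkills: string[];
  expectedActions: string[];
}

// Assumed, simplified view of what the harness extracts from a trace.
interface RunTrace {
  skills: string[];
  actions: { name: string; success: boolean }[];
  stateAssertionsPassed: boolean;
}

interface Verdict {
  pass: boolean;
  reason: string;
  gaps: string[];
}

// Runs the three cheap checks and records every miss as a gap.
function judgeDeterministic(expected: TestExpectations, trace: RunTrace): Verdict {
  const gaps: string[] = [];
  for (const skill of expected.expectedSkills) {
    if (!trace.skills.includes(skill)) {
      gaps.push(`skill not selected: ${skill}`);
    }
  }
  for (const name of expected.expectedActions) {
    const attempts = trace.actions.filter((action) => action.name === name);
    if (attempts.length === 0) {
      gaps.push(`action not executed: ${name}`);
    } else if (!attempts.some((action) => action.success)) {
      gaps.push(`action failed: ${name}`);
    }
  }
  if (!trace.stateAssertionsPassed) {
    gaps.push("state assertions failed");
  }
  const pass = gaps.length === 0;
  return { pass, reason: pass ? "all deterministic checks passed" : gaps.join("; "), gaps };
}
```

Collecting every gap instead of stopping at the first failure keeps the verdict usable as a bug report.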

Real money discipline

  • Test wallet ~$300 multi-chain
  • Per-test budget: $1-2 (small trades only)
  • Daily CI cap: $20
  • Batch afterAll cleanup per domain
  • Polymarket + Hyperliquid = mainnet only (no testnet)
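
The two spending limits lend themselves to a guard that runs before each money-moving test. This is a hypothetical sketch: `canSpend` is not an existing helper, the $2 per-test ceiling takes the upper end of the $1-2 range, and how the harness would track today's spend is left open.

```typescript
// Limits taken from the figures above.
const PER_TEST_BUDGET_USD = 2;
const DAILY_CAP_USD = 20;

// Returns true only if the test fits both the per-test ceiling and the
// remaining daily budget.
function canSpend(spentTodayUsd: number, testCostUsd: number): boolean {
  if (testCostUsd > PER_TEST_BUDGET_USD) {
    return false;
  }
  return spentTodayUsd + testCostUsd <= DAILY_CAP_USD;
}
```

A suite could skip, instead of fail, any test for which this returns false, so hitting the cap does not look like a product regression.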

Env = develop

  • Self-hosted Convex (same setup as production)
  • The internal team uses it daily; our test wallet won't disturb them
  • If develop passes, prod will pass
  • Staging runs on Convex Cloud (not symmetric with prod, so skipped)
  • Prod has real users, so no automated tests run there