
Research Area 02
Real-World Simulation
A simulation engine for hybrid AI-human task execution — calibrated against production data to model real failure modes, reply dynamics, and third-party behavior.
Coordination Benchmark
Measuring coordination quality across the full task horizon
Real tasks span days, require AI and human operators to work in tandem, and break in ways that only show up at scale: vendors who ghost, schedules that conflict, systems that fail, plans that fall apart halfway through. We continuously generate and run scenarios against the production stack, scoring each on progress, coherence, and six behavioral dimensions. Below are two representative examples from our benchmark suite.
Sample Scenarios
Property Manager Day
Multi-property coordination
Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.
[Chart: per-turn MCS score curve (normalized 0–100), annotated at inflection points, with per-dimension breakdown]
Wedding Planner
Cascading dependencies
Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.
[Chart: per-turn MCS score curve (normalized 0–100), annotated at inflection points, with per-dimension breakdown]
Scoring Methodology
Multi-turn Coordination Score
Composite metric combining milestone achievement (weighted by dependency depth), behavioral coherence across the full turn horizon, and six independently scored sub-dimensions. Normalized 0–100 against scenario-specific theoretical maximums.
Per-turn evaluation
An LLM evaluator reviews the full conversation state at each turn independently — no lookahead, no access to future turns. This produces score curves that reveal where coordination breaks down, not just final outcomes.
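In sketch form, the loop hands the evaluator only a prefix of the conversation at each step. The Transcript, slice, and evaluate_turn names below are illustrative placeholders, not our production API:

    from dataclasses import dataclass

    @dataclass
    class TurnScore:
        turn: int
        mcs: float                    # composite score, normalized 0-100
        dimensions: dict[str, float]  # six behavioral sub-dimensions

    def score_curve(transcript, evaluator) -> list[TurnScore]:
        """Score every turn independently: the evaluator sees only the
        conversation state up to and including the current turn."""
        scores = []
        for t in range(len(transcript.turns)):
            prefix = transcript.slice(0, t + 1)  # no lookahead past turn t
            scores.append(evaluator.evaluate_turn(prefix, turn=t))
        return scores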
Blind-actor protocol
Member and vendor actors are separate LLMs with persona context only — no knowledge of expected outcomes or scoring criteria. Vendor responses arrive asynchronously, with latency distributions calibrated to production data.
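A minimal sketch of the actor-side isolation, assuming a simple prompt-assembly step (the persona fields and function name here are hypothetical):

    def build_actor_context(persona: dict, visible_messages: list[str]) -> str:
        """Blind-actor prompt: the persona plus only the messages this
        actor has actually received. Expected milestones, scoring
        criteria, and failure conditions are deliberately withheld."""
        return "\n".join([
            f"You are {persona['name']}, {persona['role']}.",
            f"Constraints: {persona['constraints']}",
            "Conversation so far:",
            *visible_messages,
        ])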
Simulation Architecture
Scenario Engine
Production data in, synthetic scenarios out
A generative model transforms real production task data into fully coherent simulation scenarios — preserving the statistical properties of our task distribution while enabling controlled evaluation. Each scenario defines member persona, task constraints, vendor landscape, expected milestones, dependency graphs, and failure conditions.
- Drawn from 100k+ production task trajectories
- Configurable complexity, domain, and failure modes
- Milestone graphs with weighted dependency chains
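A scenario definition might look roughly like the following sketch. The field names are illustrative, and the production schema is richer:

    from dataclasses import dataclass, field

    @dataclass
    class Milestone:
        name: str
        depends_on: list[str] = field(default_factory=list)
        weight: float = 1.0   # scaled by dependency depth at scoring time

    @dataclass
    class Scenario:
        member_persona: dict           # who is asking, and how they communicate
        task_constraints: dict         # hard dates, budget ceilings, etc.
        vendors: list[dict]            # vendor landscape with availability windows
        milestones: list[Milestone]    # expected milestones + dependency graph
        failure_conditions: list[str]  # what counts as an unrecoverable failure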
Multi-Turn Simulation Loop
Full-stack execution under realistic conditions
The system under test runs the complete production orchestration stack — the same state machine, routing engine, guardrails, and escalation logic that handle real tasks. Blind actors replace members; calibrated response models replace vendors. Turns execute asynchronously across simulated time — vendor emails arrive hours later, members go silent overnight, operators rotate between shifts.
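In sketch form, this is a priority queue over simulated time. Names like stack.handle, scenario.opening_request, and reply.latency_hours are placeholders for the real interfaces:

    import heapq
    import itertools

    def run_simulation(scenario, stack, horizon_hours=72.0):
        """Pop the next scheduled event, let the system under test react,
        and schedule any replies at their sampled arrival times."""
        seq = itertools.count()  # tie-breaker so heap entries stay comparable
        events = [(0.0, next(seq), "member_message", scenario.opening_request)]
        while events:
            now, _, kind, payload = heapq.heappop(events)
            if now >= horizon_hours:
                break
            for reply in stack.handle(kind, payload, now=now):
                arrival = now + reply.latency_hours  # calibrated delay model
                heapq.heappush(events, (arrival, next(seq), reply.kind, reply.payload))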
Orchestrator Agent
Event-driven state machine with 17+ trigger types, tool invocation, and continuation context across async re-entries (a minimal sketch follows this component list)
Researcher Agent
Web search, Maps API, vendor discovery, and option-sheet generation with browser automation
Routing & SLA Engine
Skill-affinity matching, dynamic SLA computation, queue assignment, and operator rotation logic
Escalation Pipeline
Complexity prediction, human handoff with origin tracking, and de-escalation patterns
Quality Guardrails
Out-of-scope classification, entity-level hallucination detection, sentiment analysis, and divergent trajectory detection
SOPs & Workflows
Structured playbooks for operator guidance with step validation and deviation detection
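To make the orchestrator's event-driven design concrete, here is a deliberately minimal sketch. The states, triggers, and handlers are illustrative; the production machine covers 17+ trigger types:

    class Orchestrator:
        """Sketch of an event-driven state machine with continuation
        context: state persisted between turns lets an async re-entry
        (e.g. a vendor reply hours later) resume where the task left off."""

        def __init__(self):
            self.state = "awaiting_member"
            self.context: dict = {}  # survives across async re-entries

        def dispatch(self, trigger: str, payload) -> None:
            transitions = {
                ("awaiting_member", "member_message"): self.plan_task,
                ("awaiting_vendor", "vendor_reply"): self.advance_plan,
                ("awaiting_vendor", "sla_breach"): self.escalate,
            }
            handler = transitions.get((self.state, trigger))
            if handler is not None:
                handler(payload)

        def plan_task(self, payload):
            self.context["plan"] = ["discover_vendors", "request_quotes"]
            self.state = "awaiting_vendor"

        def advance_plan(self, payload):
            self.context.setdefault("vendor_replies", []).append(payload)

        def escalate(self, payload):
            self.state = "escalated_to_human"  # human handoff with origin tracking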
Blind-Actor Members
LLM-driven personas shaped by real interaction patterns from production data, held to scenario constraints — with no knowledge of expected outcomes or scoring criteria
Vendor Response Models
Calibrated on proprietary interaction data — reply latency, negotiation patterns, availability constraints, and non-response dynamics
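As a sketch of what a calibrated response model can look like (the profile fields and the log-normal choice are assumptions for illustration; the real parameters are fit offline to proprietary interaction data):

    import random

    def sample_vendor_reply(profile: dict, rng: random.Random):
        """Draw reply behavior from a calibrated vendor profile: some
        outreach is never answered at all, and the rest arrives after a
        long-tailed delay."""
        if rng.random() < profile["non_response_rate"]:
            return None  # the vendor ghosts
        delay_hours = rng.lognormvariate(
            profile["log_latency_mu"], profile["log_latency_sigma"]
        )
        return {"delay_hours": delay_hours}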
Multi-turn Coordination Score
Per-turn evaluation across six dimensions
An LLM-based evaluation model reviews the full conversation state at each turn, producing a per-turn coordination score (MCS). The composite metric combines milestone achievement — weighted by task difficulty and dependency depth — with behavioral coherence across the full turn horizon. Six sub-dimensions are scored independently: state accuracy, member communication, vendor coordination, cascade awareness, recovery quality, and plan consistency. All scores are normalized against scenario-specific theoretical maximums.
- Independent per-turn evaluation — no lookahead bias
- Score curves capture both progress velocity and behavioral quality
- Dimension weights configurable per scenario class
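One way such a composite could be assembled, shown as a sketch: the (1 + depth) milestone weighting and the linear mix below are illustrative assumptions, not the published formula, and the milestone objects are assumed to carry precomputed weight and depth attributes:

    DIMENSIONS = ("state_accuracy", "member_communication", "vendor_coordination",
                  "cascade_awareness", "recovery_quality", "plan_consistency")

    def mcs(milestones_hit, dim_scores: dict[str, float],
            dim_weights: dict[str, float], theoretical_max: float) -> float:
        """Composite per-turn MCS: dependency-weighted milestone credit
        plus weighted behavioral dimensions, normalized 0-100 against
        the scenario's theoretical maximum."""
        # m.weight and m.depth are assumed derived from the dependency graph
        milestone_credit = sum(m.weight * (1 + m.depth) for m in milestones_hit)
        behavior = sum(dim_weights[d] * dim_scores[d] for d in DIMENSIONS)
        return 100.0 * (milestone_credit + behavior) / theoretical_max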
Together, these pieces close the loop: every benchmark scenario exercises the real production stack end to end, and the per-turn MCS curve shows not just whether a task succeeded, but exactly where coordination held and where it broke.