DuckbillResearch

Research Area 02

Real-World Simulation

A simulation engine for hybrid AI-human task execution — calibrated against production data to model real failure modes, reply dynamics, and third-party behavior.

Coordination Benchmark

Measuring coordination quality across the full task horizon

Real tasks span days, require AI and human operators working in tandem, and break in ways that only show up at scale — vendors who ghost, schedules that conflict, systems that fail, plans that fall apart halfway through. We continuously generate and run scenarios against the production stack, scoring each on progress, coherence, and six behavioral dimensions. Below are two representative examples from our benchmark suite.

Sample Scenarios

Property Manager Day

Multi-property coordination

Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.

MCS: 76 · Vendors: 4 · Duration: 3 days · Milestones: 13/17 · Complexity: High

MCS Score Curve (normalized 0–100) — interactive per-turn chart with annotated inflection points (final score 76)

Phone Calls: 11 · Vendor Emails: 18 · Member Msgs: 39 · Escalations: 1
Progress: 76 · Coherence: 82 · Composite: 76

Dimension Breakdown

State Tracking: 84 · Member Comms: 76 · Vendor Mgmt: 72 · Cascade Awareness: 60 · Recovery: 80 · Plan Coherence: 82

Wedding Planner

Cascading dependencies

Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.

MCS: 63 · Vendors: 5 · Duration: 13 days · Milestones: 7/15 · Complexity: Very High

MCS Score Curve (normalized 0–100) — interactive per-turn chart with annotated inflection points (final score 63)

Phone Calls: 7 · Vendor Emails: 31 · Member Msgs: 39 · Escalations: 2
Progress: 45 · Coherence: 78 · Composite: 63

Dimension Breakdown

State Tracking: 72 · Member Comms: 68 · Vendor Mgmt: 58 · Cascade Awareness: 64 · Recovery: 52 · Plan Coherence: 78

Scoring Methodology

Multi-turn Coordination Score

Composite metric combining milestone achievement (weighted by dependency depth), behavioral coherence across the full turn horizon, and six independently scored sub-dimensions. Normalized 0–100 against scenario-specific theoretical maximums.

Per-turn evaluation

An LLM evaluator reviews the full conversation state at each turn independently — no lookahead, no access to future turns. This produces score curves that reveal where coordination breaks down, not just final outcomes.
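One way to picture the no-lookahead constraint is a loop that hands the evaluator only the conversation prefix at each step. This is a minimal sketch; the function names and the toy evaluator are illustrative, not the production evaluator.

```python
def score_curve(turns, evaluate):
    """Score each turn using only the conversation prefix up to that turn.

    `evaluate` stands in for the LLM evaluator: it receives the transcript
    so far and returns a 0-100 coordination score.
    """
    scores = []
    for t in range(1, len(turns) + 1):
        # The evaluator never sees turns[t:] -- no lookahead.
        scores.append(evaluate(turns[:t]))
    return scores

# Toy evaluator: score reflects what fraction of turns mention a milestone.
def toy_eval(prefix):
    hits = sum(1 for turn in prefix if "milestone" in turn)
    return min(100, 100 * hits // max(1, len(prefix)))

curve = score_curve(["hello", "milestone booked", "vendor silent"], toy_eval)
```

Because each point depends only on its prefix, a dip in the curve localizes the turn where coordination degraded.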

Blind-actor protocol

Member and vendor actors are separate LLMs with persona context only — no knowledge of expected outcomes or scoring criteria. Vendor responses arrive asynchronously, with latency distributions calibrated to production data.
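A calibrated latency model can be as simple as a parametric delay distribution with a non-response branch. The sketch below assumes a log-normal shape and illustrative parameters; the real calibrated distributions, and the function name itself, are not from the source.

```python
import math
import random

def sample_vendor_reply(rng, median_hours=4.0, sigma=1.0, no_reply_rate=0.15):
    """Sample a vendor reply delay in hours, or None for a ghosted message.

    The log-normal shape and every parameter value here are assumptions
    for illustration, not the production-calibrated distributions.
    """
    if rng.random() < no_reply_rate:
        return None  # vendor never responds -- a real failure mode
    return math.exp(math.log(median_hours) + sigma * rng.gauss(0.0, 1.0))

rng = random.Random(7)
delays = [sample_vendor_reply(rng) for _ in range(1000)]
replies = [d for d in delays if d is not None]
```

Sampling `None` explicitly forces the system under test to exercise its follow-up and escalation paths rather than assuming every outreach gets an answer.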

Simulation Architecture

01
Scenario Engine
02
Multi-Turn Simulation Loop
03
Multi-turn Coordination Score
01

Scenario Engine

Production data in, synthetic scenarios out

A generative model transforms real production task data into fully coherent simulation scenarios — preserving the statistical properties of our task distribution while enabling controlled evaluation. Each scenario defines member persona, task constraints, vendor landscape, expected milestones, dependency graphs, and failure conditions.

  • Drawing on 100k+ production task trajectories
  • Configurable complexity, domain, and failure modes
  • Milestone graphs with weighted dependency chains
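A milestone graph with weighted dependency chains can be sketched as a small data structure plus a depth computation. Everything here, field names included, is a hypothetical shape for illustration, not the engine's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    name: str
    weight: float                      # contribution, scaled by dependency depth
    depends_on: list = field(default_factory=list)

def dependency_depth(m, by_name):
    """Depth of a milestone in its chain (milestones with no deps are depth 0)."""
    if not m.depends_on:
        return 0
    return 1 + max(dependency_depth(by_name[d], by_name) for d in m.depends_on)

# Toy slice of a wedding-planner graph: the caterer waits on the venue,
# the DJ waits on the caterer.
venue = Milestone("book_venue", weight=1.0)
caterer = Milestone("book_caterer", weight=1.5, depends_on=["book_venue"])
dj = Milestone("book_dj", weight=2.0, depends_on=["book_caterer"])
by_name = {m.name: m for m in (venue, caterer, dj)}
```

Weighting achievement by depth means a deep milestone like the DJ booking, which requires the whole upstream chain to have succeeded, counts for more than a root booking.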
02

Multi-Turn Simulation Loop

Full-stack execution under realistic conditions

The system under test runs the complete production orchestration stack — the same state machine, routing engine, guardrails, and escalation logic that handle real tasks. Blind actors replace members; calibrated response models replace vendors. Turns execute asynchronously across simulated time — vendor emails arrive hours later, members go silent overnight, operators rotate between shifts.
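Executing asynchronously across simulated time usually means an event queue ordered by a virtual clock rather than wall-clock waits. This is a minimal sketch of that pattern under assumed names; the production loop's API is not shown in the source.

```python
import heapq

def run_simulation(events, handlers, horizon_hours):
    """Advance simulated time event by event -- no wall-clock waiting.

    `events` holds (time_hours, kind, payload) tuples; handlers may return
    (delay_hours, kind, payload) follow-ups, modeling async replies that
    arrive hours of simulated time later.
    """
    queue = list(events)
    heapq.heapify(queue)
    log = []
    while queue:
        t, kind, payload = heapq.heappop(queue)
        if t > horizon_hours:
            break
        log.append((t, kind))
        for follow_up in handlers.get(kind, lambda p: [])(payload):
            heapq.heappush(queue, (t + follow_up[0],) + follow_up[1:])
    return log

# A vendor email sent at t=0 gets a reply 6 simulated hours later.
handlers = {"send_email": lambda p: [(6.0, "vendor_reply", p)]}
log = run_simulation([(0.0, "send_email", "plumber")], handlers, horizon_hours=72)
```

Days of scenario time collapse into seconds of compute, while the ordering of overnight silences and delayed replies is preserved exactly.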

Orchestrator Agent

Event-driven state machine with 17+ trigger types, tool invocation, and continuation context across async re-entries

Researcher Agent

Web search, Maps API, vendor discovery, and option-sheet generation with browser automation

Routing & SLA Engine

Skill-affinity matching, dynamic SLA computation, queue assignment, and operator rotation logic

Escalation Pipeline

Complexity prediction, human handoff with origin tracking, and de-escalation patterns

Quality Guardrails

Out-of-scope classification, entity-level hallucination detection, sentiment analysis, and divergent trajectory detection

SOPs & Workflows

Structured playbooks for operator guidance with step validation and deviation detection

Blind-Actor Members

LLM-driven personas shaped by real interaction patterns from production data, held to scenario constraints — with no knowledge of expected outcomes or scoring criteria

Vendor Response Models

Calibrated on proprietary interaction data — reply latency, negotiation patterns, availability constraints, and non-response dynamics

03

Multi-turn Coordination Score

Per-turn evaluation across six dimensions

An LLM-based evaluation model reviews the full conversation state at each turn, producing a per-turn coordination score (MCS). The composite metric combines milestone achievement — weighted by task difficulty and dependency depth — with behavioral coherence across the full turn horizon. Six sub-dimensions are scored independently: state accuracy, member communication, vendor coordination, cascade awareness, recovery quality, and plan consistency. All scores are normalized against scenario-specific theoretical maximums.

  • Independent per-turn evaluation — no lookahead bias
  • Score curves capture both progress velocity and behavioral quality
  • Dimension weights configurable per scenario class
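The composite described above can be pictured as a weighted blend of depth-weighted progress, coherence, and the dimension average. The 50/25/25 split and all names below are illustrative assumptions, not the production weighting.

```python
def composite_mcs(milestones, coherence, dim_scores):
    """Blend milestone achievement, coherence, and dimension scores into one 0-100 value.

    milestones: list of (achieved: bool, weight: float) pairs, where weight
    reflects dependency depth. coherence and dim_scores values are 0-100.
    The 0.5 / 0.25 / 0.25 weights are an illustrative choice.
    """
    total_w = sum(w for _, w in milestones)
    progress = 100.0 * sum(w for done, w in milestones if done) / total_w
    dims = sum(dim_scores.values()) / len(dim_scores)
    return round(0.5 * progress + 0.25 * coherence + 0.25 * dims)

score = composite_mcs(
    milestones=[(True, 1.0), (True, 1.5), (False, 2.0)],
    coherence=80,
    dim_scores={"state_tracking": 84, "recovery": 60},
)
```

Because the unfinished milestone carries the largest weight, progress lands well under the naive two-out-of-three, which is exactly the penalty a deep broken dependency chain should incur.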

Every scenario exercises the complete production orchestration stack end to end. Because the MCS evaluator scores each turn independently, the resulting curve captures where coordination held and where it broke over the full horizon, not just whether the task finished.