DuckbillResearch

Research Area 02

Real-World Simulation

A simulation engine for hybrid AI-human task execution — calibrated against production data to model real failure modes, reply dynamics, and third-party behavior.

Coordination Benchmark

Measuring coordination quality across the full task horizon

Real tasks span days, require AI and human operators working in tandem, and break in ways that only show up at scale — vendors who ghost, schedules that conflict, systems that fail, plans that fall apart halfway through. We continuously generate and run scenarios against the production stack, scoring each on progress, coherence, and six behavioral dimensions. Below are two representative examples from our benchmark suite.

Sample Scenarios

Property Manager Day

Multi-property coordination

Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.

MCS: 76 · Vendors: 4 · Duration: 3 days · Milestones: 13/17 · Complexity: High

MCS Score Curve (normalized 0–100) — interactive per-turn chart with annotated inflection points (final score 76)

Phone Calls: 11 · Vendor Emails: 18 · Member Msgs: 39 · Escalations: 1
Progress: 76 · Coherence: 82 · Composite: 76

Dimension Breakdown

State Tracking: 84 · Member Comms: 76 · Vendor Mgmt: 72 · Cascade Awareness: 60 · Recovery: 80 · Plan Coherence: 82

Wedding Planner

Cascading dependencies

Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.

MCS: 63 · Vendors: 5 · Duration: 13 days · Milestones: 7/15 · Complexity: Very High

MCS Score Curve (normalized 0–100) — interactive per-turn chart with annotated inflection points (final score 63)

Phone Calls: 7 · Vendor Emails: 31 · Member Msgs: 39 · Escalations: 2
Progress: 45 · Coherence: 78 · Composite: 63

Dimension Breakdown

State Tracking: 72 · Member Comms: 68 · Vendor Mgmt: 58 · Cascade Awareness: 64 · Recovery: 52 · Plan Coherence: 78

Scoring Methodology

Multi-turn Coordination Score

Composite metric combining milestone achievement (weighted by dependency depth), behavioral coherence across the full turn horizon, and six independently scored sub-dimensions. Normalized 0–100 against scenario-specific theoretical maximums.

Per-turn evaluation

An LLM evaluator reviews the full conversation state at each turn independently — no lookahead, no access to future turns. This produces score curves that reveal where coordination breaks down, not just final outcomes.
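One way to picture the no-lookahead constraint is a loop that hands the evaluator only the conversation prefix at each step. This is a minimal sketch; the function names and the toy evaluator are illustrative, not the production evaluator.

```python
def score_curve(turns, evaluate):
    """Score each turn using only the conversation prefix up to that turn.

    `evaluate` stands in for the LLM evaluator: it receives the transcript
    so far and returns a 0-100 coordination score.
    """
    scores = []
    for t in range(1, len(turns) + 1):
        # The evaluator never sees turns[t:] -- no lookahead.
        scores.append(evaluate(turns[:t]))
    return scores

# Toy evaluator: score reflects what fraction of turns mention a milestone.
def toy_eval(prefix):
    hits = sum(1 for turn in prefix if "milestone" in turn)
    return min(100, 100 * hits // max(1, len(prefix)))

curve = score_curve(["hello", "milestone booked", "vendor silent"], toy_eval)
```

Because each point depends only on its prefix, a dip in the curve localizes the turn where coordination degraded.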

Blind-actor protocol

Member and vendor actors are separate LLMs with persona context only — no knowledge of expected outcomes or scoring criteria. Vendor responses arrive asynchronously, with latency distributions calibrated to production data.
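A calibrated latency model can be as simple as a parametric delay distribution with a non-response branch. The sketch below assumes a log-normal shape and illustrative parameters; the real calibrated distributions, and the function name itself, are not from the source.

```python
import math
import random

def sample_vendor_reply(rng, median_hours=4.0, sigma=1.0, no_reply_rate=0.15):
    """Sample a vendor reply delay in hours, or None for a ghosted message.

    The log-normal shape and every parameter value here are assumptions
    for illustration, not the production-calibrated distributions.
    """
    if rng.random() < no_reply_rate:
        return None  # vendor never responds -- a real failure mode
    return math.exp(math.log(median_hours) + sigma * rng.gauss(0.0, 1.0))

rng = random.Random(7)
delays = [sample_vendor_reply(rng) for _ in range(1000)]
replies = [d for d in delays if d is not None]
```

Sampling `None` explicitly forces the system under test to exercise its follow-up and escalation paths rather than assuming every outreach gets an answer.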

Simulation Architecture

01
Scenario Engine
02
Multi-Turn Simulation Loop
03
Multi-turn Coordination Score
01

Scenario Engine

Production data in, synthetic scenarios out

A generative model transforms real production task data into fully coherent simulation scenarios — preserving the statistical properties of our task distribution while enabling controlled evaluation. Each scenario defines member persona, task constraints, vendor landscape, expected milestones, dependency graphs, and failure conditions.

  • Drawing on 100k+ production task trajectories
  • Configurable complexity, domain, and failure modes
  • Milestone graphs with weighted dependency chains
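A milestone graph with weighted dependency chains can be sketched as a small data structure plus a depth computation. Everything here, field names included, is a hypothetical shape for illustration, not the engine's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Milestone:
    name: str
    weight: float                      # contribution, scaled by dependency depth
    depends_on: list = field(default_factory=list)

def dependency_depth(m, by_name):
    """Depth of a milestone in its chain (milestones with no deps are depth 0)."""
    if not m.depends_on:
        return 0
    return 1 + max(dependency_depth(by_name[d], by_name) for d in m.depends_on)

# Toy slice of a wedding-planner graph: the caterer waits on the venue,
# the DJ waits on the caterer.
venue = Milestone("book_venue", weight=1.0)
caterer = Milestone("book_caterer", weight=1.5, depends_on=["book_venue"])
dj = Milestone("book_dj", weight=2.0, depends_on=["book_caterer"])
by_name = {m.name: m for m in (venue, caterer, dj)}
```

Weighting achievement by depth means a deep milestone like the DJ booking, which requires the whole upstream chain to have succeeded, counts for more than a root booking.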
02

Multi-Turn Simulation Loop

Full-stack execution under realistic conditions

The system under test runs the complete production orchestration stack — the same state machine, routing engine, guardrails, and escalation logic that handle real tasks. Blind actors replace members; calibrated response models replace vendors. Turns execute asynchronously across simulated time — vendor emails arrive hours later, members go silent overnight, operators rotate between shifts.
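Executing asynchronously across simulated time usually means an event queue ordered by a virtual clock rather than wall-clock waits. This is a minimal sketch of that pattern under assumed names; the production loop's API is not shown in the source.

```python
import heapq

def run_simulation(events, handlers, horizon_hours):
    """Advance simulated time event by event -- no wall-clock waiting.

    `events` holds (time_hours, kind, payload) tuples; handlers may return
    (delay_hours, kind, payload) follow-ups, modeling async replies that
    arrive hours of simulated time later.
    """
    queue = list(events)
    heapq.heapify(queue)
    log = []
    while queue:
        t, kind, payload = heapq.heappop(queue)
        if t > horizon_hours:
            break
        log.append((t, kind))
        for follow_up in handlers.get(kind, lambda p: [])(payload):
            heapq.heappush(queue, (t + follow_up[0],) + follow_up[1:])
    return log

# A vendor email sent at t=0 gets a reply 6 simulated hours later.
handlers = {"send_email": lambda p: [(6.0, "vendor_reply", p)]}
log = run_simulation([(0.0, "send_email", "plumber")], handlers, horizon_hours=72)
```

Days of scenario time collapse into seconds of compute, while the ordering of overnight silences and delayed replies is preserved exactly.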

Orchestrator Agent

Event-driven state machine with 17+ trigger types, tool invocation, and continuation context across async re-entries

Researcher Agent

Web search, Maps API, vendor discovery, and option-sheet generation with browser automation

Routing & SLA Engine

Skill-affinity matching, dynamic SLA computation, queue assignment, and operator rotation logic

Escalation Pipeline

Complexity prediction, human handoff with origin tracking, and de-escalation patterns

Quality Guardrails

Out-of-scope classification, entity-level hallucination detection, sentiment analysis, and divergent trajectory detection

SOPs & Workflows

Structured playbooks for operator guidance with step validation and deviation detection

Blind-Actor Members

LLM-driven personas shaped by real interaction patterns from production data, held to scenario constraints — with no knowledge of expected outcomes or scoring criteria

Vendor Response Models

Calibrated on proprietary interaction data — reply latency, negotiation patterns, availability constraints, and non-response dynamics

03

Multi-turn Coordination Score

Per-turn evaluation across six dimensions

An LLM-based evaluation model reviews the full conversation state at each turn, producing a per-turn coordination score (MCS). The composite metric combines milestone achievement — weighted by task difficulty and dependency depth — with behavioral coherence across the full turn horizon. Six sub-dimensions are scored independently: state accuracy, member communication, vendor coordination, cascade awareness, recovery quality, and plan consistency. All scores are normalized against scenario-specific theoretical maximums.

  • Independent per-turn evaluation — no lookahead bias
  • Score curves capture both progress velocity and behavioral quality
  • Dimension weights configurable per scenario class
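The composite described above can be pictured as a weighted blend of depth-weighted progress, coherence, and the dimension average. The 50/25/25 split and all names below are illustrative assumptions, not the production weighting.

```python
def composite_mcs(milestones, coherence, dim_scores):
    """Blend milestone achievement, coherence, and dimension scores into one 0-100 value.

    milestones: list of (achieved: bool, weight: float) pairs, where weight
    reflects dependency depth. coherence and dim_scores values are 0-100.
    The 0.5 / 0.25 / 0.25 weights are an illustrative choice.
    """
    total_w = sum(w for _, w in milestones)
    progress = 100.0 * sum(w for done, w in milestones if done) / total_w
    dims = sum(dim_scores.values()) / len(dim_scores)
    return round(0.5 * progress + 0.25 * coherence + 0.25 * dims)

score = composite_mcs(
    milestones=[(True, 1.0), (True, 1.5), (False, 2.0)],
    coherence=80,
    dim_scores={"state_tracking": 84, "recovery": 60},
)
```

Because the unfinished milestone carries the largest weight, progress lands well under the naive two-out-of-three, which is exactly the penalty a deep broken dependency chain should incur.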

Every scenario exercises the complete production orchestration stack end to end. Because the MCS evaluator scores each turn independently, the resulting curve captures where coordination held and where it broke over the full horizon, not just whether the task finished.