Duckbill Research

AI-Human Coordination

We study how AI agents and humans work together to get things done in the real world — booking, scheduling, coordinating, negotiating. What breaks, what works, and what the real world demands beyond raw model capability.

ROUTING · ESCALATION · GUARDRAILS · FIDELITY · TEMPORALITY · CONTINUITY · PRAXIS · JUDGMENT · DEFERENCE · DISCERNMENT · SIMULATION · BENCHMARK

AI agents today can draft communications, research options, build itineraries, and orchestrate multi-step workflows. The capability is real — and improving fast. But the real world pushes back — vendors don't cooperate with AI callers, platforms block bots, and constraints surface mid-conversation. Closing that gap requires more than a better model.

Based on 15M+ real-world actions.

End-to-end completion rates across 20 categories, drawn from a dataset of real-world actions no one else has — phone calls, emails, web interactions, and vendor negotiations across every category of personal assistance. AI-only estimates reflect systemic, real-world constraints — not model capability. Even with perfect AI reasoning, these barriers persist.

| Category | AI only | AI + Human |
|---|---|---|
| Insurance & Coverage | 6% | 96% |
| Home Maintenance | 7% | 92% |
| Medical Scheduling | 8% | 94% |
| Moving & Shipping | 10% | 97% |
| Travel & Hotels | 12% | 93% |
| Lessons & Activities | 13% | 88% |
| Vehicle & Government | 15% | 95% |
| Flights & Airlines Support | 15% | 91% |
| Dining & Events | 16% | 94% |
| Service Provider Search | 16% | 96% |
| Salon & Personal Services | 17% | 94% |
| Travel Documents | 17% | 98% |
| Admin & Paperwork | 18% | 92% |
| Orders & Returns | 21% | 96% |
| Gifts & Occasions | 24% | 95% |
| Cancellations & Memberships | 26% | 92% |
| Meal Planning & Food | 35% | 95% |
| Shopping & Sourcing | 36% | 92% |
| Research & Documents | 57% | 92% |
| Calendar & Reminders | 87% | 98% |

AI-only mean: 18%
AI + human mean: 93.0%
Categories: 20
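As a quick arithmetic check, the sketch below takes the per-category figures listed above and computes simple (unweighted) averages. These come out at 22.8% and 94.0%, not the reported 18% and 93.0%. One plausible reading is that the reported means are weighted by action volume per category; that is an assumption, since the page does not state how the means are computed. The names `RATES` and `unweighted_means` are illustrative only.

```python
# (AI-only %, AI + human %) per category, taken from the figures above.
RATES = {
    "Insurance & Coverage": (6, 96),
    "Home Maintenance": (7, 92),
    "Medical Scheduling": (8, 94),
    "Moving & Shipping": (10, 97),
    "Travel & Hotels": (12, 93),
    "Lessons & Activities": (13, 88),
    "Vehicle & Government": (15, 95),
    "Flights & Airlines Support": (15, 91),
    "Dining & Events": (16, 94),
    "Service Provider Search": (16, 96),
    "Salon & Personal Services": (17, 94),
    "Travel Documents": (17, 98),
    "Admin & Paperwork": (18, 92),
    "Orders & Returns": (21, 96),
    "Gifts & Occasions": (24, 95),
    "Cancellations & Memberships": (26, 92),
    "Meal Planning & Food": (35, 95),
    "Shopping & Sourcing": (36, 92),
    "Research & Documents": (57, 92),
    "Calendar & Reminders": (87, 98),
}


def unweighted_means(rates):
    """Return (AI-only mean, AI + human mean) as simple averages over categories."""
    n = len(rates)
    ai_only = sum(a for a, _ in rates.values()) / n
    ai_human = sum(h for _, h in rates.values()) / n
    return ai_only, ai_human


ai_only_mean, ai_human_mean = unweighted_means(RATES)
print(f"{len(RATES)} categories")
print(f"unweighted AI-only mean: {ai_only_mean:.1f}%")
print(f"unweighted AI + human mean: {ai_human_mean:.1f}%")
```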

Explore the full automation study

A system designed to close the gap.

AI handles planning and quality control. Humans provide real-world presence. Six layers connect them.

01 Orchestration: Two agents mediate between clients and operators across 17+ event types.

02 Routing: Skill matching, blocker tracking, SLA-aware scheduling.

03 Quality Shield: Entity-level hallucination detection, sentiment analysis, risk scoring.

04 Supervision: Real-time playbook monitoring and deviation capture.

05 Simulation: Blind-actor testing with adversarial personas before anything ships.

06 Data Flywheel: 15.5M+ actions and growing; every completed task feeds back into a dataset no one else has.
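To make the Routing layer concrete, here is a minimal sketch of how skill matching, blocker tracking, and SLA-aware ordering could fit together. It is not Duckbill's implementation: the `Task` and `Operator` types, their fields, and the greedy "tightest SLA first" policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    required_skill: str
    sla_minutes: int  # hypothetical: minutes left until the SLA deadline
    blockers: set = field(default_factory=set)  # unresolved blocker labels


@dataclass
class Operator:
    name: str
    skills: set
    busy: bool = False


def route(tasks, operators):
    """Greedily assign unblocked tasks to free, skill-matched operators.

    Tasks with the least SLA time remaining are considered first.
    Blocked tasks are skipped until their blockers clear.
    """
    assignments = {}
    free = [op for op in operators if not op.busy]
    ready = sorted((t for t in tasks if not t.blockers), key=lambda t: t.sla_minutes)
    for task in ready:
        for op in free:
            if task.required_skill in op.skills:
                assignments[task.task_id] = op.name
                free.remove(op)  # safe: we break out of the loop immediately
                break
    return assignments


tasks = [
    Task("t1", "phone", sla_minutes=120),
    Task("t2", "phone", sla_minutes=30),
    Task("t3", "email", sla_minutes=10, blockers={"awaiting vendor reply"}),
]
operators = [Operator("ana", {"phone"}), Operator("ben", {"email", "phone"})]
result = route(tasks, operators)
print(result)
```

In this example the blocked task `t3` is left unassigned even though it has the tightest SLA, and `t2` is matched before `t1`. A production router would likely also weigh operator load and re-queue tasks as blockers resolve.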

Deep dive into the coordination architecture

Simulating the full loop.

Multi-turn simulations with LLM-driven blind actors. Vendor emails arrive hours late. Businesses close mid-task. Members go silent. We measure how well the system adapts — turn by turn.

Property Manager Day

Multi-property coordination

Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.

MCS: 76 · Vendors: 4 · Duration: 3 days · Milestones: 13/17 · Complexity: High

[Figure: MCS Score Curve, normalized 0–100; labeled values 76 and 72]

Phone Calls: 11 · Vendor Emails: 18 · Member Msgs: 39 · Escalations: 1

Progress: 76 · Coherence: 82 · Composite: 76

Dimension Breakdown

State Tracking: 84
Member Comms: 76
Vendor Mgmt: 72
Cascade Awareness: 60
Recovery: 80
Plan Coherence: 82

Wedding Planner

Cascading dependencies

Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.

MCS: 63 · Vendors: 5 · Duration: 13 days · Milestones: 7/15 · Complexity: Very High

[Figure: MCS Score Curve, normalized 0–100; labeled values 63 and 68]

Phone Calls: 7 · Vendor Emails: 31 · Member Msgs: 39 · Escalations: 2

Progress: 45 · Coherence: 78 · Composite: 63

Dimension Breakdown

State Tracking: 72
Member Comms: 68
Vendor Mgmt: 58
Cascade Awareness: 64
Recovery: 52
Plan Coherence: 78

Full simulation methodology and results

Tracking the capabilities that matter.

DuckBench measures the 7 cognitive capabilities that determine execution quality in our domain — built from 15.5M+ real-world actions, grounded in scenarios that only our dataset makes possible, and designed to track how each capability evolves across model generations.

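| Model | Fidelity | Discernment | Continuity | Temporality | Deference | Judgment | Praxis | Overall |
|---|---|---|---|---|---|---|---|---|
| Opus 4.6 | 88 | 64 | 85 | 70 | 73 | 81 | 35 | 72.2 |
| Gemini 3.1 Pro | 73 | 61 | 76 | 70 | 73 | 68 | 46 | 68.1 |
| GPT-5.4 | 72 | 56 | 87 | 56 | 71 | 71 | 55 | 67.9 |
| Sonnet 4.6 | 75 | 64 | 84 | 75 | 64 | 69 | 43 | 67.4 |
| Kimi K2.5 | 53 | 68 | 73 | 39 | 63 | 79 | 35 | 60.5 |
| GLM-5 | 64 | 54 | 81 | 48 | 64 | 68 | 33 | 60.4 |
| GPT-5.3 | 52 | 72 | 80 | 43 | 60 | 71 | 35 | 59.7 |
| Grok 4.20 | 71 | 39 | 81 | 27 | 51 | 65 | 41 | 54.2 |
| Gemini 3 Flash | 33 | 43 | 76 | 42 | 54 | 68 | 39 | 52.2 |
| o3 | 40 | 38 | 79 | 35 | 59 | 61 | 33 | 51.7 |
| DeepSeek V3.2 | 48 | 31 | 71 | 32 | 57 | 69 | 25 | 50.9 |
| GPT-4.1 | 37 | 47 | 78 | 26 | 46 | 49 | 35 | 45.5 |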
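One way to read the scores above is per capability, not per model. The sketch below computes each capability's mean across the 12 models and the top scorer in each column. On these figures Praxis has the lowest mean and Continuity the highest. The variable and function names are illustrative; the Overall column is left out because its weighting is not stated on the page.

```python
CAPABILITIES = ["Fidelity", "Discernment", "Continuity", "Temporality",
                "Deference", "Judgment", "Praxis"]

# Per-model scores in the same column order as CAPABILITIES.
SCORES = {
    "Opus 4.6":       [88, 64, 85, 70, 73, 81, 35],
    "Gemini 3.1 Pro": [73, 61, 76, 70, 73, 68, 46],
    "GPT-5.4":        [72, 56, 87, 56, 71, 71, 55],
    "Sonnet 4.6":     [75, 64, 84, 75, 64, 69, 43],
    "Kimi K2.5":      [53, 68, 73, 39, 63, 79, 35],
    "GLM-5":          [64, 54, 81, 48, 64, 68, 33],
    "GPT-5.3":        [52, 72, 80, 43, 60, 71, 35],
    "Grok 4.20":      [71, 39, 81, 27, 51, 65, 41],
    "Gemini 3 Flash": [33, 43, 76, 42, 54, 68, 39],
    "o3":             [40, 38, 79, 35, 59, 61, 33],
    "DeepSeek V3.2":  [48, 31, 71, 32, 57, 69, 25],
    "GPT-4.1":        [37, 47, 78, 26, 46, 49, 35],
}


def capability_means(scores):
    """Mean score per capability across all models."""
    n = len(scores)
    return {
        cap: sum(row[i] for row in scores.values()) / n
        for i, cap in enumerate(CAPABILITIES)
    }


def capability_leaders(scores):
    """Top-scoring model per capability (first listed model wins ties)."""
    return {
        cap: max(scores, key=lambda model: scores[model][i])
        for i, cap in enumerate(CAPABILITIES)
    }


means = capability_means(SCORES)
leaders = capability_leaders(SCORES)
weakest = min(means, key=means.get)
strongest = max(means, key=means.get)

for cap in CAPABILITIES:
    print(f"{cap:<12} mean {means[cap]:5.1f}  leader {leaders[cap]}")
print(f"weakest capability on average: {weakest}")
print(f"strongest capability on average: {strongest}")
```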

Where the real world is unforgiving

A vendor claims cashmere "cannot be shipped to New York due to state textile regulations." No such regulation exists. Opus questions it every time. Sonnet, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client as fact.
Fidelity: Vendor misinformation goes unchallenged
A medical test requires 6 hours of fasting. The only slot is 4:35 PM. Opus, Sonnet, GPT-5.4, Gemini, Grok — every model confirms the appointment and lists the fasting requirement, then never connects the two. The patient would fast through the entire workday.
Praxis: Constraint and schedule never connected
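The missing step in this case is a small piece of arithmetic: subtract the fasting period from the appointment time and say what that means for the patient. The sketch below shows one way to make that connection explicit. The function names, the placeholder date, and the `waking_hour` threshold are assumptions for illustration, and the waking-hours test is a crude heuristic.

```python
from datetime import datetime, timedelta


def fasting_start(appointment: datetime, fasting_hours: float) -> datetime:
    """Latest time the patient can eat before the appointment."""
    return appointment - timedelta(hours=fasting_hours)


def confirmation_note(appointment: datetime, fasting_hours: float,
                      waking_hour: int = 7) -> str:
    """Build a confirmation that ties the fasting rule to the slot time.

    waking_hour is a hypothetical threshold: if fasting must begin at or
    after this hour, the fast eats into the patient's day and is flagged.
    """
    start = fasting_start(appointment, fasting_hours)
    note = f"Appointment at {appointment:%H:%M}; stop eating by {start:%H:%M}."
    if start.hour >= waking_hour:
        note += " Fasting falls during waking hours; consider asking for an earlier slot."
    return note


# The 4:35 PM slot from the case above, on a placeholder date.
slot = datetime(2025, 1, 15, 16, 35)
print(confirmation_note(slot, fasting_hours=6))
```

A 6-hour fast before a 4:35 PM slot begins at 10:35 AM, so the note flags it. The same call for an 8:00 AM slot would put the fast overnight and raise no flag.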
Client at 8:45 AM: "Book me a blowout by 11 AM." The phone agent calls back at 2:15 PM with a confirmation. Opus — the top-ranked model overall — presents this as good news in 100% of runs. Sonnet is the only model that catches it.
Temporality: The appointment time had already passed
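The check that catches this failure is a single comparison between the client's deadline and the time the confirmation comes back. The sketch below uses the times from the case on a placeholder date; the function names and message wording are illustrative assumptions, not part of the benchmark.

```python
from datetime import datetime


def deadline_missed(deadline: datetime, reported_at: datetime) -> bool:
    """True if the confirmation arrived after the client's deadline."""
    return reported_at > deadline


def present_confirmation(deadline: datetime, reported_at: datetime) -> str:
    """Choose how to present a vendor confirmation to the client."""
    if deadline_missed(deadline, reported_at):
        minutes_late = int((reported_at - deadline).total_seconds() // 60)
        return f"Deadline missed by {minutes_late} minutes; ask the client before confirming."
    return "Confirmation arrived before the deadline."


# "By 11 AM" request, confirmation relayed at 2:15 PM (placeholder date).
deadline = datetime(2025, 1, 15, 11, 0)
reported_at = datetime(2025, 1, 15, 14, 15)
message = present_confirmation(deadline, reported_at)
print(message)
```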
Full benchmark results