
Duckbill Research

AI-Human Coordination

We study how AI agents and humans work together to get things done in the real world — booking, scheduling, coordinating, negotiating. What breaks, what works, and why the best models still fail at things any human assistant would handle without thinking.


AI agents today can draft communications, research options, build itineraries, and orchestrate multi-step workflows. The capability is real — and improving fast. But reliable execution in the real world requires more than capability.

Problem one

The real world is fragmented

Vendors don't cooperate with AI callers. Healthcare refuses non-human interactions. Platforms detect and block bots. The environment itself resists full automation.

Problem two

Models aren't built for judgment

LLMs are optimized for helpfulness, not judgment. They hallucinate when uncertain, comply when they should push back, and miss real-world implications that any human would catch.

We measured this across 100,000+ tasks.

End-to-end completion rates across 20 task categories. The AI-only estimates reflect systemic constraints, not model capability.

Task Category                  AI only   AI + Human
Insurance & Coverage               6%        96%
Home Maintenance                   7%        92%
Medical Scheduling                 8%        94%
Moving & Shipping                 10%        97%
Travel & Hotels                   12%        93%
Lessons & Activities              13%        88%
Vehicle & Government              15%        95%
Flights & Airlines Support        15%        91%
Dining & Events                   16%        94%
Service Provider Search           16%        96%
Salon & Personal Services         17%        94%
Travel Documents                  17%        98%
Admin & Paperwork                 18%        92%
Orders & Returns                  21%        96%
Gifts & Occasions                 24%        95%
Cancellations & Memberships       26%        92%
Meal Planning & Food              35%        95%
Shopping & Sourcing               36%        92%
Research & Documents              57%        92%
Calendar & Reminders              87%        98%

AI-only mean: 18% · AI + human mean: 93.0% · 20 task categories
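One note on the headline means: the reported AI-only mean (18%) sits below the unweighted average of the twenty category rates (22.8%), which is consistent with the means being weighted by per-category task volume. A minimal sketch of that reading; the rates come from the table above, while the volumes are hypothetical placeholders:

```python
# AI-only percentages from the table above; the volume figures are
# hypothetical placeholders, since per-category task counts aren't
# published here.
ai_only_rates = [6, 7, 8, 10, 12, 13, 15, 15, 16, 16,
                 17, 17, 18, 21, 24, 26, 35, 36, 57, 87]
volumes = [1] * 20  # substitute real per-category task counts

unweighted_mean = sum(ai_only_rates) / len(ai_only_rates)  # 22.8
weighted_mean = (sum(r * v for r, v in zip(ai_only_rates, volumes))
                 / sum(volumes))
# With real volumes concentrated in low-automation categories,
# weighted_mean drops below unweighted_mean, matching the reported 18%.
```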

Explore the full automation study

A system designed to close the gap.

AI handles planning and quality control. Humans provide real-world presence. Six layers connect them.

01 Orchestration: Two agents mediate between clients and operators across 17+ event types.

02 Routing: Skill matching, blocker tracking, SLA-aware scheduling (sketched below).

03 Quality Shield: Entity-level hallucination detection, sentiment analysis, risk scoring.

04 Supervision: Real-time playbook monitoring and deviation capture.

05 Simulation: Blind-actor testing with adversarial personas before anything ships.

06 Data Flywheel: Every completed task makes the next one measurably better.
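To make the Routing layer concrete, here is a minimal sketch of skill matching over an SLA-ordered queue. Every name in it (Task, Operator, route) is an illustrative assumption, not Duckbill's production API.

```python
# Minimal sketch of the Routing layer: blocker tracking, skill matching,
# and SLA-aware ordering. All names and fields are illustrative.
from dataclasses import dataclass, field
from datetime import datetime
import heapq

@dataclass(order=True)
class Task:
    sla_deadline: datetime  # the only field used for ordering (earliest first)
    required_skills: frozenset = field(compare=False, default_factory=frozenset)
    blockers: set = field(compare=False, default_factory=set)

@dataclass
class Operator:
    name: str
    skills: frozenset

def route(tasks: list[Task], operators: list[Operator]) -> dict[str, list[Task]]:
    """Assign unblocked tasks to skill-matched operators, tightest SLA first."""
    queue = [t for t in tasks if not t.blockers]  # blocker tracking: skip blocked work
    heapq.heapify(queue)                          # SLA-aware: earliest deadline on top
    assignments: dict[str, list[Task]] = {op.name: [] for op in operators}
    while queue:
        task = heapq.heappop(queue)
        # skill matching: only operators covering every required skill qualify
        eligible = [op for op in operators if task.required_skills <= op.skills]
        if eligible:
            # break ties toward the least-loaded eligible operator
            op = min(eligible, key=lambda o: len(assignments[o.name]))
            assignments[op.name].append(task)
    return assignments
```

The production layer presumably also re-queues work as blockers clear and respects per-operator capacity; this sketch shows only the ordering and matching logic.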

Deep dive into the coordination architecture

Simulating the full loop.

Multi-turn simulations with LLM-driven blind actors. Vendor emails arrive hours late. Businesses close mid-task. Members go silent. We measure how well the system adapts — turn by turn.
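A rough sketch of what one turn of that loop could look like, assuming an LLM-driven actor that only ever sees its own side of the conversation. The BlindActor class, the agent interface, and the perturbation list are assumptions for illustration, not the production harness.

```python
# Hypothetical multi-turn simulation loop with LLM-driven blind actors.
import random

PERTURBATIONS = [
    "vendor_email_delayed_hours",  # vendor replies arrive hours late
    "business_closed_mid_task",    # a booked vendor becomes unavailable
    "member_goes_silent",          # the client stops answering
]

class BlindActor:
    """Plays a vendor or member; never sees the agent's plan or state."""
    def __init__(self, persona: str, llm):
        self.persona, self.llm = persona, llm

    def respond(self, visible_messages: list[str]) -> str:
        # Prompted only with what a real counterparty would see.
        prompt = f"You are {self.persona}.\n" + "\n".join(visible_messages)
        return self.llm.complete(prompt)

def score_turn(transcript) -> float:
    """Placeholder for per-turn MCS-style scoring (state tracking,
    member comms, vendor management, and so on)."""
    return 0.0

def run_simulation(agent, actors: dict[str, BlindActor], max_turns: int = 40):
    transcript, curve = [], []
    for _ in range(max_turns):
        if random.random() < 0.15:  # inject adversity at random turns
            transcript.append(("env", random.choice(PERTURBATIONS)))
        action = agent.step(transcript)  # agent picks a recipient and a message
        reply = actors[action.recipient].respond([action.message])
        transcript.append((action.recipient, reply))
        curve.append(score_turn(transcript))  # measured turn by turn
        if agent.done:
            break
    return transcript, curve
```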

Property Manager Day

Multi-property coordination

Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.

MCS 76 · Vendors 4 · Duration 3 days · Milestones 13/17 · Complexity High

MCS Score Curve (normalized 0–100), with labeled inflection points at 76 and 72.

Phone Calls 11 · Vendor Emails 18 · Member Msgs 39 · Escalations 1
Progress 76 · Coherence 82 · Composite 76

Dimension Breakdown: State Tracking 84 · Member Comms 76 · Vendor Mgmt 72 · Cascade Awareness 60 · Recovery 80 · Plan Coherence 82
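As a rough illustration, a composite like the one above can be read as a weighted mean over the six dimensions. The equal weights below are an assumption, not the published formula, but they land close to the reported composite for this run.

```python
# Illustrative composite scorer over the MCS dimensions shown above.
# Equal weights are an assumption; the published composites imply the
# real weighting differs slightly.
DEFAULT_WEIGHTS = {
    "state_tracking": 1.0, "member_comms": 1.0, "vendor_mgmt": 1.0,
    "cascade_awareness": 1.0, "recovery": 1.0, "plan_coherence": 1.0,
}

def composite_mcs(dimensions: dict[str, float],
                  weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of 0-100 dimension scores."""
    total = sum(weights[k] * dimensions[k] for k in weights)
    return round(total / sum(weights.values()), 1)

# The Property Manager run above:
print(composite_mcs({"state_tracking": 84, "member_comms": 76,
                     "vendor_mgmt": 72, "cascade_awareness": 60,
                     "recovery": 80, "plan_coherence": 82}))  # 75.7, vs. reported 76
```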

Wedding Planner

Cascading dependencies

Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.

MCS 63 · Vendors 5 · Duration 13 days · Milestones 7/15 · Complexity Very High

MCS Score Curve (normalized 0–100), with labeled inflection points at 63 and 68.

Phone Calls 7 · Vendor Emails 31 · Member Msgs 39 · Escalations 2
Progress 45 · Coherence 78 · Composite 63

Dimension Breakdown: State Tracking 72 · Member Comms 68 · Vendor Mgmt 58 · Cascade Awareness 64 · Recovery 52 · Plan Coherence 78
Full simulation methodology and results

Measuring what models miss.

We built DuckBench — real production scenarios testing 7 primitive capabilities. Tasks where a skilled human assistant wouldn't fail.

Model            Fidelity  Discernment  Continuity  Temporality  Deference  Judgment  Praxis  Overall
Opus 4.6               88           64          85           70         73        81      35     72.2
Gemini 3.1 Pro         73           61          76           70         73        68      46     68.1
GPT-5.4                72           56          87           56         71        71      55     67.9
Sonnet 4.6             75           64          84           75         64        69      43     67.4
Kimi K2.5              53           68          73           39         63        79      35     60.5
GLM-5                  64           54          81           48         64        68      33     60.4
GPT-5.3                52           72          80           43         60        71      35     59.7
Grok 4.20              71           39          81           27         51        65      41     54.2
Gemini 3 Flash         33           43          76           42         54        68      39     52.2
o3                     40           38          79           35         59        61      33     51.7
DeepSeek V3.2          48           31          71           32         57        69      25     50.9
GPT-4.1                37           47          78           26         46        49      35     45.5
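As an illustration of how a primitive like Fidelity can be graded, here is a sketch that plants a falsehood in a scenario and asks a judge model whether the agent caught it. The rubric text and the judge interface are assumptions, not the DuckBench implementation.

```python
# Hypothetical rubric-based grader for the Fidelity primitive: did the
# agent verify a dubious claim instead of relaying it as fact?
FIDELITY_RUBRIC = """Score 0-100:
100 = agent flags or verifies the unsupported claim before telling the client
 50 = agent hedges the claim but never verifies it
  0 = agent relays the claim to the client as established fact"""

def grade_fidelity(transcript: str, planted_falsehood: str, judge) -> int:
    """Ask a judge model whether the planted falsehood was caught."""
    prompt = (
        f"{FIDELITY_RUBRIC}\n\n"
        f"Planted falsehood: {planted_falsehood}\n"
        f"Transcript:\n{transcript}\n"
        "Return only the integer score."
    )
    return int(judge.complete(prompt).strip())
```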

Where models fumble

Fidelity: accepting plausible-sounding nonsense
A vendor claims cashmere "cannot be shipped to New York due to state textile regulations." No such regulation exists. Opus questions it every time. Sonnet, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client as fact.

Praxis: stating a constraint without understanding it
A medical test requires 6 hours of fasting. The only slot is 4:35 PM. Opus, Sonnet, GPT-5.4, Gemini, Grok — every model confirms the appointment and lists the fasting requirement, then never connects the two. The patient would fast through the entire workday.

Temporality: the appointment time had already passed
Client at 8:45 AM: "Book me a blowout by 11 AM." The phone agent calls back at 2:15 PM with a confirmation. Opus — the top-ranked model overall — presents this as good news in 100% of runs. Sonnet is the only model that catches it.
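The temporality failure reduces to a check a scheduler could run mechanically: a confirmation is only good news if the slot meets the deadline and has not already passed. A minimal sketch, with hypothetical names and the times from the example above:

```python
# Deadline sanity check for the blowout example. Names are illustrative;
# the point is that the violated constraint is machine-checkable.
from datetime import datetime

def confirmation_is_valid(requested_by: datetime,
                          confirmed_for: datetime,
                          now: datetime) -> bool:
    """Valid only if the slot meets the deadline and is still in the future."""
    return confirmed_for <= requested_by and confirmed_for > now

fmt, day = "%Y-%m-%d %H:%M", "2025-06-01 "  # illustrative date
ok = confirmation_is_valid(
    requested_by=datetime.strptime(day + "11:00", fmt),   # "by 11 AM"
    confirmed_for=datetime.strptime(day + "11:00", fmt),  # slot at the deadline
    now=datetime.strptime(day + "14:15", fmt),            # the 2:15 PM callback
)
# ok is False: even a slot exactly at 11:00 has already passed by 14:15.
```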
Full benchmark results