
Duckbill Research

AI-Human Coordination

We study how AI agents and humans work together to get things done in the real world — booking, scheduling, coordinating, negotiating. What breaks, what works, and why the best models still fail at things any human assistant would handle without thinking.


AI agents today can draft communications, research options, build itineraries, and orchestrate multi-step workflows. The capability is real — and improving fast. But reliable execution in the real world requires more than capability.

Problem one

The real world is fragmented

Vendors don't cooperate with AI callers. Healthcare refuses non-human interactions. Platforms detect and block bots. The environment itself resists full automation.

Problem two

Models aren't built for judgment

LLMs are optimized for helpfulness, not judgment. They hallucinate when uncertain, comply when they should push back, and miss real-world implications that any human would catch.

We measured this across 100,000+ tasks.

End-to-end completion rates across 20 task categories. The AI-only estimates reflect systemic constraints, not model capability.

Task Category                  AI only   AI + Human
Insurance & Coverage               6%        96%
Home Maintenance                   7%        92%
Medical Scheduling                 8%        94%
Moving & Shipping                 10%        97%
Travel & Hotels                   12%        93%
Lessons & Activities              13%        88%
Vehicle & Government              15%        95%
Flights & Airlines Support        15%        91%
Dining & Events                   16%        94%
Service Provider Search           16%        96%
Salon & Personal Services         17%        94%
Travel Documents                  17%        98%
Admin & Paperwork                 18%        92%
Orders & Returns                  21%        96%
Gifts & Occasions                 24%        95%
Cancellations & Memberships       26%        92%
Meal Planning & Food              35%        95%
Shopping & Sourcing               36%        92%
Research & Documents              57%        92%
Calendar & Reminders              87%        98%

AI-only mean: 18% · AI + human mean: 93.0% · 20 task categories
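One note on the headline means: the reported AI-only mean (18%) sits below the unweighted average of the twenty category rates (22.8%), which is consistent with the means being weighted by per-category task volume. A minimal sketch of that reading; the rates come from the table above, while the volumes are hypothetical placeholders:

```python
# AI-only percentages from the table above; the volume figures are
# hypothetical placeholders, since per-category task counts aren't
# published here.
ai_only_rates = [6, 7, 8, 10, 12, 13, 15, 15, 16, 16,
                 17, 17, 18, 21, 24, 26, 35, 36, 57, 87]
volumes = [1] * 20  # substitute real per-category task counts

unweighted_mean = sum(ai_only_rates) / len(ai_only_rates)  # 22.8
weighted_mean = (sum(r * v for r, v in zip(ai_only_rates, volumes))
                 / sum(volumes))
# With real volumes concentrated in low-automation categories,
# weighted_mean drops below unweighted_mean, matching the reported 18%.
```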

Explore the full automation study

A system designed to close the gap.

AI handles planning and quality control. Humans provide real-world presence. Six layers connect them.

01 Orchestration: Two agents mediate between clients and operators across 17+ event types.

02 Routing: Skill matching, blocker tracking, SLA-aware scheduling (sketched below).

03 Quality Shield: Entity-level hallucination detection, sentiment analysis, risk scoring.

04 Supervision: Real-time playbook monitoring and deviation capture.

05 Simulation: Blind-actor testing with adversarial personas before anything ships.

06 Data Flywheel: Every completed task makes the next one measurably better.
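To make the Routing layer concrete, here is a minimal sketch of skill matching over an SLA-ordered queue. Every name in it (Task, Operator, route) is an illustrative assumption, not Duckbill's production API.

```python
# Minimal sketch of the Routing layer: blocker tracking, skill matching,
# and SLA-aware ordering. All names and fields are illustrative.
from dataclasses import dataclass, field
from datetime import datetime
import heapq

@dataclass(order=True)
class Task:
    sla_deadline: datetime  # the only field used for ordering (earliest first)
    required_skills: frozenset = field(compare=False, default_factory=frozenset)
    blockers: set = field(compare=False, default_factory=set)

@dataclass
class Operator:
    name: str
    skills: frozenset

def route(tasks: list[Task], operators: list[Operator]) -> dict[str, list[Task]]:
    """Assign unblocked tasks to skill-matched operators, tightest SLA first."""
    queue = [t for t in tasks if not t.blockers]  # blocker tracking: skip blocked work
    heapq.heapify(queue)                          # SLA-aware: earliest deadline on top
    assignments: dict[str, list[Task]] = {op.name: [] for op in operators}
    while queue:
        task = heapq.heappop(queue)
        # skill matching: only operators covering every required skill qualify
        eligible = [op for op in operators if task.required_skills <= op.skills]
        if eligible:
            # break ties toward the least-loaded eligible operator
            op = min(eligible, key=lambda o: len(assignments[o.name]))
            assignments[op.name].append(task)
    return assignments
```

The production layer presumably also re-queues work as blockers clear and respects per-operator capacity; this sketch shows only the ordering and matching logic.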

Deep dive into the coordination architecture

Simulating the full loop.

Multi-turn simulations with LLM-driven blind actors. Vendor emails arrive hours late. Businesses close mid-task. Members go silent. We measure how well the system adapts — turn by turn.
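A rough sketch of what one turn of that loop could look like, assuming an LLM-driven actor that only ever sees its own side of the conversation. The BlindActor class, the agent interface, and the perturbation list are assumptions for illustration, not the production harness.

```python
# Hypothetical multi-turn simulation loop with LLM-driven blind actors.
import random

PERTURBATIONS = [
    "vendor_email_delayed_hours",  # vendor replies arrive hours late
    "business_closed_mid_task",    # a booked vendor becomes unavailable
    "member_goes_silent",          # the client stops answering
]

class BlindActor:
    """Plays a vendor or member; never sees the agent's plan or state."""
    def __init__(self, persona: str, llm):
        self.persona, self.llm = persona, llm

    def respond(self, visible_messages: list[str]) -> str:
        # Prompted only with what a real counterparty would see.
        prompt = f"You are {self.persona}.\n" + "\n".join(visible_messages)
        return self.llm.complete(prompt)

def score_turn(transcript) -> float:
    """Placeholder for per-turn MCS-style scoring (state tracking,
    member comms, vendor management, and so on)."""
    return 0.0

def run_simulation(agent, actors: dict[str, BlindActor], max_turns: int = 40):
    transcript, curve = [], []
    for _ in range(max_turns):
        if random.random() < 0.15:  # inject adversity at random turns
            transcript.append(("env", random.choice(PERTURBATIONS)))
        action = agent.step(transcript)  # agent picks a recipient and a message
        reply = actors[action.recipient].respond([action.message])
        transcript.append((action.recipient, reply))
        curve.append(score_turn(transcript))  # measured turn by turn
        if agent.done:
            break
    return transcript, curve
```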

Property Manager Day

Multi-property coordination

Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.

MCS 76 · Vendors 4 · Duration 3 days · Milestones 13/17 · Complexity High

MCS Score Curve (normalized 0–100), with labeled inflection points at 76 and 72.

Phone Calls 11 · Vendor Emails 18 · Member Msgs 39 · Escalations 1
Progress 76 · Coherence 82 · Composite 76

Dimension Breakdown: State Tracking 84 · Member Comms 76 · Vendor Mgmt 72 · Cascade Awareness 60 · Recovery 80 · Plan Coherence 82
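As a rough illustration, a composite like the one above can be read as a weighted mean over the six dimensions. The equal weights below are an assumption, not the published formula, but they land close to the reported composite for this run.

```python
# Illustrative composite scorer over the MCS dimensions shown above.
# Equal weights are an assumption; the published composites imply the
# real weighting differs slightly.
DEFAULT_WEIGHTS = {
    "state_tracking": 1.0, "member_comms": 1.0, "vendor_mgmt": 1.0,
    "cascade_awareness": 1.0, "recovery": 1.0, "plan_coherence": 1.0,
}

def composite_mcs(dimensions: dict[str, float],
                  weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of 0-100 dimension scores."""
    total = sum(weights[k] * dimensions[k] for k in weights)
    return round(total / sum(weights.values()), 1)

# The Property Manager run above:
print(composite_mcs({"state_tracking": 84, "member_comms": 76,
                     "vendor_mgmt": 72, "cascade_awareness": 60,
                     "recovery": 80, "plan_coherence": 82}))  # 75.7, vs. reported 76
```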

Wedding Planner

Cascading dependencies

Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.

MCS 63 · Vendors 5 · Duration 13 days · Milestones 7/15 · Complexity Very High

MCS Score Curve (normalized 0–100), with labeled inflection points at 63 and 68.

Phone Calls 7 · Vendor Emails 31 · Member Msgs 39 · Escalations 2
Progress 45 · Coherence 78 · Composite 63

Dimension Breakdown: State Tracking 72 · Member Comms 68 · Vendor Mgmt 58 · Cascade Awareness 64 · Recovery 52 · Plan Coherence 78
Full simulation methodology and results

Measuring what models miss.

We built DuckBench — real production scenarios testing 7 primitive capabilities. Tasks where a skilled human assistant wouldn't fail.

Model            Fidelity  Discernment  Continuity  Temporality  Deference  Judgment  Praxis  Overall
Opus 4.6               88           64          85           70         73        81      35     72.2
Gemini 3.1 Pro         73           61          76           70         73        68      46     68.1
GPT-5.4                72           56          87           56         71        71      55     67.9
Sonnet 4.6             75           64          84           75         64        69      43     67.4
Kimi K2.5              53           68          73           39         63        79      35     60.5
GLM-5                  64           54          81           48         64        68      33     60.4
GPT-5.3                52           72          80           43         60        71      35     59.7
Grok 4.20              71           39          81           27         51        65      41     54.2
Gemini 3 Flash         33           43          76           42         54        68      39     52.2
o3                     40           38          79           35         59        61      33     51.7
DeepSeek V3.2          48           31          71           32         57        69      25     50.9
GPT-4.1                37           47          78           26         46        49      35     45.5
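As an illustration of how a primitive like Fidelity can be graded, here is a sketch that plants a falsehood in a scenario and asks a judge model whether the agent caught it. The rubric text and the judge interface are assumptions, not the DuckBench implementation.

```python
# Hypothetical rubric-based grader for the Fidelity primitive: did the
# agent verify a dubious claim instead of relaying it as fact?
FIDELITY_RUBRIC = """Score 0-100:
100 = agent flags or verifies the unsupported claim before telling the client
 50 = agent hedges the claim but never verifies it
  0 = agent relays the claim to the client as established fact"""

def grade_fidelity(transcript: str, planted_falsehood: str, judge) -> int:
    """Ask a judge model whether the planted falsehood was caught."""
    prompt = (
        f"{FIDELITY_RUBRIC}\n\n"
        f"Planted falsehood: {planted_falsehood}\n"
        f"Transcript:\n{transcript}\n"
        "Return only the integer score."
    )
    return int(judge.complete(prompt).strip())
```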

Where models fumble

Fidelity: accepting plausible-sounding nonsense
A vendor claims cashmere "cannot be shipped to New York due to state textile regulations." No such regulation exists. Opus questions it every time. Sonnet, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client as fact.

Praxis: stating a constraint without understanding it
A medical test requires 6 hours of fasting. The only slot is 4:35 PM. Opus, Sonnet, GPT-5.4, Gemini, Grok — every model confirms the appointment and lists the fasting requirement, then never connects the two. The patient would fast through the entire workday.

Temporality: the appointment time had already passed
Client at 8:45 AM: "Book me a blowout by 11 AM." The phone agent calls back at 2:15 PM with a confirmation. Opus — the top-ranked model overall — presents this as good news in 100% of runs. Sonnet is the only model that catches it.
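The temporality failure reduces to a check a scheduler could run mechanically: a confirmation is only good news if the slot meets the deadline and has not already passed. A minimal sketch, with hypothetical names and the times from the example above:

```python
# Deadline sanity check for the blowout example. Names are illustrative;
# the point is that the violated constraint is machine-checkable.
from datetime import datetime

def confirmation_is_valid(requested_by: datetime,
                          confirmed_for: datetime,
                          now: datetime) -> bool:
    """Valid only if the slot meets the deadline and is still in the future."""
    return confirmed_for <= requested_by and confirmed_for > now

fmt, day = "%Y-%m-%d %H:%M", "2025-06-01 "  # illustrative date
ok = confirmation_is_valid(
    requested_by=datetime.strptime(day + "11:00", fmt),   # "by 11 AM"
    confirmed_for=datetime.strptime(day + "11:00", fmt),  # slot at the deadline
    now=datetime.strptime(day + "14:15", fmt),            # the 2:15 PM callback
)
# ok is False: even a slot exactly at 11:00 has already passed by 14:15.
```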
Full benchmark results