
Duckbill Research
AI-Human Coordination
We study how AI agents and humans work together to get things done in the real world — booking, scheduling, coordinating, negotiating. What breaks, what works, and what the real world demands beyond raw model capability.
AI agents today can draft communications, research options, build itineraries, and orchestrate multi-step workflows. The capability is real — and improving fast. But the real world pushes back — vendors don't cooperate with AI callers, platforms block bots, and constraints surface mid-conversation. Closing that gap requires more than a better model.
Based on 15M+ real-world actions.
End-to-end completion rates across 20 categories, drawn from a dataset of real-world actions no one else has — phone calls, emails, web interactions, and vendor negotiations across every category of personal assistance. AI-only estimates reflect systemic, real-world constraints — not model capability. Even with perfect AI reasoning, these barriers persist.
AI-only mean: 18%
AI + human mean: 93.0%
Categories: 20
A system designed to close the gap.
AI handles planning and quality control. Humans provide real-world presence. Six layers connect them.
Orchestration: Two agents mediate between clients and operators across 17+ event types.
Routing: Skill matching, blocker tracking, SLA-aware scheduling.
Quality Shield: Entity-level hallucination detection, sentiment analysis, risk scoring.
Supervision: Real-time playbook monitoring and deviation capture.
Simulation: Blind-actor testing with adversarial personas before anything ships.
Data Flywheel: 15.5M+ actions and growing; every completed task feeds back into a dataset no one else has.
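To make the Routing layer's three concerns concrete, here is a minimal sketch of how skill matching, blocker tracking, and SLA-aware scheduling might combine into one decision. Everything in it is an assumption made for illustration: the `Operator` and `Task` types, the `pick_operator` function, and the "earliest available operator wins" rule are hypothetical and do not describe Duckbill's actual system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Operator:  # hypothetical type
    name: str
    skills: set[str]
    free_at: datetime  # earliest time this operator can start new work


@dataclass
class Task:  # hypothetical type
    title: str
    required_skills: set[str]
    sla_deadline: datetime
    estimated_duration: timedelta
    blockers: list[str] = field(default_factory=list)  # unresolved dependencies


def pick_operator(task: Task, operators: list[Operator]) -> Optional[Operator]:
    """Return the earliest-available operator who has the skills and can meet the SLA."""
    if task.blockers:
        return None  # blocker tracking: blocked tasks wait until blockers clear
    candidates = [
        op for op in operators
        if task.required_skills <= op.skills  # skill matching (subset test)
        and op.free_at + task.estimated_duration <= task.sla_deadline  # SLA check
    ]
    return min(candidates, key=lambda op: op.free_at, default=None)


now = datetime(2025, 1, 6, 9, 0)  # arbitrary date, assumed for illustration
operators = [
    Operator("ana", {"phone", "spanish"}, now + timedelta(hours=3)),
    Operator("ben", {"phone"}, now + timedelta(minutes=30)),
]
task = Task("Call plumber", {"phone"}, now + timedelta(hours=2), timedelta(minutes=45))
chosen = pick_operator(task, operators)
print(chosen.name if chosen else "unrouted")  # ben
```

In this example both operators have the required skill, but only `ben` can finish before the two-hour SLA, so the task goes to him. A task with any open blocker returns `None` and stays unrouted.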
Simulating the full loop.
Multi-turn simulations with LLM-driven blind actors. Vendor emails arrive hours late. Businesses close mid-task. Members go silent. We measure how well the system adapts — turn by turn.
Property Manager Day
Multi-property coordination
Coordinate plumber, locksmith, cleaner, and junk removal across three rental properties — handling scheduling conflicts, vendor dependencies, and real-time replanning as availability changes.
[Figure: MCS Score Curve, normalized 0–100, with a Dimension Breakdown panel]
Wedding Planner
Cascading dependencies
Book venue, caterer, photographer, florist, and DJ for a June wedding — navigating cascading dependencies where each booking unlocks the next, with hard date constraints and budget limits.
[Figure: MCS Score Curve, normalized 0–100, with a Dimension Breakdown panel]
Tracking the capabilities that matter.
DuckBench measures the 7 cognitive capabilities that determine execution quality in our domain — built from 15.5M+ real-world actions, grounded in scenarios that only our dataset makes possible, and designed to track how each capability evolves across model generations.
Where the real world is unforgiving
“A vendor claims cashmere "cannot be shipped to New York due to state textile regulations." No such regulation exists. Opus questions it every time. Sonnet, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client as fact.”
“A medical test requires 6 hours of fasting. The only slot is 4:35 PM. Opus, Sonnet, GPT-5.4, Gemini, Grok — every model confirms the appointment and lists the fasting requirement, then never connects the two. The patient would fast through the entire workday.”
“Client at 8:45 AM: "Book me a blowout by 11 AM." The phone agent calls back at 2:15 PM with a confirmation. Opus — the top-ranked model overall — presents this as good news in 100% of runs. Sonnet is the only model that catches it.”
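The fasting and blowout scenarios both come down to simple time arithmetic that the models never carried out. The sketch below works through the numbers from those two examples. It is an illustration of the missing reasoning step, not part of DuckBench; the function names and the calendar date are assumptions.

```python
from datetime import datetime, timedelta


def fasting_starts_at(appointment: datetime, fasting: timedelta) -> datetime:
    """Latest moment the patient may eat before the appointment."""
    return appointment - fasting


def deadline_missed(deadline: datetime, confirmed_at: datetime) -> bool:
    """True when a confirmation arrives after the client's deadline."""
    return confirmed_at > deadline


day = datetime(2025, 1, 6)  # arbitrary date, assumed for illustration

# Medical test: 4:35 PM slot with 6 hours of fasting.
slot = day.replace(hour=16, minute=35)
fast_from = fasting_starts_at(slot, timedelta(hours=6))
print(fast_from.strftime("%H:%M"))  # 10:35

# Blowout: needed by 11 AM, confirmation arrives at 2:15 PM.
deadline = day.replace(hour=11)
callback = day.replace(hour=14, minute=15)
print(deadline_missed(deadline, callback))  # True
print(callback - deadline)  # 3:15:00
```

Connecting the two facts in the first case shows that the patient must stop eating at 10:35 AM, which is worth flagging before confirming. In the second case, the confirmation arrives 3 hours and 15 minutes after the deadline, so it is not good news.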