Duckbill Research

Research Area 01

Hybrid Execution

AI agents can reason, plan, and execute — but the real world pushes back. We study the gap between what AI can do in theory and what actually works when vendors won't cooperate, systems block automation, and every task carries real consequences for real people.

The Challenge

Why the real world resists full automation

The primary barriers to full automation are not AI capability gaps. They are systemic constraints in the real world. Vendors hang up on AI callers. Healthcare providers refuse non-human interactions. Platforms enforce bot detection. Financial institutions require device-bound MFA. Businesses have irregular hours, unresponsive phone lines, and policies that only surface mid-conversation.

Alongside these systemic barriers, specific cognitive capabilities matter for execution quality — maintaining fidelity across long conversations, knowing when to defer vs. act, connecting constraints across time. These are the dimensions we measure in DuckBench.

Real-World Automation Study

Completion Rates

Estimated end-to-end completion rates across 20 categories, derived from a proprietary dataset of 15,500,000+ real-world actions — phone calls, emails, web interactions, and vendor negotiations that no public benchmark captures. The cards below show per-action success rates — how often a single phone call, web action, or digital step succeeds. The table shows compounded end-to-end rates — what happens when a request requires many of these actions in sequence, each of which can independently fail. Our agentic estimates are forward-looking: they assume near-perfect AI reasoning and instead model the systemic constraints that limit full automation today. Even a flawless model would face these same barriers — the gap is environmental, not intellectual.

Digital Actions

65 – 92%

Per-action rates for file creation (88%), email threads (85%), web browsing (82%), and research steps (65%). These are the most reliable primitives — failures come from ambiguous user intent, subjective judgment calls, and integration gaps rather than fundamental AI limitations.

Phone Calls

35 – 85%

Per-call success varies dramatically by sector. Structured bookings (salons, restaurants) reach 73%. Automated support lines (airlines, returns) reach 85%. But healthcare providers refuse AI callers (35%), and small contractors often don't pick up (38%). With requests averaging 9–32 calls, even moderate per-call rates compound to single digits end-to-end.
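The compounding effect can be sketched in a few lines. This is a simplified illustration, not the study's actual model: it assumes every call succeeds independently with the same probability and that the request needs all of them to succeed. The function name `end_to_end_rate` is hypothetical.

```python
def end_to_end_rate(per_action_rate: float, num_actions: int) -> float:
    """Probability that all num_actions steps succeed.

    Assumes (for illustration only) that steps are independent and
    share the same per-action success rate.
    """
    if not 0.0 <= per_action_rate <= 1.0:
        raise ValueError("per_action_rate must be between 0 and 1")
    if num_actions < 0:
        raise ValueError("num_actions must be non-negative")
    return per_action_rate ** num_actions


if __name__ == "__main__":
    # Per-call rates and call counts taken from the figures quoted above.
    for rate in (0.73, 0.85):
        for calls in (9, 32):
            combined = end_to_end_rate(rate, calls)
            print(f"{rate:.0%} per call over {calls} calls: {combined:.1%}")
```

Under these assumptions, a 73% per-call rate over 9 calls already lands near 6% end-to-end, which is consistent with the single-digit figures in the table.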

Web & Browser

25 – 82%

Per-action rates for interactive web actions. Structured tools (calendars, docs) reach 82%. Standard form filling and scheduling sit at 72%. But major aggregators and marketplaces (Airbnb, Amazon, ticketing platforms) deploy aggressive bot detection that drops remote browser success to as low as 25%. Credential and 2FA barriers remain a primary industry-wide limitation.

Category                       Agentic   Duckbill
Insurance & Coverage           5.8%      96.1%
Home Maintenance               6.9%      91.6%
Medical Scheduling             7.6%      93.6%
Moving & Shipping              10.1%     97.1%
Travel & Hotels                12.0%     93.2%
Lessons & Activities           13.2%     88.4%
Vehicle & Government           14.6%     95.0%
Flights & Airlines Support     15.4%     90.7%
Dining & Events                16.1%     93.8%
Service Provider Search        16.2%     95.6%
Salon & Personal Services      16.6%     94.2%
Travel Documents               17.1%     98.0%
Admin & Paperwork              17.8%     91.5%
Orders & Returns               20.7%     96.3%
Gifts & Occasions              24.3%     94.6%
Cancellations & Memberships    25.8%     91.7%
Meal Planning & Food           34.8%     95.1%
Shopping & Sourcing            35.5%     91.9%
Research & Documents           56.8%     92.2%
Calendar & Reminders           87.2%     98.0%

20 categories · 15,500,000+ actions analyzed
Agentic: 17.7% · Duckbill: 93%

The Duckbill System

Hybrid execution architecture

Our approach is to build the operational layer that makes better models immediately more effective, while closing the gap that even perfect models can't close. AI handles planning, coordination, and quality control. Humans handle the moments that require real-world presence: phone calls that need a human voice, credential access, vendor negotiation, and recovery from automation failures.

Every component in the system is designed around one question: how do we get closer to the ~93% completion rate we achieve with humans in the loop, while progressively reducing the human involvement required to get there?

01

Orchestration Brain

Two-agent architecture separating strategic planning (user-facing) from tactical execution (operator-facing). Event-driven state machine with 17+ trigger types — vendor emails, operator updates, stalled-request retries, secure session results. Three-party information asymmetry: the client sees AI, the operator sees AI instructions, AI mediates both sides. Operators rotate in and out; AI maintains continuity. This is what lets us keep completion rates high even when individual automation steps fail — the orchestrator detects failure and reroutes to a human seamlessly.
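To make the event-driven state machine concrete, here is a minimal sketch. The four trigger names come from the description above; the `AUTOMATION_FAILED` trigger, the state names, and the transition table are assumptions made for illustration and do not describe the production design.

```python
from enum import Enum, auto


class Trigger(Enum):
    VENDOR_EMAIL = auto()
    OPERATOR_UPDATE = auto()
    STALLED_RETRY = auto()
    SECURE_SESSION_RESULT = auto()
    AUTOMATION_FAILED = auto()  # hypothetical trigger, added for this sketch


class State(Enum):
    # All state names are hypothetical.
    AUTOMATED_STEP = auto()
    WAITING_ON_VENDOR = auto()
    WITH_OPERATOR = auto()
    COMPLETED = auto()


# (current state, trigger) -> next state. Illustrative only.
TRANSITIONS = {
    (State.AUTOMATED_STEP, Trigger.SECURE_SESSION_RESULT): State.WAITING_ON_VENDOR,
    (State.AUTOMATED_STEP, Trigger.AUTOMATION_FAILED): State.WITH_OPERATOR,
    (State.WAITING_ON_VENDOR, Trigger.VENDOR_EMAIL): State.COMPLETED,
    (State.WAITING_ON_VENDOR, Trigger.STALLED_RETRY): State.WITH_OPERATOR,
    (State.WITH_OPERATOR, Trigger.OPERATOR_UPDATE): State.COMPLETED,
}


class Request:
    """Tracks one request; the history outlives any single operator."""

    def __init__(self) -> None:
        self.state = State.AUTOMATED_STEP
        self.history: list[tuple[State, Trigger, State]] = []

    def handle(self, trigger: Trigger) -> State:
        # Triggers with no matching transition leave the state unchanged.
        next_state = TRANSITIONS.get((self.state, trigger), self.state)
        self.history.append((self.state, trigger, next_state))
        self.state = next_state
        return next_state
```

In this sketch a failed automated step reroutes to a human in a single transition, and the recorded history is what would let continuity survive operator rotation.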

02

Routing Engine

Matches actions to operators via skill affinity, SLA pressure, and action-type history. Tracks the blocker chain (is progress stalled on the vendor, the client, or the operator?) and applies different follow-up strategies for each. Check intervals are urgency-adjusted and business-hours-aware. The routing engine is how we turn the 17.7% agentic rate into 93%, by identifying exactly when and where human intervention is needed and dispatching the right person immediately.
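One way to picture the matching step is a weighted score over the three signals named above. Everything in this sketch is an assumption: the class and function names, the weights, the saturation constant, and the choice to let SLA pressure amplify the value of proven history. It shows the shape of such a scorer, not the actual routing logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Operator:
    name: str
    skills: set[str]
    # Hypothetical history: completed actions per action type.
    completed: dict[str, int] = field(default_factory=dict)


@dataclass
class Action:
    action_type: str
    required_skills: set[str]
    hours_until_sla: float


def skill_affinity(op: Operator, action: Action) -> float:
    """Fraction of the required skills the operator has."""
    if not action.required_skills:
        return 1.0
    return len(op.skills & action.required_skills) / len(action.required_skills)


def history_score(op: Operator, action: Action, half_point: int = 5) -> float:
    """Saturating score in [0, 1) based on past completions of this action type."""
    done = op.completed.get(action.action_type, 0)
    return done / (done + half_point)


def sla_pressure(action: Action) -> float:
    """Approaches 1.0 as the SLA deadline nears; illustrative formula."""
    return 1.0 / (1.0 + max(action.hours_until_sla, 0.0))


def score(op: Operator, action: Action) -> float:
    # Assumed weighting: under SLA pressure, proven history counts for more.
    return (0.6 * skill_affinity(op, action)
            + 0.4 * history_score(op, action) * (1.0 + sla_pressure(action)))


def pick_operator(operators: list[Operator], action: Action) -> Operator:
    return max(operators, key=lambda op: score(op, action))
```

A real router would also need to account for operator availability and current load, which this sketch leaves out.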

03

Quality Shield

Multi-gate system before any message reaches the user. Entity-level hallucination detection cross-references phone numbers, URLs, and emails against the full conversation history. Predictive sentiment analysis catches conversations trending negative before the user complains. A composite risk score combines execution health signals — request age, operator churn, vendor unresponsiveness — weighted per request type. The quality shield is what prevents automation failures from becoming user-visible failures.
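The entity-level check can be illustrated with a small sketch: extract phone numbers, URLs, and emails from a draft message and flag any that never appeared in the conversation history. The regular expressions, the function names, and the phone normalization (keeping the last 10 digits, which assumes North American style numbers) are simplifications chosen for this example, not the production detectors.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>()]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_entities(text: str) -> dict[str, set[str]]:
    """Pull emails, URLs, and normalized phone numbers out of free text."""
    emails = {m.lower() for m in EMAIL_RE.findall(text)}
    urls = {m.rstrip(".,;") for m in URL_RE.findall(text)}
    # Remove URLs and emails first so digits inside them are not read as phones.
    remainder = EMAIL_RE.sub(" ", URL_RE.sub(" ", text))
    # Assumption: compare phones by their last 10 digits only.
    phones = {re.sub(r"\D", "", m)[-10:] for m in PHONE_RE.findall(remainder)}
    return {"emails": emails, "urls": urls, "phones": phones}


def unsupported_entities(draft: str, history: list[str]) -> dict[str, set[str]]:
    """Return entities in the draft that never appear in the history."""
    seen = extract_entities("\n".join(history))
    found = extract_entities(draft)
    return {kind: found[kind] - seen[kind] for kind in found}


if __name__ == "__main__":
    history = ["Call the salon at 415-555-0134 or email desk@example.com"]
    draft = ("I reached them at (415) 555-0134. You can also book at "
             "https://example.org/book or write desk@example.com.")
    print(unsupported_entities(draft, history))
```

In the usage example, the phone number and email are backed by the history, while the booking URL is not and would be flagged for review before the message goes out.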

04

Real-Time Supervision

AI follows along with structured playbooks while operators work. Validates progress, catches deviations, provides just-in-time context. Every session generates a stream of events — pages visited, calls made, emails sent — that feed the supervision model. The playbooks themselves evolve: when operators consistently deviate and get better results, that signal feeds back into system improvement. This layer progressively reduces the cost of human involvement while maintaining quality.
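A rough sketch of following a playbook against a session's event stream is shown here. It assumes a playbook is an ordered list of `(kind, target)` steps and that events use the same shape; both the representation and the `supervise` function are hypothetical.

```python
Step = tuple[str, str]  # (kind, target), e.g. ("call", "vendor")


def supervise(playbook: list[Step], events: list[Step]) -> dict[str, list[Step]]:
    """Walk the event stream and compare it with the expected step order.

    An event that matches the next expected step advances the playbook;
    anything else is recorded as a deviation for later review.
    """
    next_index = 0
    deviations: list[Step] = []
    for event in events:
        if next_index < len(playbook) and event == playbook[next_index]:
            next_index += 1
        else:
            deviations.append(event)
    return {
        "completed": playbook[:next_index],
        "remaining": playbook[next_index:],
        "deviations": deviations,
    }


if __name__ == "__main__":
    playbook = [("page", "vendor site"), ("call", "vendor"), ("email", "client")]
    events = [("page", "vendor site"), ("page", "review site"), ("call", "vendor")]
    print(supervise(playbook, events))
```

Deviations are recorded rather than treated as errors, since the text notes that consistent deviations with better outcomes are a signal for updating the playbook itself.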

05

Simulation Engine

Full simulated conversations using LLM-driven blind actors — they don't know the expected outcome. Async events get injected mid-conversation: a vendor email arrives, a phone call goes to voicemail, business hours end. Adversarial personas stress-test edge cases. Thousands of scenarios run against the system before any change ships. This is how we improve the agentic completion rate itself — by identifying failure modes in simulation before they hit production.
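The injection of async events mid-conversation can be sketched with a simple turn loop. In this illustration, `agent` and `actor` are plain callables standing in for LLM-driven components, and events are keyed by the turn at which they arrive. All names, the stub behaviors, and the turn-indexed schedule are assumptions made for the example.

```python
from typing import Callable

Transcript = list[tuple[str, str]]  # (speaker, text)
Responder = Callable[[Transcript], str]


def run_simulation(agent: Responder, actor: Responder,
                   injected_events: dict[int, list[str]],
                   max_turns: int = 4) -> Transcript:
    """Alternate agent and actor turns, injecting scheduled events first."""
    transcript: Transcript = []
    for turn in range(max_turns):
        # Events scheduled for this turn arrive before the agent speaks.
        for event in injected_events.get(turn, []):
            transcript.append(("event", event))
        transcript.append(("agent", agent(transcript)))
        transcript.append(("actor", actor(transcript)))
    return transcript


def stub_agent(transcript: Transcript) -> str:
    # Stand-in for the system under test; reacts to the latest event.
    if transcript and transcript[-1] == ("event", "call went to voicemail"):
        return "I will retry the call during business hours."
    return "Checking availability with the vendor."


def stub_actor(transcript: Transcript) -> str:
    # Stand-in for a blind actor; it has no access to the expected outcome.
    return "Okay, please keep me posted."


if __name__ == "__main__":
    schedule = {1: ["call went to voicemail"]}
    for speaker, text in run_simulation(stub_agent, stub_actor, schedule, max_turns=2):
        print(f"{speaker}: {text}")
```

An evaluation step would then score each transcript, for example by checking that the agent acknowledged the injected event, which is how a scenario like this could surface failure modes before a change ships.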

06

Data Flywheel

Production data becomes simulation scenarios. Failures become harder test cases. Vendor response patterns accumulate into a knowledge graph. Operator performance data improves routing. The loop: production → extraction → simulation → evaluation → improvement → production. With 15.5M+ real-world actions and growing, every request the system handles makes the next one slightly more automatable — and deepens a dataset no one else has. The system that handles a request today is measurably better than last week's.