Duckbill Research

AI-Human Coordination

Research Area 01

Hybrid Execution

AI agents can reason, plan, and execute — but the real world pushes back. We study the gap between what AI can do in theory and what actually works when vendors won't cooperate, systems block automation, and tasks demand judgment that models weren't trained for.

The Challenge

Why full automation isn't here yet

The primary barriers to full automation are not AI capability gaps. They are systemic constraints in the real world. Vendors hang up on AI callers. Healthcare providers refuse non-human interactions. Platforms enforce bot detection. Financial institutions require device-bound MFA. Businesses have irregular hours, unresponsive phone lines, and policies that only surface mid-conversation.

Secondary to these are performance gaps between how models are trained and what good task execution requires — knowing when to stop searching, maintaining fidelity without hallucinating, tracking constraints across long conversations, and exercising judgment about when to act versus when to defer. These are the dimensions we measure in DuckBench.

Task Automation Study

Completion Rates

Estimated end-to-end completion rates across 20 task categories, derived from more than 100,000 real-world tasks. The cards below show per-action success rates — how often a single phone call, web action, or digital step succeeds. The table shows compounded end-to-end rates — what happens when a task requires many of these actions in sequence, each of which can independently fail. Our agentic estimates are forward-looking: they assume near-perfect AI reasoning and instead model the systemic constraints that limit full automation today.

Digital Actions

65 – 92%

Per-action rates for file creation (88%), email threads (85%), web browsing (82%), and research steps (65%). These are the most reliable primitives — failures come from ambiguous user intent, subjective judgment calls, and integration gaps rather than fundamental AI limitations.

Phone Calls

35 – 85%

Per-call success varies dramatically by sector. Structured bookings (salons, restaurants) reach 73%. Automated support lines (airlines, returns) reach 85%. But healthcare providers refuse AI callers (35%), and small contractors often don't pick up (38%). With tasks averaging 9–32 calls, even moderate per-call rates compound to single digits end-to-end.
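The compounding claim above is simple arithmetic. A minimal sketch, assuming independent per-call failures and using the per-call rates and the 9–32 call range stated in this card:

```python
def end_to_end_rate(per_action: float, n_actions: int) -> float:
    """Probability that n independent actions all succeed."""
    return per_action ** n_actions

# Per-call rates from the card above, across the stated 9-32 call range.
for rate in (0.73, 0.85):
    for calls in (9, 32):
        print(f"{rate:.0%}/call x {calls} calls -> "
              f"{end_to_end_rate(rate, calls):.1%} end-to-end")
```

Even the best per-call rate here (85%) compounds below 1% over 32 calls, consistent with the single-digit end-to-end figures in the table below. Correlated failures or retries would shift these numbers, so treat this as an intuition pump rather than the study's exact model.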

Web & Browser

25 – 82%

Per-action rates for interactive web tasks. Structured tools (calendars, docs) reach 82%. Standard form filling and scheduling reach 72%. But major aggregators and marketplaces — Airbnb, Amazon, ticketing platforms — deploy aggressive bot detection that drops remote browser success to as low as 25%. Credential and 2FA barriers remain a primary industry-wide limitation.

Category                        Agentic    Duckbill
Insurance & Coverage              5.8%       96.1%
Home Maintenance                  6.9%       91.6%
Medical Scheduling                7.6%       93.6%
Moving & Shipping                10.1%       97.1%
Travel & Hotels                  12.0%       93.2%
Lessons & Activities             13.2%       88.4%
Vehicle & Government             14.6%       95.0%
Flights & Airlines Support       15.4%       90.7%
Dining & Events                  16.1%       93.8%
Service Provider Search          16.2%       95.6%
Salon & Personal Services        16.6%       94.2%
Travel Documents                 17.1%       98.0%
Admin & Paperwork                17.8%       91.5%
Orders & Returns                 20.7%       96.3%
Gifts & Occasions                24.3%       94.6%
Cancellations & Memberships      25.8%       91.7%
Meal Planning & Food             34.8%       95.1%
Shopping & Sourcing              35.5%       91.9%
Research & Documents             56.8%       92.2%
Calendar & Reminders             87.2%       98.0%

20 categories · 100,000+ tasks analyzed · Agentic: 17.7% · Duckbill: 93%

The Duckbill System

Hybrid execution architecture

Our approach to closing the gap between agentic and human-level completion rates is not to wait for better models. It's to build a hybrid execution system where AI handles planning, coordination, and quality control while humans handle the moments that require real-world presence — phone calls that need a human voice, credential access, vendor negotiation, and recovery from automation failures.

Every component in the system is designed around one question: how do we get closer to the ~96% completion rate we achieve with humans in the loop, while progressively reducing the human involvement required to get there?

01

Orchestration Brain

Two-agent architecture separating strategic planning (user-facing) from tactical execution (operator-facing). Event-driven state machine with 17+ trigger types — vendor emails, operator updates, stuck-task retries, secure session results. Three-party information asymmetry: the client sees AI, the operator sees AI instructions, AI mediates both sides. Operators rotate in and out; AI maintains continuity. This is what lets us keep completion rates high even when individual automation steps fail — the orchestrator detects failure and reroutes to a human seamlessly.
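A toy sketch of the event-driven dispatch described above. The trigger names, states, and handlers here are hypothetical stand-ins, not Duckbill's actual 17+ trigger types:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    id: str
    state: str = "planning"
    log: list[str] = field(default_factory=list)

# Hypothetical handlers for two of the trigger types.
def on_vendor_email(task: Task) -> None:
    task.state = "executing"
    task.log.append("vendor replied; tactical agent resumes")

def on_stuck_retry(task: Task) -> None:
    task.state = "human_handoff"  # orchestrator reroutes to an operator
    task.log.append("automation stalled; routed to operator")

HANDLERS: dict[str, Callable[[Task], None]] = {
    "vendor_email": on_vendor_email,
    "stuck_task_retry": on_stuck_retry,
}

def dispatch(task: Task, trigger: str) -> Task:
    """State machine step: apply the handler registered for a trigger."""
    HANDLERS[trigger](task)
    return task
```

The key property mirrored here is that failure detection is just another trigger: a stuck task transitions to a human-handoff state instead of terminating, which is what keeps completion rates high when individual automation steps fail.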

02

Routing Engine

Matches tasks to operators via skill affinity, SLA pressure, and task-type history. Tracks the blocker chain — is progress stalled on the vendor, the client, or the operator? — and applies different follow-up strategies for each. Check intervals are urgency-adjusted and business-hours-aware. The routing engine is how we close the gap between per-action agentic success rates and the ~96% hybrid completion rate: by identifying exactly when and where human intervention is needed and dispatching the right person immediately.
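The three matching signals can be sketched as a weighted composite score. The field names and weights below are illustrative assumptions, not Duckbill's actual formula:

```python
def routing_score(operator: dict, task: dict) -> float:
    """Toy composite of skill affinity, task-type history, and SLA pressure."""
    required = set(task["required_skills"])
    affinity = len(required & set(operator["skills"])) / max(len(required), 1)
    sla_pressure = min(1.0, 1.0 / max(task["hours_to_sla"], 1.0))  # closer deadline, higher score
    history = operator["past_success"].get(task["type"], 0.5)       # neutral prior for unseen types
    return 0.5 * affinity + 0.3 * history + 0.2 * sla_pressure

def route(task: dict, operators: list[dict]) -> dict:
    """Dispatch to the highest-scoring available operator."""
    return max(operators, key=lambda op: routing_score(op, task))
```

A production version would also need availability, load balancing, and the blocker-chain state the section describes; this only shows how the three named signals might combine.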

03

Quality Shield

Multi-gate system before any message reaches the user. Entity-level hallucination detection cross-references phone numbers, URLs, and emails against the full conversation history. Predictive sentiment analysis catches conversations trending negative before the user complains. A composite risk score combines execution health signals — task age, operator churn, vendor unresponsiveness — weighted per task type. The quality shield is what prevents automation failures from becoming user-visible failures.
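Entity-level cross-referencing of this kind can be sketched in a few lines. The regexes are simplified assumptions; the real gate presumably also covers emails and normalizes formats:

```python
import re

PHONE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")
URL = re.compile(r"https?://[^\s]+")

def unsupported_entities(draft: str, history: str) -> list[str]:
    """Return phone numbers and URLs in a draft reply that never appeared
    anywhere in the conversation history: candidates for hallucination."""
    candidates = PHONE.findall(draft) + URL.findall(draft)
    return [e for e in candidates if e not in history]
```

For example, a draft quoting a confirmation number or phone line that no vendor ever mentioned gets flagged before the message reaches the user, rather than after.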

04

Real-Time Supervision

AI follows along with structured playbooks while operators work. Validates progress, catches deviations, provides just-in-time context. Every session generates a stream of events — pages visited, calls made, emails sent — that feed the supervision model. The playbooks themselves evolve: when operators consistently deviate and get better results, that signal feeds back into system improvement. This layer progressively reduces the cost of human involvement while maintaining quality.
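A minimal order-aware sketch of playbook supervision, assuming session events and playbook steps share an identifier space (step names below are hypothetical):

```python
def supervise(playbook: list[str], events: list[str]) -> dict:
    """Walk the event stream against an ordered playbook and report which
    steps completed in order, which are pending, and any off-script events.
    A toy check, not Duckbill's supervision model."""
    step = 0
    deviations = []
    for ev in events:
        if step < len(playbook) and ev == playbook[step]:
            step += 1                 # expected next step observed
        elif ev in playbook[:step]:
            continue                  # harmless repeat of a completed step
        else:
            deviations.append(ev)     # off-script action worth reviewing
    return {"completed": playbook[:step],
            "pending": playbook[step:],
            "deviations": deviations}
```

In the feedback loop the section describes, deviations that correlate with better outcomes would be promoted into the playbook rather than flagged, which is where the learning signal comes from.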

05

Simulation Engine

Full simulated conversations using LLM-driven blind actors — they don't know the expected outcome. Async events get injected mid-conversation: a vendor email arrives, a phone call goes to voicemail, business hours end. Adversarial personas stress-test edge cases. Thousands of scenarios run against the system before any change ships. This is how we improve the agentic completion rate itself — by identifying failure modes in simulation before they hit production.
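The injection mechanism can be sketched as a scripted run with async events spliced in at fixed turns. The agent and event strings here are placeholders:

```python
from typing import Callable

def run_scenario(agent: Callable[[str], str], script: list[str],
                 injections: dict[int, str]) -> list[str]:
    """Replay a blind-actor script against an agent, splicing async events
    (vendor email, voicemail, business hours ending) in at given turns."""
    transcript: list[str] = []
    for turn, user_msg in enumerate(script):
        if turn in injections:
            transcript.append(f"[event] {injections[turn]}")
        transcript.append(f"user: {user_msg}")
        transcript.append(f"agent: {agent(user_msg)}")
    return transcript
```

Running thousands of (script, injections) pairs, including adversarial personas as the scripted actor, is what lets failure modes surface in simulation before a change ships.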

06

Data Flywheel

Production tasks become simulation scenarios. Failures become harder test cases. Vendor response patterns accumulate into a knowledge graph. Operator performance data improves routing. The loop: production → extraction → simulation → evaluation → improvement → production. Every task the system handles makes the next task slightly more automatable. The system that handles a task today is measurably better than last week's.