
Research Area 01
Hybrid Execution
AI agents can reason, plan, and execute — but the real world pushes back. We study the gap between what AI can do in theory and what actually works when vendors won't cooperate, systems block automation, and tasks demand judgment that models weren't trained for.
The Challenge
Why full automation isn't here yet
The primary barriers to full automation are not AI capability gaps. They are systemic constraints in the real world. Vendors hang up on AI callers. Healthcare providers refuse non-human interactions. Platforms enforce bot detection. Financial institutions require device-bound MFA. Businesses have irregular hours, unresponsive phone lines, and policies that only surface mid-conversation.
Secondary to these are performance gaps between how models are trained and what good task execution requires — knowing when to stop searching, maintaining fidelity without hallucinating, tracking constraints across long conversations, and exercising judgment about when to act versus when to defer. These are the dimensions we measure in DuckBench.
Task Automation Study
Completion Rates
Estimated end-to-end completion rates across 20 task categories, derived from more than 100,000 real-world tasks. The cards below show per-action success rates — how often a single phone call, web action, or digital step succeeds. The table shows compounded end-to-end rates — what happens when a task requires many of these actions in sequence, each of which can independently fail. Our agentic estimates are forward-looking: they assume near-perfect AI reasoning and instead model the systemic constraints that limit full automation today.
Digital Actions
65 – 92%
Per-action rates for file creation (88%), email threads (85%), web browsing (82%), and research steps (65%). These are the most reliable primitives — failures come from ambiguous user intent, subjective judgment calls, and integration gaps rather than fundamental AI limitations.
Phone Calls
35 – 85%
Per-call success varies dramatically by sector. Structured bookings (salons, restaurants) reach 73%. Automated support lines (airlines, returns) reach 85%. But healthcare providers refuse AI callers (35%), and small contractors often don't pick up (38%). With tasks requiring 9–32 calls, even moderate per-call rates compound to single digits end-to-end.
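The compounding arithmetic behind that last point is worth making explicit. A minimal sketch — the per-call rates are the figures above, but the independence assumption and the function itself are illustrative:

```python
def end_to_end_rate(per_action: float, n_actions: int) -> float:
    """Probability that n independent sequential actions all succeed."""
    return per_action ** n_actions

# A 73% per-call rate looks healthy in isolation...
print(end_to_end_rate(0.73, 1))             # 0.73
# ...but a 9-call task completes end-to-end only ~6% of the time,
print(f"{end_to_end_rate(0.73, 9):.3f}")    # 0.059
# and a 32-call task almost never does.
print(f"{end_to_end_rate(0.73, 32):.6f}")   # 0.000042
```

Real calls are not fully independent (a responsive vendor stays responsive), so this is a floor-setting intuition rather than a precise model.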
Web & Browser
25 – 82%
Per-action rates for interactive web tasks. Structured tools (calendars, docs) reach 82%. Standard form filling and scheduling at 72%. But major aggregators and marketplaces — Airbnb, Amazon, ticketing platforms — deploy aggressive bot detection that drops remote browser success to as low as 25%. Credential and 2FA barriers remain a primary industry-wide limitation.
The Duckbill System
Hybrid execution architecture
Our approach to closing the gap between agentic and human-level completion rates is not to wait for better models. It's to build a hybrid execution system where AI handles planning, coordination, and quality control while humans handle the moments that require real-world presence — phone calls that need a human voice, credential access, vendor negotiation, and recovery from automation failures.
Every component in the system is designed around one question: how do we get closer to the ~96% completion rate we achieve with humans in the loop, while progressively reducing the human involvement required to get there?
Orchestration Brain
Two-agent architecture separating strategic planning (user-facing) from tactical execution (operator-facing). Event-driven state machine with 17+ trigger types — vendor emails, operator updates, stuck-task retries, secure session results. Three-party information asymmetry: the client sees AI, the operator sees AI instructions, AI mediates both sides. Operators rotate in and out; AI maintains continuity. This is what lets us keep completion rates high even when individual automation steps fail — the orchestrator detects failure and reroutes to a human seamlessly.
Routing Engine
Matches tasks to operators via skill affinity, SLA pressure, and task-type history. Tracks the blocker chain — is progress stalled on the vendor, the client, or the operator? — and applies different follow-up strategies for each. Check intervals are urgency-adjusted and business-hours-aware. The routing engine is how we turn the 65% agentic rate into 96% — by identifying exactly when and where human intervention is needed and dispatching the right person immediately.
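The matching step can be pictured as a weighted score over those signals — the weights and signal names below are illustrative stand-ins, not the production formula:

```python
def routing_score(skill_affinity: float, sla_pressure: float,
                  history_success: float,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Blend normalized [0, 1] match signals into one ranking score.
    Weights are hypothetical; a real engine would tune them per task type."""
    return (weights[0] * skill_affinity
            + weights[1] * sla_pressure
            + weights[2] * history_success)

# Two candidate operators for the same task: one is a better skill match,
# the other is free right now while the SLA clock is running down.
candidates = {
    "op_a": routing_score(skill_affinity=0.9, sla_pressure=0.2, history_success=0.8),
    "op_b": routing_score(skill_affinity=0.6, sla_pressure=0.9, history_success=0.7),
}
best = max(candidates, key=candidates.get)
print(best)  # op_b — SLA pressure outweighs the affinity gap here
```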
Quality Shield
Multi-gate system before any message reaches the user. Entity-level hallucination detection cross-references phone numbers, URLs, and emails against the full conversation history. Predictive sentiment analysis catches conversations trending negative before the user complains. A composite risk score combines execution health signals — task age, operator churn, vendor unresponsiveness — weighted per task type. The quality shield is what prevents automation failures from becoming user-visible failures.
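The entity-level gate can be sketched as a set difference: extract contact-style entities from a draft message and flag any that never appeared in the conversation history. The regexes and example strings here are simplified illustrations:

```python
import re

PHONE = re.compile(r"\+?\d[\d\-\s()]{7,}\d")
URL = re.compile(r"https?://\S+")

def extract_entities(text: str) -> set:
    """Pull phone numbers and URLs out of free text (toy patterns)."""
    return {m.group().strip() for m in PHONE.finditer(text)} | set(URL.findall(text))

def unsupported_entities(draft: str, history: str) -> set:
    """Entities in the draft that have no grounding in the conversation history —
    candidates for hallucination review before the message ships."""
    return extract_entities(draft) - extract_entities(history)

history = "Vendor said call 415-555-0134 to confirm."
draft = "Call 415-555-0134 or visit https://example.com/book to confirm."
# The phone number is grounded in history; the URL is not, so it gets flagged.
print(unsupported_entities(draft, history))
```

A production version would normalize formats (e.g. `(415) 555-0134` vs `415-555-0134`) before comparing; the set-difference shape stays the same.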
Real-Time Supervision
AI follows along with structured playbooks while operators work. Validates progress, catches deviations, provides just-in-time context. Every session generates a stream of events — pages visited, calls made, emails sent — that feed the supervision model. The playbooks themselves evolve: when operators consistently deviate and get better results, that signal feeds back into system improvement. This layer progressively reduces the cost of human involvement while maintaining quality.
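One way to picture the deviation-checking step is a diff of observed actions against an ordered playbook — a toy version, since the real supervision model surely does more than string equality:

```python
def find_deviations(playbook, observed):
    """Pair each expected playbook step with what the operator actually did;
    report (index, expected, observed) for every mismatch."""
    return [
        (i, expected, actual)
        for i, (expected, actual) in enumerate(zip(playbook, observed))
        if expected != actual
    ]

playbook = ["open_vendor_page", "check_availability", "book_slot"]
observed = ["open_vendor_page", "call_vendor", "book_slot"]
print(find_deviations(playbook, observed))
# [(1, 'check_availability', 'call_vendor')]
```

In the feedback loop the text describes, a deviation like `call_vendor` that consistently produces better outcomes would become a signal to revise the playbook itself.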
Simulation Engine
Full simulated conversations using LLM-driven blind actors — they don't know the expected outcome. Async events get injected mid-conversation: a vendor email arrives, a phone call goes to voicemail, business hours end. Adversarial personas stress-test edge cases. Thousands of scenarios run against the system before any change ships. This is how we improve the agentic completion rate itself — by identifying failure modes in simulation before they hit production.
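The event-injection idea can be sketched as a scripted replay loop — the canned replies below stand in for the LLM-driven blind actors, and all names are illustrative:

```python
def run_scenario(user_turns, inject_at, async_event):
    """Replay a conversation turn by turn, splicing in an async event
    (a vendor email, a voicemail, closing time) at a chosen point."""
    transcript = []
    for i, turn in enumerate(user_turns):
        if i == inject_at:
            transcript.append(("event", async_event))
        transcript.append(("user", turn))
        transcript.append(("actor", f"ack: {turn}"))  # stand-in for a blind-actor reply
    return transcript

log = run_scenario(
    ["book a table for two", "confirm 7pm"],
    inject_at=1,
    async_event="vendor_email: fully booked tonight",
)
```

The point of the injection is that the actor's next reply must now reconcile "confirm 7pm" with a world where the restaurant is full — exactly the mid-conversation state change the simulations stress-test.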
Data Flywheel
Production tasks become simulation scenarios. Failures become harder test cases. Vendor response patterns accumulate into a knowledge graph. Operator performance data improves routing. The loop: production → extraction → simulation → evaluation → improvement → production. Every task the system handles makes the next task slightly more automatable. The system that handles a task today is measurably better than last week's.
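The extraction step of that loop, in pseudo-pipeline form — the field names are ours and purely illustrative:

```python
def failure_to_scenario(task):
    """Promote a completed production task to a simulation scenario.
    Failures become harder test cases, per the flywheel described above."""
    return {
        "setup": task["context"],
        "injected_events": task["events"],
        "expected_outcome": task["intended_outcome"],
        "difficulty": "hard" if task["failed"] else "standard",
    }

scenario = failure_to_scenario({
    "context": "book dentist appointment for next week",
    "events": ["office_closed_early"],
    "intended_outcome": "appointment confirmed",
    "failed": True,
})
```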