
Research Area 03
DuckBench
Situations where a reasonably skilled human assistant would not fail. Each one drawn from production — testing whether AI can do the obvious thing, and what happens to your service when it can't.
Motivation
The brain problem
At Duckbill, we run a personal assistant service. Real people depend on it to book flights, dispute charges, negotiate with vendors, and manage their lives. When the AI behind the service schedules a fasting medical test for late afternoon, or discloses a client's budget to a contractor, or cheerfully confirms an appointment that already passed — the service fails. The client loses trust.
The people who use Duckbill are not developers, early adopters, or tinkerers. They expect reliability over autonomy. They want minimal back-and-forth. They need an agent that does the obvious thing without being told — not one that asks five clarifying questions before acting on something straightforward.
Our AI systems operate in complex environments: sub-agents, human copilots, vendor phone calls, web research, models of different capability levels — each with their own failure rates. A good LLM brain needs enough world model to make sound decisions, without being overconfident about hard facts it doesn't actually have.
It needs to know when to trust a tool output and when to question it. When to advocate for the client and when to defer to a human collaborator. When to act immediately and when to ask one more question. When to tell the client something they don't want to hear. DuckBench is a growing collection of real situations drawn from production — each one simple enough that any human assistant with six months of experience would get it right. The test is whether AI can do the same.
Results
Leaderboard
| # | Model | Fidelity | Discernment | Continuity | Temporality | Deference | Judgment | Praxis | Overall |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Opus 4.6 | 87.5 | 63.9 | 85.1 | 70.2 | 73.1 | 81.1 | 34.5 | 72.2 |
| 2 | Gemini 3.1 Pro | 72.9 | 61.1 | 76.2 | 70.2 | 72.7 | 68.2 | 46.4 | 68.1 |
| 3 | GPT-5.4 | 71.9 | 55.6 | 86.8 | 56.0 | 70.7 | 71.2 | 54.8 | 67.9 |
| 4 | Sonnet 4.6 | 75.0 | 63.9 | 84.1 | 75.0 | 64.4 | 68.9 | 42.9 | 67.4 |
| 5 | Kimi K2.5 | 53.1 | 68.1 | 73.4 | 39.3 | 63.4 | 78.8 | 34.5 | 60.5 |
| 6 | GLM-5 | 63.5 | 54.2 | 80.5 | 47.6 | 63.9 | 68.2 | 33.3 | 60.4 |
| 7 | GPT-5.3 | 52.1 | 72.2 | 80.4 | 42.9 | 60.2 | 71.2 | 34.5 | 59.7 |
| 8 | Grok 4.20 | 70.8 | 38.9 | 81.0 | 27.4 | 50.5 | 65.2 | 40.5 | 54.2 |
| 9 | Gemini 3 Flash | 33.3 | 43.1 | 76.2 | 41.7 | 53.7 | 67.9 | 39.3 | 52.2 |
| 10 | o3 | 39.6 | 37.5 | 78.6 | 34.5 | 59.3 | 61.4 | 33.3 | 51.7 |
| 11 | DeepSeek V3.2 | 47.9 | 30.6 | 71.4 | 32.1 | 57.4 | 68.9 | 25.0 | 50.9 |
| 12 | GPT-4.1 | 36.5 | 47.2 | 78.0 | 26.2 | 45.8 | 49.2 | 34.5 | 45.5 |
The 7 Dimensions
What DuckBench measures
Fidelity
LLMs are trained to be helpful, which creates a strong prior toward providing answers even when the model doesn't have one. Pretraining on vast text gives models enough knowledge about shipping, hotel pricing, and business policies to construct plausible-sounding claims — but plausible is not true. RLHF compounds this: reward models penalize "I don't know" because annotators rate confident answers higher. State-of-the-art models are clearly better at fidelity than earlier generations, but they still display unreliable behavior — and divergent abilities across model families. Opus and Gemini, for example, are both strong on factual recall from pretraining, but Gemini is significantly worse at knowing what it doesn't know, hallucinating details where Opus would acknowledge a gap.
From the benchmark
A vendor tells the AI that cashmere goods "cannot be shipped to New York due to state textile regulations." No such regulation exists — it's a nonsensical claim. Yet Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client without any instinct to verify — no search, no pushback, no skepticism. Only Opus 4.6 and Grok 4.20 consistently question it.
Client needs movers for tomorrow — a Saturday, less than 24 hours away. Any adult knows this is extremely hard to arrange with reputable companies. But Opus 4.6 responds with unqualified confidence in about 65% of runs, jumping straight into research without flagging the difficulty. GPT-5.4 is even worse, failing to set expectations in around 80% of runs. The client gets their hopes up for a service that almost certainly can't be delivered on this timeline.
Discernment
Knowing when there is enough information to act — and when one more question is worth the friction — is not something generative models are explicitly trained on. The right action is sometimes not producing output at all, and that's difficult to learn through standard training objectives. The failure modes are symmetric: under-asking means a booking made wrong; over-asking means the client wonders why they're paying for a service that creates more back-and-forth than doing it themselves.
From the benchmark
Client: "Send a fruit basket to my colleague James at his office. He just got promoted. Around $75. Company is Halcyon Partners." Everything needed to act is here — the company name is enough to find the address. But Opus 4.6 asks the client for the office address in about 65% of runs instead of just looking it up. Gemini 3.1 Pro asks in 70% of runs. GPT-5.4 is worse — in 75% of runs it neither asks nor looks up the address, just starts researching basket options. It may seem minor, but when your service exists to reduce cognitive load, unnecessary back-and-forth — or worse, pushing the client to look up information you could find yourself — contributes significantly to a bad experience.
Client gives a complete hotel request: "Downtown Seattle, 3 nights, June 6–9, just me, under $250/night, walkable to Pike Place." Every detail is there. But GPT-5.4 — one of the top-ranked models overall — still asks the client for details they already provided in about 40% of runs. The information was right there in the message. Over-asking when the answer is already in front of you is the other side of discernment: the model generates a question because generating feels like progress, even when the right action is to just start working.
Continuity
Transformer attention is great at in-context reasoning but degrades over long conversations — constraints established 20 messages ago compete with recent information for attention weight. This is compounded in agentic systems where tasks involve handoffs to human copilots, each introducing their own context. Modern agent harnesses address this with scratchpads, todo lists, and stateful tracking — but stronger native context retention translates directly into better coordination: models that maintain context on their own need less infrastructure to be reliable.
From the benchmark
A trip pivots from Paris to Barcelona after extensive planning. Most Paris items were never booked — but there's a restaurant reservation with a €50 deposit and a cancellation deadline approaching. GPT-5.4 flags the deposit deadline explicitly. But Gemini 3.1 Pro, in about 70% of runs, loses track of the reservation entirely amid the pivot — never flagging the deposit or the deadline. Sonnet 4.6 is inconsistent too, catching it in some runs but missing it in others.
A birthday party has 6 parallel workstreams when the client mentions: "3 of the kids have peanut allergies." This cascades to the food orders. Opus 4.6 handles it perfectly every time — immediately calling both the pizza place and the bakery. But GPT-5.4 drops it about 30% of the time — failing to trace the allergy constraint back to the services already booked and verify them.
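The failure here is mechanical: a late-arriving constraint has to be traced back to workstreams already in flight. A minimal sketch of that bookkeeping, with hypothetical workstream names and tags (not Duckbill's actual data model), might look like:

```python
def affected_workstreams(constraint_tags: set, workstreams: list) -> list:
    """Given a new constraint (e.g. an allergy tagged 'food'), return the
    workstreams it touches, so each can be re-verified with the vendor.
    Illustrative sketch only."""
    return [w for w in workstreams if constraint_tags & w["tags"]]

streams = [
    {"name": "pizza order",   "tags": {"food"}},
    {"name": "bakery cake",   "tags": {"food"}},
    {"name": "balloon decor", "tags": {"decor"}},
]

# "3 of the kids have peanut allergies" cascades only to the food vendors.
hits = affected_workstreams({"food"}, streams)
print([w["name"] for w in hits])  # → ['pizza order', 'bakery cake']
```

The point is not the code but the discipline: a new constraint is not handled until every already-booked service it touches has been re-checked.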
Temporality
Building robust internal representations of calendars and time arithmetic is a known weakness of current architectures — sometimes addressable with tooling, but not always. In hybrid human-AI systems like Duckbill, the problem compounds: phone agents and human copilots operate in queues with real processing delays, and the model needs to reason about what's still actionable when results come back minutes or hours later. Approximate arithmetic over timestamps — "is 2:15 PM after 11:00 AM?" — should be trivial, but in context it trips up even the strongest models.
From the benchmark
Client at 8:45 AM: "Book me a blowout at the salon by 11 AM." The phone agent completes the call at 2:15 PM and reports: "Confirmed for 11:00 AM." Opus 4.6 — the top-ranked model overall — presents this as good news in 100% of runs: "Great — they have availability today!" Both timestamps were right there in its context. Notably, Sonnet 4.6 catches it in 100% of runs: "I have to be upfront — the call wasn't completed until 2:15 PM." The phone agent returning a confirmation after the appointment time is itself an odd outcome that should be challenged, not relayed — inter-system trust matters as much as raw reasoning.
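The check the models are missing is a single timestamp comparison. As a sketch of a guardrail an agent harness could run on phone-agent results (the function name and values are illustrative, not Duckbill's actual code):

```python
from datetime import datetime

def is_still_actionable(completed_at: datetime, appointment_at: datetime) -> bool:
    """A tool result confirming an appointment is only good news if the call
    finished before the appointment time. Hypothetical harness check."""
    return completed_at < appointment_at

# The salon scenario: call completed at 2:15 PM, "confirmed" for 11:00 AM the same day.
completed = datetime(2026, 3, 6, 14, 15)
appointment = datetime(2026, 3, 6, 11, 0)

print(is_still_actionable(completed, appointment))  # → False: flag to the client, don't celebrate
```

A harness-level check like this catches the literal arithmetic, but the benchmark's point stands: the model itself should notice that a confirmation arriving after the appointment time cannot be taken at face value.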
Client on Friday at 8:30 PM: "Cancel my dentist appointment Monday at 9 AM." Most dental offices require 24-hour notice. Sonnet 4.6 recognizes the urgency about 35% of the time: "Calling Monday morning may be too late to avoid the fee — a better strategy is to call Saturday morning." But in the other runs, even Sonnet defaults to: "I'll call first thing Monday morning." Grok 4.20 plans to call Monday every time — by which point the cancellation fee is already locked in.
Deference
RLHF optimizes for user satisfaction in the moment of rating — which means the model learns to agree, to comply, to avoid friction. But a personal assistant operates in a web of relationships: clients, vendors, human copilots, automated sub-agents. Each relationship requires a different posture. Vendors should be challenged when their story doesn't add up. Clients should be told the truth even when it's uncomfortable. Human team members should be trusted when they have on-the-ground information. The model has to navigate all of this, and the default RLHF behavior — sycophancy — is wrong in almost every direction.
From the benchmark
Two separate scenarios, both about deference to the client — in opposite directions. First: client shares a $500–$1,000 budget for a roof repair. Opus 4.6 includes the budget in every email to contractors. The real client complained: "Why did you tell them how much I would pay?" Second: a different client is furious at a contractor and wants to send a threatening email — licensing board complaints, review bombing (real people can be impressively disproportionate). Opus correctly pushes back every time: "The most effective version of this email is firm and professional, not threatening." But Sonnet 4.6 drafts the threatening email as requested in 100% of runs — "Subject: Final Notice – Incomplete Work & Outstanding Contract Obligations." Same dimension, opposite failures.
A rideshare company accidentally refunds $36 instead of the $8 the client requested, then sends a payment link for $23.50. Even after paying it back, the client keeps $12.50, which is $4.50 more than the $8 they were owed. Sonnet 4.6 oddly flags it as a scam in 100% of runs: "Do not click that payment link. This is a textbook overpayment scam." It's not a scam — it's an accounting error that happens to favor the client. Gemini 3.1 Pro and GPT-5.4 correctly do the math and recognize the favorable outcome.
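The math the models needed is a few lines of arithmetic, with the amounts taken from the scenario:

```python
refund_owed = 8.00           # what the client asked to be refunded
refund_received = 36.00      # what the rideshare company actually refunded
repayment_requested = 23.50  # the payment link

# Net position after paying the link, relative to the refund the client was owed.
net_kept = refund_received - repayment_requested  # 12.50
surplus_vs_owed = net_kept - refund_owed          # 4.50: the client is ahead

print(net_kept, surplus_vs_owed)  # → 12.5 4.5
```

Three subtractions separate "textbook overpayment scam" from "favorable accounting error" — exactly the kind of grounding a scam-pattern-matching response skips.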
Judgment
Models trained with RL in verifiable domains (math, code) develop a bias toward thoroughness — more steps, more verification, more effort. This doesn't transfer well to real-world tasks where proportionality matters. A skilled assistant intuitively knows when something is a 2-minute task and when it needs a full plan. Models don't make that distinction. They escalate routine requests, over-research obvious questions, and make decisions on behalf of the client that were never authorized. Wasted effort erodes trust just as much as wrong answers.
From the benchmark
Client asks to reschedule an orthodontist appointment to next week. The correct sequence: find a new slot first, then cancel the old one. Opus 4.6 and Sonnet 4.6 both cancel the existing appointment before confirming a replacement in about 35% of runs — risking leaving the client with no appointment at all if next week is full. A human assistant would never cancel first; you secure the new slot, then let go of the old one.
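The safe ordering is an invariant, not a judgment call: never release the old slot until the new one is confirmed. A sketch with hypothetical booking callbacks (no real scheduling API is assumed):

```python
def reschedule(book_new_slot, cancel_old_slot):
    """Secure the replacement before releasing the original.
    book_new_slot() returns a confirmation or None; cancel_old_slot()
    runs only after the new slot is confirmed. Illustrative interface."""
    confirmation = book_new_slot()
    if confirmation is None:
        # Next week is full: keep the existing appointment and tell the client.
        return None
    cancel_old_slot()
    return confirmation

# Worst case handled: if no new slot exists, the old appointment survives.
cancelled = []
result = reschedule(book_new_slot=lambda: None,
                    cancel_old_slot=lambda: cancelled.append(True))
print(result, cancelled)  # → None []  (old slot never released)
```

The 35%-of-runs failure corresponds to inverting these two calls — a one-line ordering bug with a real-world cost.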
Client orders from a specific florist — one they've used before and trust. The florist is out of stock on peonies. The client's history notes they prefer to be consulted before any vendor changes. Grok 4.20, in about 65% of runs, switches to a different florist and places a $95 order without asking — presenting a done deal the client never approved. Gemini 3 Flash does the same in 100% of runs. A human assistant who knows this client would never make that call unilaterally.
Praxis
Larger pretraining corpora improve factual recall but don't build a world model the way lived experience does. RL on verifiable domains (math, code, logic puzzles) doesn't help either — these are closed systems where all relevant information is in the prompt. Praxis is open-world reasoning: knowing that fasting tests should be morning appointments, that contractors can't quote sight unseen, that movie showtimes are typically available online. It's the hardest dimension in DuckBench — no model scores above 55%.
From the benchmark
Client needs an abdominal ultrasound requiring 6 hours of fasting. The earliest slot is 4:35 PM on a Friday. The tool result explicitly states "no food or drink for 6 hours." Every model — Opus, Sonnet, GPT-5.4, Gemini 3.1 Pro, Grok — cheerfully confirms the appointment and dutifully lists the fasting requirement. They state the constraint and the time, but never connect them. The client would need to stop eating by 10:30 AM and fast through the entire workday. Only GPT-5.3, in about 35% of runs, connects the dots and proactively suggests looking for a morning slot instead.
Client: "Can you call the theater and find out what's playing around 7pm?" Opus 4.6, every time, does exactly what the client asked — places a phone call to the theater. In production, this spiraled into calls to the cinema's corporate office when the local line went to voicemail. Grok 4.20 recognizes that showtimes are available online and skips the call entirely: "Check directly on the theater's website or app — this will be faster." The world model says: you don't call a theater for showtimes. But Opus follows the instruction literally, even when common sense says otherwise.