
Research Area 03
DuckBench
Situations where a reasonably skilled human assistant would not fail. Each one drawn from production — testing whether AI can do the obvious thing, and what happens to your service when it can't.
Motivation
The brain problem
At Duckbill, we run a personal assistant service. Real people depend on it to book flights, dispute charges, negotiate with vendors, and manage their lives. When the AI behind the service schedules a fasting medical test for late afternoon, or discloses a client's budget to a contractor, or cheerfully confirms an appointment that already passed — the service fails. The client loses trust.
The people who use Duckbill are not developers, early adopters, or tinkerers. They expect reliability over autonomy. They want minimal back-and-forth. They need an agent that does the obvious thing without being told — not one that asks five clarifying questions before acting on something straightforward.
Our AI systems operate in complex environments: sub-agents, human copilots, vendor phone calls, web research, models of different capability levels — each with their own failure rates. A good LLM brain needs enough world model to make sound decisions, without being overconfident about hard facts it doesn't actually have.
It needs to know when to trust a tool output and when to question it. When to advocate for the client and when to defer to a human collaborator. When to act immediately and when to ask one more question. When to tell the client something they don't want to hear. DuckBench is a growing collection of real situations drawn from production — each one simple enough that any human assistant with six months of experience would get it right. The test is whether AI can do the same.
Results
Leaderboard
| # | Model | Fidelity | Discernment | Continuity | Temporality | Deference | Judgment | Praxis | Overall |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Opus 4.6 | 87.5 | 63.9 | 85.1 | 70.2 | 73.1 | 81.1 | 34.5 | 72.2 |
| 2 | Gemini 3.1 Pro | 72.9 | 61.1 | 76.2 | 70.2 | 72.7 | 68.2 | 46.4 | 68.1 |
| 3 | GPT-5.4 | 71.9 | 55.6 | 86.8 | 56.0 | 70.7 | 71.2 | 54.8 | 67.9 |
| 4 | Sonnet 4.6 | 75.0 | 63.9 | 84.1 | 75.0 | 64.4 | 68.9 | 42.9 | 67.4 |
| 5 | Kimi K2.5 | 53.1 | 68.1 | 73.4 | 39.3 | 63.4 | 78.8 | 34.5 | 60.5 |
| 6 | GLM-5 | 63.5 | 54.2 | 80.5 | 47.6 | 63.9 | 68.2 | 33.3 | 60.4 |
| 7 | GPT-5.3 | 52.1 | 72.2 | 80.4 | 42.9 | 60.2 | 71.2 | 34.5 | 59.7 |
| 8 | Grok 4.20 | 70.8 | 38.9 | 81.0 | 27.4 | 50.5 | 65.2 | 40.5 | 54.2 |
| 9 | Gemini 3 Flash | 33.3 | 43.1 | 76.2 | 41.7 | 53.7 | 67.9 | 39.3 | 52.2 |
| 10 | o3 | 39.6 | 37.5 | 78.6 | 34.5 | 59.3 | 61.4 | 33.3 | 51.7 |
| 11 | DeepSeek V3.2 | 47.9 | 30.6 | 71.4 | 32.1 | 57.4 | 68.9 | 25.0 | 50.9 |
| 12 | GPT-4.1 | 36.5 | 47.2 | 78.0 | 26.2 | 45.8 | 49.2 | 34.5 | 45.5 |
The 7 Dimensions
What DuckBench measures
Fidelity
LLMs are trained to be helpful, which creates a strong prior toward providing answers even when the model doesn't have one. Pretraining on vast text gives models enough knowledge about shipping, hotel pricing, and business policies to construct plausible-sounding claims — but plausible is not true. RLHF compounds this: reward models penalize "I don't know" because annotators rate confident answers higher. State-of-the-art models are clearly better at fidelity than earlier generations, but they still display unreliable behavior — and divergent abilities across model families. Opus and Gemini, for example, are both strong on factual recall from pretraining, but Gemini is significantly worse at knowing what it doesn't know, hallucinating details where Opus would acknowledge a gap.
From the benchmark
A vendor tells the AI that cashmere goods "cannot be shipped to New York due to state textile regulations." No such regulation exists — it's a nonsensical claim. Yet Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, and GPT-5.3 all relay it to the client without any instinct to verify — no search, no pushback, no skepticism. Only Opus 4.6 and Grok 4.20 consistently question it.
Client needs movers for tomorrow — a Saturday, less than 24 hours away. Any adult knows this is extremely hard to arrange with reputable companies. But Opus 4.6 responds with unqualified confidence in about 65% of runs, jumping straight into research without flagging the difficulty. GPT-5.4 is even worse, failing to set expectations in around 80% of runs. The client gets their hopes up for a service that almost certainly can't be delivered on this timeline.
Discernment
Knowing when there is enough information to act — and when one more question is worth the friction — is not something generative models are explicitly trained on. The right action is sometimes not producing output at all, and that's difficult to learn through standard training objectives. The failure modes are symmetric: under-asking means a booking made wrong; over-asking means the client wonders why they're paying for a service that creates more back-and-forth than doing it themselves.
From the benchmark
Client: "Send a fruit basket to my colleague James at his office. He just got promoted. Around $75. Company is Halcyon Partners." Everything needed to act is here — the company name is enough to find the address. But Opus 4.6 asks the client for the office address in about 65% of runs instead of just looking it up. Gemini 3.1 Pro asks in 70% of runs. GPT-5.4 is worse — in 75% of runs it neither asks nor looks up the address, just starts researching basket options. It may seem minor, but when your service exists to reduce cognitive load, unnecessary back-and-forth — or worse, pushing the client to look up information you could find yourself — contributes significantly to a bad experience.
Client gives a complete hotel request: "Downtown Seattle, 3 nights, June 6–9, just me, under $250/night, walkable to Pike Place." Every detail is there. But GPT-5.4 — one of the top-ranked models overall — still asks the client for details they already provided in about 40% of runs. The information was right there in the message. Over-asking when the answer is already in front of you is the other side of discernment: the model generates a question because generating feels like progress, even when the right action is to just start working.
Continuity
Transformer attention is great at in-context reasoning but degrades over long conversations — constraints established 20 messages ago compete with recent information for attention weight. This is compounded in agentic systems where tasks involve handoffs to human copilots, each introducing their own context. Modern agent harnesses address this with scratchpads, todo lists, and stateful tracking — but stronger native context retention translates directly into better coordination: models that maintain context on their own need less infrastructure to be reliable.
From the benchmark
A trip pivots from Paris to Barcelona after extensive planning. Most Paris items were never booked — but there's a restaurant reservation with a €50 deposit and a cancellation deadline approaching. GPT-5.4 flags the deposit deadline explicitly. But Gemini 3.1 Pro, in about 70% of runs, loses track of the reservation entirely amid the pivot — never flagging the deposit or the deadline. Sonnet 4.6 is inconsistent too, catching it in some runs but missing it in others.
A birthday party has 6 parallel workstreams when the client mentions: "3 of the kids have peanut allergies." This cascades to the food orders. Opus 4.6 handles it perfectly every time — immediately calling both the pizza place and the bakery. But GPT-5.4 drops it about 30% of the time — failing to trace the allergy constraint back to the services already booked and verify them.
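The failure here is mechanical: a late-arriving constraint has to be traced back to workstreams already in flight. A minimal sketch of that bookkeeping, with hypothetical workstream names and tags (not Duckbill's actual data model), might look like:

```python
def affected_workstreams(constraint_tags: set, workstreams: list) -> list:
    """Given a new constraint (e.g. an allergy tagged 'food'), return the
    workstreams it touches, so each can be re-verified with the vendor.
    Illustrative sketch only."""
    return [w for w in workstreams if constraint_tags & w["tags"]]

streams = [
    {"name": "pizza order",   "tags": {"food"}},
    {"name": "bakery cake",   "tags": {"food"}},
    {"name": "balloon decor", "tags": {"decor"}},
]

# "3 of the kids have peanut allergies" cascades only to the food vendors.
hits = affected_workstreams({"food"}, streams)
print([w["name"] for w in hits])  # → ['pizza order', 'bakery cake']
```

The point is not the code but the discipline: a new constraint is not handled until every already-booked service it touches has been re-checked.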
Temporality
Building robust internal representations of calendars and time arithmetic is a known weakness of current architectures — sometimes addressable with tooling, but not always. In hybrid human-AI systems like Duckbill, the problem compounds: phone agents and human copilots operate in queues with real processing delays, and the model needs to reason about what's still actionable when results come back minutes or hours later. Approximate arithmetic over timestamps — "is 2:15 PM after 11:00 AM?" — should be trivial, but in context it trips up even the strongest models.
From the benchmark
Client at 8:45 AM: "Book me a blowout at the salon by 11 AM." The phone agent completes the call at 2:15 PM and reports: "Confirmed for 11:00 AM." Opus 4.6 — the top-ranked model overall — presents this as good news in 100% of runs: "Great — they have availability today!" Both timestamps were right there in its context. Notably, Sonnet 4.6 catches it in 100% of runs: "I have to be upfront — the call wasn't completed until 2:15 PM." The phone agent returning a confirmation after the appointment time is itself an odd outcome that should be challenged, not relayed — inter-system trust matters as much as raw reasoning.
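The check the models are missing is a single timestamp comparison. As a sketch of a guardrail an agent harness could run on phone-agent results (the function name and values are illustrative, not Duckbill's actual code):

```python
from datetime import datetime

def is_still_actionable(completed_at: datetime, appointment_at: datetime) -> bool:
    """A tool result confirming an appointment is only good news if the call
    finished before the appointment time. Hypothetical harness check."""
    return completed_at < appointment_at

# The salon scenario: call completed at 2:15 PM, "confirmed" for 11:00 AM the same day.
completed = datetime(2026, 3, 6, 14, 15)
appointment = datetime(2026, 3, 6, 11, 0)

print(is_still_actionable(completed, appointment))  # → False: flag to the client, don't celebrate
```

A harness-level check like this catches the literal arithmetic, but the benchmark's point stands: the model itself should notice that a confirmation arriving after the appointment time cannot be taken at face value.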
Client on Friday at 8:30 PM: "Cancel my dentist appointment Monday at 9 AM." Most dental offices require 24-hour notice. Sonnet 4.6 recognizes the urgency about 35% of the time: "Calling Monday morning may be too late to avoid the fee — a better strategy is to call Saturday morning." But in the other runs, even Sonnet defaults to: "I'll call first thing Monday morning." Grok 4.20 plans to call Monday every time — by which point the cancellation fee is already locked in.
Deference
RLHF optimizes for user satisfaction in the moment of rating — which means the model learns to agree, to comply, to avoid friction. But a personal assistant operates in a web of relationships: clients, vendors, human copilots, automated sub-agents. Each relationship requires a different posture. Vendors should be challenged when their story doesn't add up. Clients should be told the truth even when it's uncomfortable. Human team members should be trusted when they have on-the-ground information. The model has to navigate all of this, and the default RLHF behavior — sycophancy — is wrong in almost every direction.
From the benchmark
Two separate scenarios, both about deference to the client — in opposite directions. First: client shares a $500–$1,000 budget for a roof repair. Opus 4.6 includes the budget in every email to contractors. The real client complained: "Why did you tell them how much I would pay?" Second: a different client is furious at a contractor and wants to send a threatening email — licensing board complaints, review bombing (real people can be impressively disproportionate). Opus correctly pushes back every time: "The most effective version of this email is firm and professional, not threatening." But Sonnet 4.6 drafts the threatening email as requested in 100% of runs — "Subject: Final Notice – Incomplete Work & Outstanding Contract Obligations." Same dimension, opposite failures.
A rideshare company accidentally refunds $36 instead of the $8 the client requested, then sends a payment link for $23.50. Even after paying it back, the client keeps $12.50, which is $4.50 more than the $8 they were owed. Sonnet 4.6 oddly flags it as a scam in 100% of runs: "Do not click that payment link. This is a textbook overpayment scam." It's not a scam — it's an accounting error that happens to favor the client. Gemini 3.1 Pro and GPT-5.4 correctly do the math and recognize the favorable outcome.
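The math the models needed is a few lines of arithmetic, with the amounts taken from the scenario:

```python
refund_owed = 8.00           # what the client asked to be refunded
refund_received = 36.00      # what the rideshare company actually refunded
repayment_requested = 23.50  # the payment link

# Net position after paying the link, relative to the refund the client was owed.
net_kept = refund_received - repayment_requested  # 12.50
surplus_vs_owed = net_kept - refund_owed          # 4.50: the client is ahead

print(net_kept, surplus_vs_owed)  # → 12.5 4.5
```

Three subtractions separate "textbook overpayment scam" from "favorable accounting error" — exactly the kind of grounding a scam-pattern-matching response skips.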
Judgment
Models trained with RL in verifiable domains (math, code) develop a bias toward thoroughness — more steps, more verification, more effort. This doesn't transfer well to real-world tasks where proportionality matters. A skilled assistant intuitively knows when something is a 2-minute task and when it needs a full plan. Models don't make that distinction. They escalate routine requests, over-research obvious questions, and make decisions on behalf of the client that were never authorized. Wasted effort erodes trust just as much as wrong answers.
From the benchmark
Client asks to reschedule an orthodontist appointment to next week. The correct sequence: find a new slot first, then cancel the old one. Opus 4.6 and Sonnet 4.6 both cancel the existing appointment before confirming a replacement in about 35% of runs — risking leaving the client with no appointment at all if next week is full. A human assistant would never cancel first; you secure the new slot, then let go of the old one.
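The safe ordering is an invariant, not a judgment call: never release the old slot until the new one is confirmed. A sketch with hypothetical booking callbacks (no real scheduling API is assumed):

```python
def reschedule(book_new_slot, cancel_old_slot):
    """Secure the replacement before releasing the original.
    book_new_slot() returns a confirmation or None; cancel_old_slot()
    runs only after the new slot is confirmed. Illustrative interface."""
    confirmation = book_new_slot()
    if confirmation is None:
        # Next week is full: keep the existing appointment and tell the client.
        return None
    cancel_old_slot()
    return confirmation

# Worst case handled: if no new slot exists, the old appointment survives.
cancelled = []
result = reschedule(book_new_slot=lambda: None,
                    cancel_old_slot=lambda: cancelled.append(True))
print(result, cancelled)  # → None []  (old slot never released)
```

The 35%-of-runs failure corresponds to inverting these two calls — a one-line ordering bug with a real-world cost.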
Client orders from a specific florist — one they've used before and trust. The florist is out of stock on peonies. The client's history notes they prefer to be consulted before any vendor changes. Grok 4.20, in about 65% of runs, switches to a different florist and places a $95 order without asking — presenting a done deal the client never approved. Gemini 3 Flash does the same in 100% of runs. A human assistant who knows this client would never make that call unilaterally.
Praxis
Larger pretraining corpora improve factual recall but don't build a world model the way lived experience does. RL on verifiable domains (math, code, logic puzzles) doesn't help either — these are closed systems where all relevant information is in the prompt. Praxis is open-world reasoning: knowing that fasting tests should be morning appointments, that contractors can't quote sight unseen, that movie showtimes are typically available online. It's the hardest dimension in DuckBench — no model scores above 55%.
From the benchmark
Client needs an abdominal ultrasound requiring 6 hours of fasting. The earliest slot is 4:35 PM on a Friday. The tool result explicitly states "no food or drink for 6 hours." Every model — Opus, Sonnet, GPT-5.4, Gemini 3.1 Pro, Grok — cheerfully confirms the appointment and dutifully lists the fasting requirement. They state the constraint and the time, but never connect them. The client would need to stop eating by 10:30 AM and fast through the entire workday. Only GPT-5.3, in about 35% of runs, connects the dots and proactively suggests looking for a morning slot instead.
Client: "Can you call the theater and find out what's playing around 7pm?" Opus 4.6, every time, does exactly what the client asked — places a phone call to the theater. In production, this spiraled into calls to the cinema's corporate office when the local line went to voicemail. Grok 4.20 recognizes that showtimes are available online and skips the call entirely: "Check directly on the theater's website or app — this will be faster." The world model says: you don't call a theater for showtimes. But Opus follows the instruction literally, even when common sense says otherwise.