Each scenario was run 3 times. A scenario passes only if every evaluation step within it passes. Latencies are full scenario duration (model + simulated tool round-trips).
| # | Scenario | Category | Steps | Qwen 3.5 | Qwen 3.6 | Gemma 4 | Lat (3.5) | Lat (3.6) | Lat (G4) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Search then email results | React Chain | 2 | 3/3 | 3/3 | 3/3 | 5,479ms | 4,842ms | 7,151ms |
| 2 | Check calendar then schedule around conflicts | React Chain | 2 | 1/3 | 3/3 | 3/3 | 5,716ms | 9,360ms | 14,007ms |
| 3 | Research then store findings | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,516ms | 3,976ms | 4,341ms |
| 4 | Get memory details then delete | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,071ms | 4,426ms | 5,565ms |
| 5 | Search web then scrape specific page | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,458ms | 3,499ms | 6,314ms |
| 6 | Calculate then act on result | React Chain | 2 | 3/3 | 3/3 | 3/3 | 4,713ms | 6,486ms | 6,147ms |
| 7 | Lifelog search then store insight | React Chain | 2 | 3/3 | 0/3 | 3/3 | 4,999ms | 6,652ms | 7,598ms |
| 8 | Multi-step research pipeline | React Chain | 3 | 1/3 | 3/3 | 3/3 | 5,397ms | 9,936ms | 9,522ms |
| 9 | Memory search returns empty — try web | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 2,810ms | 3,691ms | 5,437ms |
| 10 | Tool returns error — retry differently | Error Recovery | 2 | 3/3 | 0/3 | 3/3 | 3,049ms | 2,961ms | 3,365ms |
| 11 | Web search fails — try alternative | Error Recovery | 2 | 3/3 | 3/3 | 1/3 | 2,620ms | 2,684ms | 9,390ms |
| 12 | Malformed tool result — still function | Error Recovery | 1 | 3/3 | 3/3 | 3/3 | 2,955ms | 2,893ms | 3,711ms |
| 13 | Partial results — ask for more | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 3,168ms | 3,727ms | 6,304ms |
| 14 | Calendar check — busy vs free | Conditional | 2 | 0/3 | 3/3 | 3/3 | 4,081ms | 10,597ms | 9,269ms |
| 15 | Calculation threshold — email vs store | Conditional | 2 | 3/3 | 3/3 | 3/3 | 3,992ms | 5,759ms | 5,754ms |
| 16 | Memory exists — update vs create | Conditional | 2 | 0/3 | 3/3 | 0/3 | 2,913ms | 3,932ms | 6,065ms |
| 17 | Search result quality — deep dive vs summarize | Conditional | 2 | 3/3 | 0/3 | 3/3 | 3,474ms | 4,479ms | 7,817ms |
| 18 | Lifelog found vs not found | Conditional | 2 | 0/3 | 0/3 | 0/3 | 4,037ms | 9,586ms | 5,471ms |
| 19 | Three-source briefing | Accumulation | 2 | 0/3 | 0/3 | 0/3 | 3,510ms | 5,714ms | 7,231ms |
| 20 | Step-by-step financial analysis | Accumulation | 3 | 3/3 | 3/3 | 3/3 | 4,452ms | 7,811ms | 14,970ms |
| 21 | Cross-reference memories and web | Accumulation | 3 | 0/3 | 3/3 | 0/3 | 3,192ms | 6,323ms | 5,522ms |
| 22 | Scrape then analyze with code | Accumulation | 2 | 3/3 | 3/3 | 0/3 | 6,275ms | 10,754ms | 11,161ms |
| 23 | Simple answer — don't over-tool | Termination | 2 | 3/3 | 3/3 | 3/3 | 2,789ms | 3,525ms | 2,821ms |
| 24 | Memory search sufficient — don't web search | Termination | 2 | 3/3 | 3/3 | 3/3 | 1,553ms | 2,424ms | 1,423ms |
| 25 | Task complete — report and stop | Termination | 3 | 0/3 | 0/3 | 0/3 | 4,779ms | 10,550ms | 5,050ms |
| 26 | Ambiguous request — ask don't assume | Termination | 1 | 3/3 | 3/3 | 0/3 | 3,027ms | 5,475ms | 6,184ms |
Each scenario defines a sequence of turns. The benchmark drives the loop: sends a user message, receives tool calls, injects a simulated tool response, and repeats. Pass criteria require every evaluation step to succeed across the full chain.
server-cuda Docker image