Three open-source AI models. Same hardware (a single high-end consumer graphics card you could put in a desktop PC). Same job: act as an agent — a model that doesn't just chat, but uses tools (search the web, send email, read your calendar, save things to memory) to actually do work for you.
We ran the same battery of tests on all three. Below is the verdict for someone who wants to use one of these, not benchmark them.
Qwen 3.6 27B is the best all-rounder — highest overall score, and the strongest at the hardest reasoning tasks (combining info from multiple sources, knowing when to act on what it remembers). But each model has a real strength worth knowing about.
**Qwen 3.6 27B**

Pick this if: you want one model to handle a wide range of work and you're not sure exactly what you'll throw at it.
Best at: combining information from multiple sources, remembering things you tell it without being asked.
Watch out: slightly worse than 3.5 at simple "send this email" requests. Slowest of the Qwens.
**Qwen 3.5 27B**

Pick this if: speed matters and you're using it for known, well-defined tasks ("send this email," "look this up," "add this to my calendar").
Best at: recovering when something goes wrong — 100% on error-recovery tests, the only model to nail this category. Fastest of the three (3.8s avg vs 5.8s for 3.6 and 6.8s for Gemma).
Watch out: weaker at multi-step reasoning. Won't save things you mention unless you tell it to.
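To make "error recovery" concrete, here is a minimal sketch of the pattern those tests exercise. Everything in it (`flaky_search`, `agent_with_recovery`, the retry budget) is an invented illustration, not code from the benchmark harness: a tool fails on its first call, the failure is surfaced back to the agent, and a passing model retries instead of giving up.

```python
# Illustrative only: the shape of an error-recovery scenario.
# Tool and function names are hypothetical, not from the benchmark.

def flaky_search(query, _attempts={"n": 0}):
    """Simulated tool that times out on its first call, then succeeds."""
    _attempts["n"] += 1
    if _attempts["n"] == 1:
        raise TimeoutError("search backend timed out")
    return f"results for {query!r}"

def agent_with_recovery(query, tool, max_retries=2):
    """Call the tool; on failure, surface the error and try again."""
    last_error = None
    for attempt in range(1, max_retries + 2):
        try:
            return tool(query)
        except Exception as err:
            # A recovering model sees this error text in its context
            # and decides to retry (or adjust its arguments).
            last_error = f"attempt {attempt} failed: {err}"
    return last_error

print(agent_with_recovery("llama.cpp tool calling", flaky_search))
```

A model that scores 100% here is one whose second attempt reliably incorporates what the error message told it, rather than repeating the same broken call or abandoning the task.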
**Gemma 4 31B**

Pick this if: you do a lot of two-step workflows — search then email, calculate then act, look something up then store the answer.
Best at: chained tool calls (100% on react-chain scenarios) — it reliably reasons about a tool's output before deciding what to do next.
Watch out: biggest model (24 GB graphics memory). Slowest. Struggles to keep going when its first attempt at a task hits a snag.
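A "react-chain" scenario of the kind Gemma aces can be sketched as a loop in which the agent must read one tool's output before choosing the arguments of the next call. The tools, the canned `toy_policy` standing in for the model, and the email address below are all hypothetical stand-ins:

```python
# Illustrative sketch of a chained tool call: search first, then use
# what the search returned to address the email. Names are invented.

def web_search(query):
    return {"top_result": "ACME Corp support email is help@acme.example"}

def send_email(to, body):
    return f"sent to {to}"

def toy_policy(observation):
    """Stand-in for the model: choose the next action from what it saw."""
    if observation is None:
        return ("web_search", {"query": "ACME support contact"})
    if "help@acme.example" in observation.get("top_result", ""):
        return ("send_email", {"to": "help@acme.example",
                               "body": "Please reset my account."})
    return ("stop", {})

tools = {"web_search": web_search, "send_email": send_email}
observation, log = None, []
while True:
    action, args = toy_policy(observation)
    if action == "stop":
        break
    result = tools[action](**args)
    log.append((action, result))
    observation = result if isinstance(result, dict) else {"top_result": ""}

print(log)
```

The failure mode this pattern catches is a model that fires both calls up front (guessing the email address) instead of waiting for the search result; scoring 100% means the second call's arguments always come from the first call's output.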
| If you want to… | Pick |
|---|---|
| Build a general-purpose assistant that handles whatever | Qwen 3.6 27B |
| Have it remember things from conversations automatically | Qwen 3.6 27B |
| Run agentic workflows fast (latency-sensitive) | Qwen 3.5 27B |
| Hand off tasks where things often go wrong (flaky APIs, errors) | Qwen 3.5 27B |
| Fit on a smaller graphics card (16–20 GB) | Qwen 3.6 27B |
| Heavy multi-step research workflows (search→summarize→store→email) | Gemma 4 31B |
| Synthesize across 3+ data sources | Qwen 3.6 27B |
Everything below this point is the raw benchmark data. Skip it unless you want category-level pass rates, per-scenario tables, or the methodology used to compute the numbers above.
- Agentic scenarios: 26 scenarios across 5 categories (react chains, error recovery, conditional branching, multi-source accumulation, termination judgment), run 3 times each — 78 scenario-runs per model.
- Tool calling: 42 tool-adherence tests (5 runs = 210 calls) plus 35 proactive-intelligence tests (3 runs = 105 calls), covering disambiguation, parallelism, restraint, storage asymmetry, and judgment.
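For clarity on how the category-level pass rates above are aggregated, here is a minimal sketch. It assumes each scenario-run is a binary pass/fail and a category's score is passes over total runs; the run data below is invented:

```python
# Hypothetical scenario-run results: (category, passed) per run.
from collections import defaultdict

runs = [
    ("error-recovery", True), ("error-recovery", True),
    ("error-recovery", True), ("react-chain", True),
    ("react-chain", False), ("react-chain", True),
]

totals = defaultdict(lambda: [0, 0])  # category -> [passes, runs]
for category, passed in runs:
    totals[category][0] += int(passed)
    totals[category][1] += 1

rates = {c: p / n for c, (p, n) in totals.items()}
print(rates)  # error-recovery: 1.0, react-chain: ~0.67
```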
Can a model run agent-grade tool calling on hardware a normal person can buy?
All benchmarks use the same Q4_K_M GGUF quant from Unsloth, the same llama.cpp inference stack with `--jinja` and `parallel_tool_calls=true`, and the same RTX 5090 (32 GB, consumer flagship). Only the model changes between runs.
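As a rough sketch, that setup corresponds to something like the following llama.cpp invocation. The model filename, GPU-offload setting, and port are placeholders, not the benchmark's actual configuration, and `parallel_tool_calls` is a per-request field in the OpenAI-style API rather than a server flag:

```shell
# Placeholder filename and settings; --jinja enables llama.cpp's
# chat-template path, which tool calling depends on.
llama-server -m Qwen3.6-27B-Q4_K_M.gguf --jinja -ngl 99 --port 8080

# parallel_tool_calls is then sent by the client on each request:
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "messages": [{"role": "user", "content": "Check the weather and my calendar for tomorrow."}],
        "parallel_tool_calls": true
      }'
```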
Qwen 3.5's notable weakness, 53% on proactive storage, is now 80–87% in 3.6. The model proactively saves decisions, strategic insights, and contact info that previously required explicit instruction.
Tool adherence baseline dropped from 95.2% to 92.9%, optimized from 97.6% to 90.0%. New failure cluster: the model skips `send_email` in disambiguation cases where it should fire (0/5 on "email not SMS").
Token-generation throughput went from 60 → 72 tok/s on the same hardware (a 20% gain). Note: end-to-end multi-turn scenario latency is actually longer for 3.6 (5.8s vs 3.8s for 3.5) — the new hybrid architecture generates faster per-token but takes more reasoning steps per task. Restraint stayed at 12/12 — no over-acting.
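A back-of-envelope model shows how faster per-token generation can still lose on end-to-end latency. Only the 60 and 72 tok/s throughputs come from the benchmark; the step counts and tokens-per-step below are invented purely to illustrate the tradeoff:

```python
# Illustrative only: why more reasoning steps can outweigh faster
# token generation. Step counts and token budgets are made up.

def task_latency(steps, tokens_per_step, tok_per_s):
    """End-to-end generation time for a multi-step agent task."""
    return steps * tokens_per_step / tok_per_s

old = task_latency(steps=2, tokens_per_step=110, tok_per_s=60)  # ~3.7 s
new = task_latency(steps=3, tokens_per_step=140, tok_per_s=72)  # ~5.8 s
print(f"fewer steps at 60 tok/s: {old:.1f}s; more steps at 72 tok/s: {new:.1f}s")
```

With these made-up budgets, the 20% throughput gain is swamped by the extra step: the faster model finishes the whole task later, which matches the latency inversion reported above.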