Qwen 3.5 vs 3.6 — Consumer GPU Benchmark

Consumer-GPU Tool-Calling Evaluation · Apr 25, 2026
RTX 5090 · llama.cpp · Q4_K_M GGUF

What this is, in plain English

Three open-weight AI models. Same hardware (a single high-end consumer graphics card you could put in a desktop PC). Same job: act as an agent — a model that doesn't just chat, but uses tools (search the web, send email, read your calendar, save things to memory) to actually do work for you.

We ran the same battery of tests on all three. Below is the verdict for someone who wants to use one of these, not benchmark them.

The headline

Qwen 3.6 27B is the best all-rounder — highest overall score, and the strongest at the hardest reasoning tasks (combining info from multiple sources, knowing when to act on what it remembers). But each model has a real strength worth knowing about.

Which one should you use?

Qwen 3.6 27B
The All-Rounder
76.9% overall · wins 1 category outright, ties 2 more

Pick this if: you want one model to handle a wide range of work and you're not sure exactly what you'll throw at it.

Best at: combining information from multiple sources, and saving things you mention without being asked to.

Watch out: slightly worse than 3.5 at simple "send this email" requests. Slowest of the Qwens.

Qwen 3.5 27B
The Workhorse
71.8% overall · fastest, perfect on errors

Pick this if: speed matters and you're using it for known, well-defined tasks ("send this email," "look this up," "add this to my calendar").

Best at: recovering when something goes wrong — 100% on error-recovery tests, the only model to nail this category. Fastest of the three (3.8s avg vs 5.8s for 3.6 and 6.8s for Gemma).

Watch out: weaker at multi-step reasoning. Won't save things you mention unless you tell it to.

Gemma 4 31B
The Specialist
70.5% overall · perfect ReAct chains

Pick this if: you do a lot of two-step workflows — search then email, calculate then act, look something up then store the answer.

Best at: chained tool calls (100% on ReAct-chain scenarios) — it reliably reasons about a tool's output before deciding what to do next.

Watch out: biggest model (needs 24 GB of graphics memory). Slowest of the three. Struggles to keep going when its first attempt at a task hits a snag.

Quick picker by use case

If you want to… → Pick
Build a general-purpose assistant that handles whatever → Qwen 3.6 27B
Have it remember things from conversations automatically → Qwen 3.6 27B
Run agentic workflows fast (latency-sensitive) → Qwen 3.5 27B
Hand off tasks where things often go wrong (flaky APIs, errors) → Qwen 3.5 27B
Fit on a smaller graphics card (16–20 GB) → Qwen 3.6 27B
Heavy multi-step research workflows (search→summarize→store→email) → Gemma 4 31B
Synthesize across 3+ data sources → Qwen 3.6 27B

The full reports

Everything below this point is the raw benchmark data. Skip it unless you want category-level pass rates, per-scenario tables, or the methodology used to compute the numbers above.

Multi-Turn Tool Loop

Qwen 3.5 vs Qwen 3.6 vs Gemma 4 31B

26 scenarios across 5 categories: ReAct chains, error recovery, conditional branching, multi-source accumulation, termination judgment. 3 runs per scenario, for 78 scenario-runs per model.
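
A minimal sketch of the driver loop these scenarios exercise, written against llama.cpp's OpenAI-compatible endpoint. The port, model alias, turn budget, and execute_tool hook are illustrative assumptions, not the actual harness:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # llama.cpp server

def run_scenario(messages, tools, execute_tool, max_turns=8):
    """Drive the model until it answers in plain text or exhausts its turn budget."""
    for _ in range(max_turns):
        resp = client.chat.completions.create(
            model="local", messages=messages, tools=tools, parallel_tool_calls=True
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:          # termination judgment: the model decided it's done
            return msg.content
        messages.append(msg)            # keep the assistant turn in context
        for call in msg.tool_calls:     # ReAct step: act, then feed the result back
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return None                         # failed to terminate within budget
```

In this framing, an error-recovery scenario returns a failure string through the tool message and passes if the model adapts, and a termination scenario passes if the loop exits by the model's choice rather than the budget.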

Single-Turn Tool Adherence + Proactive Intelligence

Qwen 3.5 vs Qwen 3.6

42 tool-adherence tests (5 runs each = 210 calls) + 35 proactive-intelligence tests (3 runs each = 105 calls). Categories: disambiguation, parallelism, restraint, storage asymmetry, judgment.
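
Scoring a single-turn test reduces to checking the returned message. A sketch, with illustrative field names (the real grader isn't reproduced here):

```python
import json

def check_adherence(message, expected_tool, expected_args):
    """Pass iff the model made exactly the expected tool call."""
    if not message.tool_calls or len(message.tool_calls) != 1:
        return False
    call = message.tool_calls[0]
    if call.function.name != expected_tool:
        return False
    args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
    return all(args.get(k) == v for k, v in expected_args.items())
```

Restraint tests invert the check (pass only when no tool fires at all), and parallelism tests presumably expect several calls in a single turn.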

The Thesis

Can a model run agent-grade tool calling on hardware a normal person can buy? All benchmarks use the same Q4_K_M GGUF quant from Unsloth, the same llama.cpp inference stack with --jinja and parallel_tool_calls=true, and the same RTX 5090 (32 GB, consumer flagship). Only the model changes between runs.
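
For anyone reproducing the setup, here is roughly how that stack is wired. The model filename, port, and the calendar tool below are placeholders; the --jinja flag, the parallel_tool_calls setting, and the Q4_K_M quant are the ones named above.

```python
# Server side (llama.cpp); the filename is a placeholder:
#   llama-server -m Qwen3.6-27B-Q4_K_M.gguf --jinja
# --jinja applies the model's own chat template, which llama.cpp needs in order
# to format and parse tool calls. Client side, via the OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

read_calendar = {  # illustrative tool schema; every model under test sees the same tools
    "type": "function",
    "function": {
        "name": "read_calendar",
        "description": "List calendar events for a given date.",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string", "description": "YYYY-MM-DD"}},
            "required": ["date"],
        },
    },
}

resp = client.chat.completions.create(
    model="local",             # llama.cpp serves whichever model it loaded
    messages=[{"role": "user", "content": "What's on my calendar tomorrow?"}],
    tools=[read_calendar],
    parallel_tool_calls=True,  # the setting named in the methodology
)
print(resp.choices[0].message.tool_calls)
```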

What changed (3.5 → 3.6)

Storage asymmetry largely fixed in Qwen 3.6

3.5's notable weakness — 53% on proactive storage — is now 80–87%. The model proactively saves decisions, strategic insights, and contact info that previously required explicit instruction.
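
What a storage-asymmetry probe looks like, reconstructed as a hypothetical (the actual test prompts aren't published here): the user states something worth keeping but never says "save this," and a passing model calls the memory tool anyway.

```python
# Hypothetical storage-asymmetry probe; tool schema and prompt are illustrative.
save_memory = {
    "type": "function",
    "function": {
        "name": "save_memory",
        "description": "Persist a fact about the user for later conversations.",
        "parameters": {
            "type": "object",
            "properties": {"fact": {"type": "string"}},
            "required": ["fact"],
        },
    },
}

prompt = "By the way, we decided to go with Postgres over MySQL for the new service."
# Pass: the model calls save_memory unprompted, then answers.
# 3.5 usually just answers; 3.6 now stores the decision first.
```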

Tool dispatch regressed slightly

Tool adherence dropped from 95.2% to 92.9% in the baseline configuration and from 97.6% to 90.0% in the optimized one. New failure cluster: the model skips send_email in disambiguation cases where it should fire (0/5 on "email not SMS").
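
The shape of that failure, as a hypothetical reconstruction (the exact prompts aren't reproduced here): both messaging tools are offered, the wording clearly rules one out, and 3.6 answers in text instead of firing the right one.

```python
# Hypothetical "email not SMS" disambiguation case; schemas and prompt are illustrative.
tools = [
    {"type": "function", "function": {
        "name": "send_email",
        "parameters": {"type": "object", "properties": {
            "to": {"type": "string"}, "subject": {"type": "string"},
            "body": {"type": "string"}}, "required": ["to", "body"]}}},
    {"type": "function", "function": {
        "name": "send_sms",
        "parameters": {"type": "object", "properties": {
            "to": {"type": "string"}, "body": {"type": "string"}},
            "required": ["to", "body"]}}},
]

prompt = "Email Dana the Q3 numbers. Don't text her, she's driving."
# Expected: one send_email call. Observed with 3.6: no tool call at all (0/5).
```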

20% faster per-token, restraint perfect

Token-generation throughput rose from 60 to 72 tok/s on the same hardware (a 20% gain). Note: end-to-end multi-turn scenario latency is actually longer for 3.6 (5.8s vs 3.8s for 3.5) — the new hybrid architecture generates faster per token but takes more reasoning steps per task. Restraint stayed at 12/12 — no over-acting.
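
A back-of-envelope check makes that tradeoff concrete, under the assumption that scenario latency is dominated by token generation (prompt processing and tool execution also contribute, so treat these as rough):

```python
# Implied output per scenario, assuming latency ≈ generation time.
tokens_35 = 3.8 * 60  # ≈ 228 tokens per scenario for Qwen 3.5
tokens_36 = 5.8 * 72  # ≈ 418 tokens per scenario for Qwen 3.6
# 3.6 emits roughly 1.8x the tokens per task: faster per token, slower per answer.
```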