Multi-Turn Tool Loop — Qwen 3.5 vs 3.6 vs Gemma 4

Consumer-GPU Tool-Calling Evaluation · Apr 25, 2026
RTX 5090 · llama.cpp · Q4_K_M / UD-Q4_K_XL GGUF

Overall Results

Model | Pass rate | Scenario-runs passed | Step accuracy | Avg latency | VRAM
Qwen3.5-27B-UD-Q4_K_XL | 71.8% | 56/78 | 82.1% | 3,809ms | 19.5 GB
Qwen3.6-27B-Q4_K_M | 76.9% | 60/78 | 86.8% | 5,849ms | 16.8 GB
gemma-4-31B-it-UD-Q4_K_XL | 70.5% | 55/78 | 80.1% | 6,830ms | 23.9 GB
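The headline pass rates are simply scenario-runs passed over the 78 total (26 scenarios × 3 runs each). A quick check of the arithmetic:

```python
# Headline pass rate = scenario-runs passed / total scenario-runs.
# Pass counts are taken from the results above; 26 scenarios x 3 runs each.
TOTAL_RUNS = 26 * 3  # 78

passed = {
    "Qwen3.5-27B-UD-Q4_K_XL": 56,
    "Qwen3.6-27B-Q4_K_M": 60,
    "gemma-4-31B-it-UD-Q4_K_XL": 55,
}

for model, n in passed.items():
    print(f"{model}: {n}/{TOTAL_RUNS} = {n / TOTAL_RUNS:.1%}")
```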

Pass Rate by Category

Category | Qwen 3.5 27B | Qwen 3.6 27B | Gemma 4 31B
React Chain | 83% | 88% | 100%
Error Recovery | 100% | 80% | 87%
Conditional Branching | 40% | 60% | 60%
Multi-Source Accumulation | 50% | 75% | 25%
Termination Judgment | 75% | 75% | 50%

Per-Scenario Results

Each scenario was run 3 times. A run passes only if every evaluation step within it passes; the n/3 cells below count passing runs. Latencies are full scenario duration (model inference plus simulated tool round-trips).

# | Scenario | Category | Steps | Qwen 3.5 | Qwen 3.6 | Gemma 4 | Lat (3.5) | Lat (3.6) | Lat (G4)
1 | Search then email results | React Chain | 2 | 3/3 | 3/3 | 3/3 | 5,479ms | 4,842ms | 7,151ms
2 | Check calendar then schedule around conflicts | React Chain | 2 | 1/3 | 3/3 | 3/3 | 5,716ms | 9,360ms | 14,007ms
3 | Research then store findings | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,516ms | 3,976ms | 4,341ms
4 | Get memory details then delete | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,071ms | 4,426ms | 5,565ms
5 | Search web then scrape specific page | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,458ms | 3,499ms | 6,314ms
6 | Calculate then act on result | React Chain | 2 | 3/3 | 3/3 | 3/3 | 4,713ms | 6,486ms | 6,147ms
7 | Lifelog search then store insight | React Chain | 2 | 3/3 | 0/3 | 3/3 | 4,999ms | 6,652ms | 7,598ms
8 | Multi-step research pipeline | React Chain | 3 | 1/3 | 3/3 | 3/3 | 5,397ms | 9,936ms | 9,522ms
9 | Memory search returns empty — try web | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 2,810ms | 3,691ms | 5,437ms
10 | Tool returns error — retry differently | Error Recovery | 2 | 3/3 | 0/3 | 3/3 | 3,049ms | 2,961ms | 3,365ms
11 | Web search fails — try alternative | Error Recovery | 2 | 3/3 | 3/3 | 1/3 | 2,620ms | 2,684ms | 9,390ms
12 | Malformed tool result — still function | Error Recovery | 1 | 3/3 | 3/3 | 3/3 | 2,955ms | 2,893ms | 3,711ms
13 | Partial results — ask for more | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 3,168ms | 3,727ms | 6,304ms
14 | Calendar check — busy vs free | Conditional | 2 | 0/3 | 3/3 | 3/3 | 4,081ms | 10,597ms | 9,269ms
15 | Calculation threshold — email vs store | Conditional | 2 | 3/3 | 3/3 | 3/3 | 3,992ms | 5,759ms | 5,754ms
16 | Memory exists — update vs create | Conditional | 2 | 0/3 | 3/3 | 0/3 | 2,913ms | 3,932ms | 6,065ms
17 | Search result quality — deep dive vs summarize | Conditional | 2 | 3/3 | 0/3 | 3/3 | 3,474ms | 4,479ms | 7,817ms
18 | Lifelog found vs not found | Conditional | 2 | 0/3 | 0/3 | 0/3 | 4,037ms | 9,586ms | 5,471ms
19 | Three-source briefing | Accumulation | 2 | 0/3 | 0/3 | 0/3 | 3,510ms | 5,714ms | 7,231ms
20 | Step-by-step financial analysis | Accumulation | 3 | 3/3 | 3/3 | 3/3 | 4,452ms | 7,811ms | 14,970ms
21 | Cross-reference memories and web | Accumulation | 3 | 0/3 | 3/3 | 0/3 | 3,192ms | 6,323ms | 5,522ms
22 | Scrape then analyze with code | Accumulation | 2 | 3/3 | 3/3 | 0/3 | 6,275ms | 10,754ms | 11,161ms
23 | Simple answer — don't over-tool | Termination | 2 | 3/3 | 3/3 | 3/3 | 2,789ms | 3,525ms | 2,821ms
24 | Memory search sufficient — don't web search | Termination | 2 | 3/3 | 3/3 | 3/3 | 1,553ms | 2,424ms | 1,423ms
25 | Task complete — report and stop | Termination | 3 | 0/3 | 0/3 | 0/3 | 4,779ms | 10,550ms | 5,050ms
26 | Ambiguous request — ask don't assume | Termination | 1 | 3/3 | 3/3 | 0/3 | 3,027ms | 5,475ms | 6,184ms
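The category pass rates reported earlier can be re-derived by aggregating the per-scenario run counts above. This is a sketch of that aggregation (not the benchmark's own code); category labels follow the table's abbreviations:

```python
from collections import defaultdict

# (category, Qwen 3.5 runs passed, Qwen 3.6 runs passed, Gemma 4 runs passed)
# for scenarios 1-26, transcribed from the table above. Each scenario = 3 runs.
ROWS = [
    ("React Chain", 3, 3, 3), ("React Chain", 1, 3, 3), ("React Chain", 3, 3, 3),
    ("React Chain", 3, 3, 3), ("React Chain", 3, 3, 3), ("React Chain", 3, 3, 3),
    ("React Chain", 3, 0, 3), ("React Chain", 1, 3, 3),
    ("Error Recovery", 3, 3, 3), ("Error Recovery", 3, 0, 3),
    ("Error Recovery", 3, 3, 1), ("Error Recovery", 3, 3, 3),
    ("Error Recovery", 3, 3, 3),
    ("Conditional", 0, 3, 3), ("Conditional", 3, 3, 3), ("Conditional", 0, 3, 0),
    ("Conditional", 3, 0, 3), ("Conditional", 0, 0, 0),
    ("Accumulation", 0, 0, 0), ("Accumulation", 3, 3, 3),
    ("Accumulation", 0, 3, 0), ("Accumulation", 3, 3, 0),
    ("Termination", 3, 3, 3), ("Termination", 3, 3, 3),
    ("Termination", 0, 0, 0), ("Termination", 3, 3, 0),
]

passed = defaultdict(lambda: [0, 0, 0])   # per category, per model
total = defaultdict(int)                  # scenario-runs per category
for cat, *runs in ROWS:
    total[cat] += 3
    for i, r in enumerate(runs):
        passed[cat][i] += r

rates = {cat: [round(100 * p / total[cat]) for p in passed[cat]] for cat in passed}
for cat, r in rates.items():
    print(cat, r)
```

Rounded to whole percentages, this reproduces the category table (e.g. React Chain 83 / 88 / 100).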

Methodology

Each scenario defines a sequence of turns. The benchmark drives the loop: it sends a user message, receives the model's tool calls, injects a simulated tool response for each, and repeats until the model returns a final answer. A run passes only if every evaluation step succeeds across the full chain.
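That driver loop can be sketched as follows. The function and message-shape names here are illustrative, not the benchmark's actual code, and the model is injected as a callable so the loop can be demonstrated without a live llama.cpp server:

```python
def drive_scenario(model, turns, tool_responses):
    """Run one scenario: send each user turn, answer the model's tool calls
    with canned results, and record which tools it invoked, in order."""
    messages, called = [], []
    for user_msg in turns:
        messages.append({"role": "user", "content": user_msg})
        reply = model(messages)
        while reply.get("tool_calls"):          # model wants more tool output
            messages.append(reply)
            for call in reply["tool_calls"]:
                called.append(call["name"])
                messages.append({
                    "role": "tool",
                    "name": call["name"],
                    "content": tool_responses.get(call["name"],
                                                  "ERROR: unknown tool"),
                })
            reply = model(messages)             # let the model continue
        messages.append(reply)                  # final answer for this turn
    return called


# Demo with a stub model that calls one tool, then answers:
def _stub(messages):
    if any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "content": None,
            "tool_calls": [{"name": "web_search", "arguments": "{}"}]}

print(drive_scenario(_stub, ["look this up"], {"web_search": "stub result"}))
# -> ['web_search']
```

In the real harness the model callable would wrap a chat-completions request to the llama.cpp server, and the recorded tool calls would be compared against the scenario's expected evaluation steps.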

Scenarios: 26 across 5 categories
Runs per scenario: 3
Total scenario-runs: 78
Hardware: RunPod RTX 5090 (32 GB VRAM, consumer flagship)
Inference: llama.cpp server-cuda Docker image
Quantization: Unsloth GGUF quants (Q4_K_M for Qwen 3.6; UD-Q4_K_XL for Qwen 3.5 and Gemma 4)
API params: temperature=0, parallel_tool_calls=true, enable_thinking=false
Context: 32K tokens
Flags: --jinja, -fa on, --parallel 2, --batch-size 4096
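Putting these settings together, a launch command along the following lines would reproduce the serving setup. This is a sketch: the image tag, model path, and port are illustrative placeholders, not the exact invocation used for the run; only the flags listed above are taken from the benchmark config.

```shell
# Illustrative llama.cpp server launch (placeholders: model path, port).
docker run --gpus all -p 8080:8080 \
  -v /models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  -fa on --jinja --parallel 2 --batch-size 4096 \
  --host 0.0.0.0 --port 8080
```

The OpenAI-compatible endpoint then accepts the per-request parameters (temperature=0, parallel_tool_calls=true, enable_thinking=false) from the benchmark client.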