Multi-Turn Tool Loop — Qwen 3.5 vs 3.6 vs Gemma 4

Consumer-GPU Tool-Calling Evaluation · Apr 25, 2026
RTX 5090 · llama.cpp · Q4_K_M / UD-Q4_K_XL GGUF

Overall Results

Model | Pass rate | Scenario-runs passed | Step accuracy | Avg latency | VRAM
Qwen3.5-27B-UD-Q4_K_XL | 71.8% | 56/78 | 82.1% | 3,809ms | 19.5 GB
Qwen3.6-27B-Q4_K_M | 76.9% | 60/78 | 86.8% | 5,849ms | 16.8 GB
gemma-4-31B-it-UD-Q4_K_XL | 70.5% | 55/78 | 80.1% | 6,830ms | 23.9 GB
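The headline pass rates are simply scenario-runs passed over the 78 total (26 scenarios × 3 runs each). A quick check of the arithmetic:

```python
# Headline pass rate = scenario-runs passed / total scenario-runs.
# Pass counts are taken from the results above; 26 scenarios x 3 runs each.
TOTAL_RUNS = 26 * 3  # 78

passed = {
    "Qwen3.5-27B-UD-Q4_K_XL": 56,
    "Qwen3.6-27B-Q4_K_M": 60,
    "gemma-4-31B-it-UD-Q4_K_XL": 55,
}

for model, n in passed.items():
    print(f"{model}: {n}/{TOTAL_RUNS} = {n / TOTAL_RUNS:.1%}")
```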

Pass Rate by Category

Category | Qwen 3.5 27B | Qwen 3.6 27B | Gemma 4 31B
React Chain | 83% | 88% | 100%
Error Recovery | 100% | 80% | 87%
Conditional Branching | 40% | 60% | 60%
Multi-Source Accumulation | 50% | 75% | 25%
Termination Judgment | 75% | 75% | 50%

Per-Scenario Results

Each scenario was run 3 times. A run passes only if every evaluation step within it passes; the n/3 cells below count passing runs. Latencies are full scenario duration (model inference plus simulated tool round-trips).

# | Scenario | Category | Steps | Qwen 3.5 | Qwen 3.6 | Gemma 4 | Lat (3.5) | Lat (3.6) | Lat (G4)
1 | Search then email results | React Chain | 2 | 3/3 | 3/3 | 3/3 | 5,479ms | 4,842ms | 7,151ms
2 | Check calendar then schedule around conflicts | React Chain | 2 | 1/3 | 3/3 | 3/3 | 5,716ms | 9,360ms | 14,007ms
3 | Research then store findings | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,516ms | 3,976ms | 4,341ms
4 | Get memory details then delete | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,071ms | 4,426ms | 5,565ms
5 | Search web then scrape specific page | React Chain | 2 | 3/3 | 3/3 | 3/3 | 3,458ms | 3,499ms | 6,314ms
6 | Calculate then act on result | React Chain | 2 | 3/3 | 3/3 | 3/3 | 4,713ms | 6,486ms | 6,147ms
7 | Lifelog search then store insight | React Chain | 2 | 3/3 | 0/3 | 3/3 | 4,999ms | 6,652ms | 7,598ms
8 | Multi-step research pipeline | React Chain | 3 | 1/3 | 3/3 | 3/3 | 5,397ms | 9,936ms | 9,522ms
9 | Memory search returns empty — try web | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 2,810ms | 3,691ms | 5,437ms
10 | Tool returns error — retry differently | Error Recovery | 2 | 3/3 | 0/3 | 3/3 | 3,049ms | 2,961ms | 3,365ms
11 | Web search fails — try alternative | Error Recovery | 2 | 3/3 | 3/3 | 1/3 | 2,620ms | 2,684ms | 9,390ms
12 | Malformed tool result — still function | Error Recovery | 1 | 3/3 | 3/3 | 3/3 | 2,955ms | 2,893ms | 3,711ms
13 | Partial results — ask for more | Error Recovery | 2 | 3/3 | 3/3 | 3/3 | 3,168ms | 3,727ms | 6,304ms
14 | Calendar check — busy vs free | Conditional | 2 | 0/3 | 3/3 | 3/3 | 4,081ms | 10,597ms | 9,269ms
15 | Calculation threshold — email vs store | Conditional | 2 | 3/3 | 3/3 | 3/3 | 3,992ms | 5,759ms | 5,754ms
16 | Memory exists — update vs create | Conditional | 2 | 0/3 | 3/3 | 0/3 | 2,913ms | 3,932ms | 6,065ms
17 | Search result quality — deep dive vs summarize | Conditional | 2 | 3/3 | 0/3 | 3/3 | 3,474ms | 4,479ms | 7,817ms
18 | Lifelog found vs not found | Conditional | 2 | 0/3 | 0/3 | 0/3 | 4,037ms | 9,586ms | 5,471ms
19 | Three-source briefing | Accumulation | 2 | 0/3 | 0/3 | 0/3 | 3,510ms | 5,714ms | 7,231ms
20 | Step-by-step financial analysis | Accumulation | 3 | 3/3 | 3/3 | 3/3 | 4,452ms | 7,811ms | 14,970ms
21 | Cross-reference memories and web | Accumulation | 3 | 0/3 | 3/3 | 0/3 | 3,192ms | 6,323ms | 5,522ms
22 | Scrape then analyze with code | Accumulation | 2 | 3/3 | 3/3 | 0/3 | 6,275ms | 10,754ms | 11,161ms
23 | Simple answer — don't over-tool | Termination | 2 | 3/3 | 3/3 | 3/3 | 2,789ms | 3,525ms | 2,821ms
24 | Memory search sufficient — don't web search | Termination | 2 | 3/3 | 3/3 | 3/3 | 1,553ms | 2,424ms | 1,423ms
25 | Task complete — report and stop | Termination | 3 | 0/3 | 0/3 | 0/3 | 4,779ms | 10,550ms | 5,050ms
26 | Ambiguous request — ask don't assume | Termination | 1 | 3/3 | 3/3 | 0/3 | 3,027ms | 5,475ms | 6,184ms
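The category pass rates reported earlier can be re-derived by aggregating the per-scenario run counts above. This is a sketch of that aggregation (not the benchmark's own code); category labels follow the table's abbreviations:

```python
from collections import defaultdict

# (category, Qwen 3.5 runs passed, Qwen 3.6 runs passed, Gemma 4 runs passed)
# for scenarios 1-26, transcribed from the table above. Each scenario = 3 runs.
ROWS = [
    ("React Chain", 3, 3, 3), ("React Chain", 1, 3, 3), ("React Chain", 3, 3, 3),
    ("React Chain", 3, 3, 3), ("React Chain", 3, 3, 3), ("React Chain", 3, 3, 3),
    ("React Chain", 3, 0, 3), ("React Chain", 1, 3, 3),
    ("Error Recovery", 3, 3, 3), ("Error Recovery", 3, 0, 3),
    ("Error Recovery", 3, 3, 1), ("Error Recovery", 3, 3, 3),
    ("Error Recovery", 3, 3, 3),
    ("Conditional", 0, 3, 3), ("Conditional", 3, 3, 3), ("Conditional", 0, 3, 0),
    ("Conditional", 3, 0, 3), ("Conditional", 0, 0, 0),
    ("Accumulation", 0, 0, 0), ("Accumulation", 3, 3, 3),
    ("Accumulation", 0, 3, 0), ("Accumulation", 3, 3, 0),
    ("Termination", 3, 3, 3), ("Termination", 3, 3, 3),
    ("Termination", 0, 0, 0), ("Termination", 3, 3, 0),
]

passed = defaultdict(lambda: [0, 0, 0])   # per category, per model
total = defaultdict(int)                  # scenario-runs per category
for cat, *runs in ROWS:
    total[cat] += 3
    for i, r in enumerate(runs):
        passed[cat][i] += r

rates = {cat: [round(100 * p / total[cat]) for p in passed[cat]] for cat in passed}
for cat, r in rates.items():
    print(cat, r)
```

Rounded to whole percentages, this reproduces the category table (e.g. React Chain 83 / 88 / 100).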

Methodology

Each scenario defines a sequence of turns. The benchmark drives the loop: it sends a user message, receives the model's tool calls, injects a simulated tool response for each, and repeats until the model returns a final answer. A run passes only if every evaluation step succeeds across the full chain.
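That driver loop can be sketched as follows. The function and message-shape names here are illustrative, not the benchmark's actual code, and the model is injected as a callable so the loop can be demonstrated without a live llama.cpp server:

```python
def drive_scenario(model, turns, tool_responses):
    """Run one scenario: send each user turn, answer the model's tool calls
    with canned results, and record which tools it invoked, in order."""
    messages, called = [], []
    for user_msg in turns:
        messages.append({"role": "user", "content": user_msg})
        reply = model(messages)
        while reply.get("tool_calls"):          # model wants more tool output
            messages.append(reply)
            for call in reply["tool_calls"]:
                called.append(call["name"])
                messages.append({
                    "role": "tool",
                    "name": call["name"],
                    "content": tool_responses.get(call["name"],
                                                  "ERROR: unknown tool"),
                })
            reply = model(messages)             # let the model continue
        messages.append(reply)                  # final answer for this turn
    return called


# Demo with a stub model that calls one tool, then answers:
def _stub(messages):
    if any(m.get("role") == "tool" for m in messages):
        return {"role": "assistant", "content": "done"}
    return {"role": "assistant", "content": None,
            "tool_calls": [{"name": "web_search", "arguments": "{}"}]}

print(drive_scenario(_stub, ["look this up"], {"web_search": "stub result"}))
# -> ['web_search']
```

In the real harness the model callable would wrap a chat-completions request to the llama.cpp server, and the recorded tool calls would be compared against the scenario's expected evaluation steps.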

Scenarios: 26 across 5 categories
Runs per scenario: 3
Total scenario-runs: 78
Hardware: RunPod RTX 5090 (32 GB VRAM, consumer flagship)
Inference: llama.cpp server-cuda Docker image
Quantization: Unsloth GGUF quants (Q4_K_M for Qwen 3.6; UD-Q4_K_XL for Qwen 3.5 and Gemma 4)
API params: temperature=0, parallel_tool_calls=true, enable_thinking=false
Context: 32K tokens
Flags: --jinja, -fa on, --parallel 2, --batch-size 4096
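Putting these settings together, a launch command along the following lines would reproduce the serving setup. This is a sketch: the image tag, model path, and port are illustrative placeholders, not the exact invocation used for the run; only the flags listed above are taken from the benchmark config.

```shell
# Illustrative llama.cpp server launch (placeholders: model path, port).
docker run --gpus all -p 8080:8080 \
  -v /models:/models \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  -fa on --jinja --parallel 2 --batch-size 4096 \
  --host 0.0.0.0 --port 8080
```

The OpenAI-compatible endpoint then accepts the per-request parameters (temperature=0, parallel_tool_calls=true, enable_thinking=false) from the benchmark client.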