Single-Turn Battery — Qwen 3.5 vs 3.6

Consumer-GPU Tool-Calling Evaluation · Apr 25, 2026
RTX 5090 · llama.cpp · Q4_K_M GGUF

Tool Adherence (42 tests · 11 tools · 5 runs · 210 calls)

Qwen 3.5 27B (baseline prompt)
95.2%
200 / 210
Single 100%Disambig 100%Complex 67%
Qwen 3.6 27B (baseline prompt)
92.9%
195 / 210
Single 100% Disambig 88% Complex 83%

Pass Rate by Category — baseline prompt

Qwen 3.5
Qwen 3.6
Single Tool
100%
100%
Disambiguation
100%
88%
Parallel 2-5
100%
93%
Sequential Chain
100%
100%
Complex Real-World
67%
83%

Proactive Intelligence (35 tests · 11 tools · 3 runs · 105 calls)

Qwen 3.5 27B (optimized prompt)
74.3%
78 / 105
Proactive 53%Restraint 100%Judgment 75%
Qwen 3.6 27B (optimized prompt)
88.6%
93 / 105
Proactive 87% Restraint 100% Judgment 75%

Pass Rate by Category — optimized prompt

Qwen 3.5
Qwen 3.6
Proactive
53%
87%
Restraint
100%
100%
Judgment
75%
75%

Per-Test Results — Tool Adherence (Qwen 3.6 27B)

Each test was run 5 times under both a baseline prompt and an optimized prompt. A test passes only if the model emits the correct tool call(s) with valid arguments. Showing Qwen 3.6 only — Qwen 3.5 results in mem_08402779.

#TestCategory Baseline Optimized
1 Simple memory search single 5/5 5/5
2 Store with bucket single 5/5 5/5
3 Calendar (ambiguous date) single 5/5 5/5
4 Email (all fields) single 5/5 4/5
5 Delete by ID single 5/5 5/5
6 Create one-time task single 5/5 5/5
7 Create recurring task single 5/5 5/5
8 Perplexity research single 5/5 5/5
9 Firecrawl scrape URL single 5/5 5/5
10 Send SMS single 5/5 5/5
11 Limitless daily summary single 5/5 5/5
12 Sandbox run Python single 5/5 5/5
13 Disambig: SMS not email disambiguation 5/5 5/5
14 Disambig: email not SMS disambiguation 0/5 0/5
15 Disambig: firecrawl not search disambiguation 5/5 5/5
16 Disambig: perplexity not web_search disambiguation 5/5 5/5
17 Disambig: limitless not memory disambiguation 5/5 5/5
18 Disambig: memory not limitless disambiguation 5/5 5/5
19 Disambig: schedule not task disambiguation 5/5 5/5
20 Disambig: task not schedule disambiguation 5/5 5/5
21 Two memory searches parallel 5/5 5/5
22 Store + email parallel 5/5 5/5
23 Web search + store parallel 5/5 5/5
24 Two emails parallel 5/5 1/5
25 Search + calc + store parallel 5/5 5/5
26 Store + schedule + email parallel 5/5 5/5
27 2 searches + calendar parallel 5/5 5/5
28 4-tool workflow parallel 0/5 4/5
29 4 independent stores parallel 5/5 5/5
30 5-way mixed parallel 5/5 5/5
31 SMS + email (two channels) parallel 5/5 5/5
32 Perplexity + memory store parallel 5/5 5/5
33 Limitless + memory search parallel 5/5 5/5
34 Schedule + create_task parallel 5/5 5/5
35 Cancel + reschedule sequential 5/5 5/5
36 Get + delete sequential 5/5 5/5
37 EOD: 2 stores + email complex 5/5 5/5
38 3 stores + email complex 5/5 0/5
39 Morning briefing complex 5/5 5/5
40 Research + store + task complex 5/5 5/5
41 Lifelog + memory + email complex 0/5 0/5
42 SMS + email + schedule complex 5/5 5/5

Per-Test Results — Proactive Intelligence (Qwen 3.6 27B)

Three categories: proactive (model SHOULD call a tool from conversational cues), restraint (model should NOT call any tool), judgment (ambiguous; both options can be correct).

#TestCategory Baseline Optimized
1 Store: key decision shared proactive 0/3 3/3
2 Store: strategic insight proactive 0/3 3/3
3 Store: multiple important facts proactive 3/3 0/3
4 Search: user trying to remember proactive 3/3 3/3
5 Search: prep implies need proactive 3/3 3/3
6 SMS: user running late proactive 3/3 3/3
7 SMS: something down urgently proactive 3/3 3/3
8 Lifelog: references a call proactive 3/3 3/3
9 Lifelog: what happened yesterday proactive 3/3 3/3
10 Schedule: casual mention of event proactive 3/3 3/3
11 Scrape: references specific URL proactive 3/3 3/3
12 Research: genuine curiosity proactive 3/3 3/3
13 Task: deferred follow-up proactive 3/3 3/3
14 Store: contact info dropped proactive 3/3 3/3
15 Search + store: correction proactive 0/3 0/3
16 Restraint: thinking out loud restraint 3/3 3/3
17 Restraint: venting frustration restraint 3/3 3/3
18 Restraint: rhetorical question restraint 3/3 3/3
19 Restraint: telling a story restraint 3/3 3/3
20 Restraint: humor restraint 3/3 3/3
21 Restraint: positive emotion restraint 3/3 3/3
22 Restraint: known general knowledge restraint 3/3 3/3
23 Restraint: asking AI's opinion restraint 3/3 3/3
24 Restraint: wrapping up restraint 3/3 3/3
25 Restraint: compliment restraint 3/3 3/3
26 Restraint: hypothetical question restraint 3/3 3/3
27 Restraint: casual agreement restraint 3/3 3/3
28 Judgment: allusion to forgotten detail judgment 3/3 3/3
29 Judgment: deadline mentioned in passing judgment 0/3 0/3
30 Judgment: wants to verify a number judgment 3/3 3/3
31 Judgment: wants to notify later judgment 0/3 0/3
32 Judgment: competitive intel opportunity judgment 3/3 3/3
33 Judgment: implicit calendar check judgment 3/3 3/3
34 Judgment: pattern worth preserving judgment 0/3 3/3
35 Judgment: morning context loading judgment 3/3 3/3

Methodology

Tool tests
42 tests × 5 runs × 2 prompts = 420 calls
Proactive tests
35 tests × 3 runs × 2 prompts = 210 calls
Tool library
memory_box, send_email, send_sms, schedule, create_task, web_search, perplexity, firecrawl, calculate, sandbox_run, limitless
Hardware
RunPod RTX 5090
Inference
llama.cpp server-cuda
Quantization
Q4_K_M GGUF (Unsloth)
API params
temperature=0, parallel_tool_calls=true, enable_thinking=false