Single-Turn Battery — Qwen 3.5 vs 3.6

Tool Adherence (42 tests · 11 tools · 5 runs · 210 calls)

Qwen 3.5 27B (baseline prompt)

95.2%

200 / 210

Single 100%Disambig 100%Complex 67%

Qwen 3.6 27B (baseline prompt)

92.9%

195 / 210

Single 100% Disambig 88% Complex 83%

Pass Rate by Category — baseline prompt

Qwen 3.5

Qwen 3.6

Single Tool

100%

Disambiguation

100%

88%

Parallel 2-5

100%

93%

Sequential Chain

100%

Complex Real-World

67%

83%

Proactive Intelligence (35 tests · 11 tools · 3 runs · 105 calls)

Qwen 3.5 27B (optimized prompt)

74.3%

78 / 105

Proactive 53%Restraint 100%Judgment 75%

Qwen 3.6 27B (optimized prompt)

88.6%

93 / 105

Proactive 87% Restraint 100% Judgment 75%

Pass Rate by Category — optimized prompt

Qwen 3.5

Qwen 3.6

Proactive

53%

87%

Restraint

100%

Judgment

75%

Per-Test Results — Tool Adherence (Qwen 3.6 27B)

Each test was run 5 times under both a baseline prompt and an optimized prompt. A test passes only if the model emits the correct tool call(s) with valid arguments. Showing Qwen 3.6 only — Qwen 3.5 results in mem_08402779.

#	Test	Category	Baseline	Optimized
1	Simple memory search	single	5/5	5/5
2	Store with bucket	single	5/5	5/5
3	Calendar (ambiguous date)	single	5/5	5/5
4	Email (all fields)	single	5/5	4/5
5	Delete by ID	single	5/5	5/5
6	Create one-time task	single	5/5	5/5
7	Create recurring task	single	5/5	5/5
8	Perplexity research	single	5/5	5/5
9	Firecrawl scrape URL	single	5/5	5/5
10	Send SMS	single	5/5	5/5
11	Limitless daily summary	single	5/5	5/5
12	Sandbox run Python	single	5/5	5/5
13	Disambig: SMS not email	disambiguation	5/5	5/5
14	Disambig: email not SMS	disambiguation	0/5	0/5
15	Disambig: firecrawl not search	disambiguation	5/5	5/5
16	Disambig: perplexity not web_search	disambiguation	5/5	5/5
17	Disambig: limitless not memory	disambiguation	5/5	5/5
18	Disambig: memory not limitless	disambiguation	5/5	5/5
19	Disambig: schedule not task	disambiguation	5/5	5/5
20	Disambig: task not schedule	disambiguation	5/5	5/5
21	Two memory searches	parallel	5/5	5/5
22	Store + email	parallel	5/5	5/5
23	Web search + store	parallel	5/5	5/5
24	Two emails	parallel	5/5	1/5
25	Search + calc + store	parallel	5/5	5/5
26	Store + schedule + email	parallel	5/5	5/5
27	2 searches + calendar	parallel	5/5	5/5
28	4-tool workflow	parallel	0/5	4/5
29	4 independent stores	parallel	5/5	5/5
30	5-way mixed	parallel	5/5	5/5
31	SMS + email (two channels)	parallel	5/5	5/5
32	Perplexity + memory store	parallel	5/5	5/5
33	Limitless + memory search	parallel	5/5	5/5
34	Schedule + create_task	parallel	5/5	5/5
35	Cancel + reschedule	sequential	5/5	5/5
36	Get + delete	sequential	5/5	5/5
37	EOD: 2 stores + email	complex	5/5	5/5
38	3 stores + email	complex	5/5	0/5
39	Morning briefing	complex	5/5	5/5
40	Research + store + task	complex	5/5	5/5
41	Lifelog + memory + email	complex	0/5	0/5
42	SMS + email + schedule	complex	5/5	5/5

Per-Test Results — Proactive Intelligence (Qwen 3.6 27B)

Three categories: proactive (model SHOULD call a tool from conversational cues), restraint (model should NOT call any tool), judgment (ambiguous; both options can be correct).

#	Test	Category	Baseline	Optimized
1	Store: key decision shared	proactive	0/3	3/3
2	Store: strategic insight	proactive	0/3	3/3
3	Store: multiple important facts	proactive	3/3	0/3
4	Search: user trying to remember	proactive	3/3	3/3
5	Search: prep implies need	proactive	3/3	3/3
6	SMS: user running late	proactive	3/3	3/3
7	SMS: something down urgently	proactive	3/3	3/3
8	Lifelog: references a call	proactive	3/3	3/3
9	Lifelog: what happened yesterday	proactive	3/3	3/3
10	Schedule: casual mention of event	proactive	3/3	3/3
11	Scrape: references specific URL	proactive	3/3	3/3
12	Research: genuine curiosity	proactive	3/3	3/3
13	Task: deferred follow-up	proactive	3/3	3/3
14	Store: contact info dropped	proactive	3/3	3/3
15	Search + store: correction	proactive	0/3	0/3
16	Restraint: thinking out loud	restraint	3/3	3/3
17	Restraint: venting frustration	restraint	3/3	3/3
18	Restraint: rhetorical question	restraint	3/3	3/3
19	Restraint: telling a story	restraint	3/3	3/3
20	Restraint: humor	restraint	3/3	3/3
21	Restraint: positive emotion	restraint	3/3	3/3
22	Restraint: known general knowledge	restraint	3/3	3/3
23	Restraint: asking AI's opinion	restraint	3/3	3/3
24	Restraint: wrapping up	restraint	3/3	3/3
25	Restraint: compliment	restraint	3/3	3/3
26	Restraint: hypothetical question	restraint	3/3	3/3
27	Restraint: casual agreement	restraint	3/3	3/3
28	Judgment: allusion to forgotten detail	judgment	3/3	3/3
29	Judgment: deadline mentioned in passing	judgment	0/3	0/3
30	Judgment: wants to verify a number	judgment	3/3	3/3
31	Judgment: wants to notify later	judgment	0/3	0/3
32	Judgment: competitive intel opportunity	judgment	3/3	3/3
33	Judgment: implicit calendar check	judgment	3/3	3/3
34	Judgment: pattern worth preserving	judgment	0/3	3/3
35	Judgment: morning context loading	judgment	3/3	3/3

Methodology

Tool tests: 42 tests × 5 runs × 2 prompts = 420 calls
Proactive tests: 35 tests × 3 runs × 2 prompts = 210 calls
Tool library: memory_box, send_email, send_sms, schedule, create_task, web_search, perplexity, firecrawl, calculate, sandbox_run, limitless
Hardware: RunPod RTX 5090
Inference: llama.cpp server-cuda
Quantization: Q4_K_M GGUF (Unsloth)
API params: temperature=0, parallel_tool_calls=true, enable_thinking=false