Each test was run 5 times under both a baseline prompt and an optimized prompt. A run passes only if the model emits the correct tool call(s) with valid arguments; each score is the number of passing runs out of 5. Only Qwen 3.6 results are shown here; Qwen 3.5 results are in mem_08402779.

| # | Test | Category | Baseline | Optimized |
|---|---|---|---|---|
| 1 | Simple memory search | single | 5/5 | 5/5 |
| 2 | Store with bucket | single | 5/5 | 5/5 |
| 3 | Calendar (ambiguous date) | single | 5/5 | 5/5 |
| 4 | Email (all fields) | single | 5/5 | 4/5 |
| 5 | Delete by ID | single | 5/5 | 5/5 |
| 6 | Create one-time task | single | 5/5 | 5/5 |
| 7 | Create recurring task | single | 5/5 | 5/5 |
| 8 | Perplexity research | single | 5/5 | 5/5 |
| 9 | Firecrawl scrape URL | single | 5/5 | 5/5 |
| 10 | Send SMS | single | 5/5 | 5/5 |
| 11 | Limitless daily summary | single | 5/5 | 5/5 |
| 12 | Sandbox run Python | single | 5/5 | 5/5 |
| 13 | Disambig: SMS not email | disambiguation | 5/5 | 5/5 |
| 14 | Disambig: email not SMS | disambiguation | 0/5 | 0/5 |
| 15 | Disambig: firecrawl not search | disambiguation | 5/5 | 5/5 |
| 16 | Disambig: perplexity not web_search | disambiguation | 5/5 | 5/5 |
| 17 | Disambig: limitless not memory | disambiguation | 5/5 | 5/5 |
| 18 | Disambig: memory not limitless | disambiguation | 5/5 | 5/5 |
| 19 | Disambig: schedule not task | disambiguation | 5/5 | 5/5 |
| 20 | Disambig: task not schedule | disambiguation | 5/5 | 5/5 |
| 21 | Two memory searches | parallel | 5/5 | 5/5 |
| 22 | Store + email | parallel | 5/5 | 5/5 |
| 23 | Web search + store | parallel | 5/5 | 5/5 |
| 24 | Two emails | parallel | 5/5 | 1/5 |
| 25 | Search + calc + store | parallel | 5/5 | 5/5 |
| 26 | Store + schedule + email | parallel | 5/5 | 5/5 |
| 27 | 2 searches + calendar | parallel | 5/5 | 5/5 |
| 28 | 4-tool workflow | parallel | 0/5 | 4/5 |
| 29 | 4 independent stores | parallel | 5/5 | 5/5 |
| 30 | 5-way mixed | parallel | 5/5 | 5/5 |
| 31 | SMS + email (two channels) | parallel | 5/5 | 5/5 |
| 32 | Perplexity + memory store | parallel | 5/5 | 5/5 |
| 33 | Limitless + memory search | parallel | 5/5 | 5/5 |
| 34 | Schedule + create_task | parallel | 5/5 | 5/5 |
| 35 | Cancel + reschedule | sequential | 5/5 | 5/5 |
| 36 | Get + delete | sequential | 5/5 | 5/5 |
| 37 | EOD: 2 stores + email | complex | 5/5 | 5/5 |
| 38 | 3 stores + email | complex | 5/5 | 0/5 |
| 39 | Morning briefing | complex | 5/5 | 5/5 |
| 40 | Research + store + task | complex | 5/5 | 5/5 |
| 41 | Lifelog + memory + email | complex | 0/5 | 0/5 |
| 42 | SMS + email + schedule | complex | 5/5 | 5/5 |
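
The pass criterion used for the table above (correct tool call(s) with valid arguments) could be checked along the following lines. This is only a minimal sketch, not the harness that produced these results: the call format (`{"name": ..., "arguments": ...}`), the tool name `send_email`, and the idea that "valid" means "all required arguments present and non-empty" are assumptions made for illustration.

```python
from collections import Counter


def run_passes(expected_tools, required_args, emitted_calls):
    """Return True if one run emitted the expected tool calls with valid arguments.

    expected_tools: list of tool names the test expects (hypothetical format)
    required_args:  dict mapping tool name -> required argument names (assumed)
    emitted_calls:  list of {"name": str, "arguments": dict} (assumed format)
    """
    # The same tools must be called the same number of times. Order is ignored
    # so that parallel calls may arrive in any order.
    if Counter(call["name"] for call in emitted_calls) != Counter(expected_tools):
        return False
    # Every emitted call must carry a non-empty value for each required argument.
    for call in emitted_calls:
        args = call.get("arguments", {})
        for key in required_args.get(call["name"], ()):
            if args.get(key) in (None, ""):
                return False
    return True


def score(run_results):
    """Format a list of per-run booleans as 'passed/total', as in the tables."""
    return f"{sum(run_results)}/{len(run_results)}"


# Hypothetical example: a "two emails" style test where one run drops a call.
required = {"send_email": ["to", "subject", "body"]}
good = {"name": "send_email",
        "arguments": {"to": "a@example.com", "subject": "Hi", "body": "Hello"}}
runs = [
    run_passes(["send_email", "send_email"], required, [good, good]),
    run_passes(["send_email", "send_email"], required, [good]),
]
print(score(runs))  # prints "1/2"
```

Comparing tool names as a multiset keeps the check strict about missing or extra calls while staying indifferent to ordering, which matters for the parallel tests.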

These tests fall into three categories: proactive (the model SHOULD call a tool based on conversational cues), restraint (the model should NOT call any tool), and judgment (ambiguous; both options can be correct). Scores in this table are out of 3 runs.

| # | Test | Category | Baseline | Optimized |
|---|---|---|---|---|
| 1 | Store: key decision shared | proactive | 0/3 | 3/3 |
| 2 | Store: strategic insight | proactive | 0/3 | 3/3 |
| 3 | Store: multiple important facts | proactive | 3/3 | 0/3 |
| 4 | Search: user trying to remember | proactive | 3/3 | 3/3 |
| 5 | Search: prep implies need | proactive | 3/3 | 3/3 |
| 6 | SMS: user running late | proactive | 3/3 | 3/3 |
| 7 | SMS: something down urgently | proactive | 3/3 | 3/3 |
| 8 | Lifelog: references a call | proactive | 3/3 | 3/3 |
| 9 | Lifelog: what happened yesterday | proactive | 3/3 | 3/3 |
| 10 | Schedule: casual mention of event | proactive | 3/3 | 3/3 |
| 11 | Scrape: references specific URL | proactive | 3/3 | 3/3 |
| 12 | Research: genuine curiosity | proactive | 3/3 | 3/3 |
| 13 | Task: deferred follow-up | proactive | 3/3 | 3/3 |
| 14 | Store: contact info dropped | proactive | 3/3 | 3/3 |
| 15 | Search + store: correction | proactive | 0/3 | 0/3 |
| 16 | Restraint: thinking out loud | restraint | 3/3 | 3/3 |
| 17 | Restraint: venting frustration | restraint | 3/3 | 3/3 |
| 18 | Restraint: rhetorical question | restraint | 3/3 | 3/3 |
| 19 | Restraint: telling a story | restraint | 3/3 | 3/3 |
| 20 | Restraint: humor | restraint | 3/3 | 3/3 |
| 21 | Restraint: positive emotion | restraint | 3/3 | 3/3 |
| 22 | Restraint: known general knowledge | restraint | 3/3 | 3/3 |
| 23 | Restraint: asking AI's opinion | restraint | 3/3 | 3/3 |
| 24 | Restraint: wrapping up | restraint | 3/3 | 3/3 |
| 25 | Restraint: compliment | restraint | 3/3 | 3/3 |
| 26 | Restraint: hypothetical question | restraint | 3/3 | 3/3 |
| 27 | Restraint: casual agreement | restraint | 3/3 | 3/3 |
| 28 | Judgment: allusion to forgotten detail | judgment | 3/3 | 3/3 |
| 29 | Judgment: deadline mentioned in passing | judgment | 0/3 | 0/3 |
| 30 | Judgment: wants to verify a number | judgment | 3/3 | 3/3 |
| 31 | Judgment: wants to notify later | judgment | 0/3 | 0/3 |
| 32 | Judgment: competitive intel opportunity | judgment | 3/3 | 3/3 |
| 33 | Judgment: implicit calendar check | judgment | 3/3 | 3/3 |
| 34 | Judgment: pattern worth preserving | judgment | 0/3 | 3/3 |
| 35 | Judgment: morning context loading | judgment | 3/3 | 3/3 |
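
To compare the two prompts per category rather than per test, rows in the format used above can be tallied directly. The sketch below is illustrative only: the function name `tally` is hypothetical, and the sample input is three rows copied from the table above, not the full result set.

```python
from collections import defaultdict


def tally(table_md):
    """Sum passed/total runs per category from a results table in markdown."""
    totals = defaultdict(lambda: {"baseline": [0, 0], "optimized": [0, 0]})
    for line in table_md.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header and separator rows: data rows start with a test number.
        if len(cells) != 5 or not cells[0].isdigit():
            continue
        _, _, category, baseline, optimized = cells
        for key, cell in (("baseline", baseline), ("optimized", optimized)):
            passed, runs = map(int, cell.split("/"))
            totals[category][key][0] += passed
            totals[category][key][1] += runs
    return {cat: {k: tuple(v) for k, v in d.items()} for cat, d in totals.items()}


sample = """
| # | Test | Category | Baseline | Optimized |
|---|---|---|---|---|
| 1 | Store: key decision shared | proactive | 0/3 | 3/3 |
| 3 | Store: multiple important facts | proactive | 3/3 | 0/3 |
| 34 | Judgment: pattern worth preserving | judgment | 0/3 | 3/3 |
"""

for category, result in tally(sample).items():
    print(category, result)
# proactive {'baseline': (3, 6), 'optimized': (3, 6)}
# judgment {'baseline': (0, 3), 'optimized': (3, 3)}
```

Keeping passed and total counts separate, instead of averaging percentages, avoids mixing the 5-run and 3-run tables incorrectly if both are tallied together.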