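Accuracy Testing

Use the locomo10 test set to evaluate the Agent's long-term conversational memory. First, download the dataset and the test scripts.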
```
git clone https://github.com/ZaynJarvis/openclaw-eval.git
cd openclaw-eval
```
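Before ingesting, it can help to check how the dataset file is organized. The following is a minimal sketch that makes no assumption about the schema of locomo10_small.json: it only reports the top-level JSON type, its size, and the keys of the first record when there is one. The function names `summarize_json` and `summarize_file` and the sample data are hypothetical and are not part of the openclaw-eval repository.

```python
import json
from pathlib import Path


def summarize_json(data):
    """Describe the top-level shape of parsed JSON without assuming a schema."""
    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)
    if isinstance(data, dict):
        summary["keys"] = sorted(data)
    elif isinstance(data, list) and data and isinstance(data[0], dict):
        summary["first_record_keys"] = sorted(data[0])
    return summary


def summarize_file(path):
    """Load a JSON file from disk and summarize it."""
    with Path(path).open(encoding="utf-8") as f:
        return summarize_json(json.load(f))


if __name__ == "__main__":
    # Made-up sample for illustration; the real file's layout may differ.
    sample = [{"id": 1, "turns": ["hi", "hello"]}]
    print(summarize_json(sample))
```

To inspect the real file, call `summarize_file("./locomo10_small.json")` from the directory that contains it.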

Store the conversation records from the reduced dataset locomo10_small.json in memory.

# Sync the virtual environment
uv sync
# Store the dataset in memory
uv run eval.py --base-url <your_base_url> --token <your_gateway_token> ingest ./locomo10_small.json --output output/trial.txt --tail "[remember what's said, keep existing memory]"

Run the eval.py script to execute the QA test. It records OpenClaw's answers alongside the expected answers.

uv run eval.py --base-url <your_base_url>   --token <your_gateway_token> qa ./locomo10_small.json --output output/answers.txt --count 100
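The ingest and qa commands share the same connection flags, so a small wrapper can keep the base URL and token out of the shell history. This is only a sketch: it uses just the flags that appear in the commands above, and the environment variable names `OPENCLAW_BASE_URL` and `OPENCLAW_TOKEN`, the function `build_eval_cmd`, and the demo values are assumptions made for illustration.

```python
import os
import shlex


def build_eval_cmd(subcommand, dataset, output, extra=(), env=None):
    """Assemble the argv list for `uv run eval.py ...` from the flags shown in this guide."""
    env = os.environ if env is None else env
    # Hypothetical variable names; choose whatever fits your setup.
    base_url = env["OPENCLAW_BASE_URL"]
    token = env["OPENCLAW_TOKEN"]
    return [
        "uv", "run", "eval.py",
        "--base-url", base_url,
        "--token", token,
        subcommand, dataset,
        "--output", output,
        *extra,
    ]


if __name__ == "__main__":
    # Placeholder values only; substitute your own gateway URL and token.
    demo_env = {
        "OPENCLAW_BASE_URL": "http://localhost:8000",
        "OPENCLAW_TOKEN": "example-token",
    }
    ingest = build_eval_cmd(
        "ingest", "./locomo10_small.json", "output/trial.txt",
        extra=["--tail", "[remember what's said, keep existing memory]"],
        env=demo_env,
    )
    qa = build_eval_cmd(
        "qa", "./locomo10_small.json", "output/answers.txt",
        extra=["--count", "100"],
        env=demo_env,
    )
    # Print the shell-quoted commands instead of running them.
    print(shlex.join(ingest))
    print(shlex.join(qa))
```

To execute a command rather than print it, pass the list to `subprocess.run(cmd, check=True)`. Building an argv list avoids quoting problems with the bracketed `--tail` string.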

Run judge.py, which calls a large language model as a judge to score the long-conversation results and compile statistics.

uv run judge.py output/answers.txt.json  --base-url <your_base_url>  --token <LLM_API_key> --model <your_model_name>
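The output format of judge.py is not described here. As an illustration of what scoring and statistics usually amount to in an LLM-as-judge setup, the sketch below aggregates per-question verdicts into overall and per-category accuracy. The record fields `category` and `correct`, the function `accuracy_report`, and the sample verdicts are all hypothetical; adapt them to whatever judge.py actually emits.

```python
from collections import defaultdict


def accuracy_report(verdicts):
    """Return (overall_accuracy, per_category_accuracy) from verdict records.

    Each record is assumed to look like {"category": str, "correct": bool}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for v in verdicts:
        totals[v["category"]] += 1
        if v["correct"]:
            hits[v["category"]] += 1
    n = sum(totals.values())
    overall = sum(hits.values()) / n if n else 0.0
    per_category = {c: hits[c] / totals[c] for c in sorted(totals)}
    return overall, per_category


if __name__ == "__main__":
    # Made-up verdicts for illustration only.
    sample = [
        {"category": "single-hop", "correct": True},
        {"category": "single-hop", "correct": False},
        {"category": "temporal", "correct": True},
        {"category": "temporal", "correct": True},
    ]
    overall, per_category = accuracy_report(sample)
    print(f"overall: {overall:.2f}")
    for name, acc in per_category.items():
        print(f"{name}: {acc:.2f}")
```

Reporting accuracy per category as well as overall makes it easier to see which kinds of questions the memory handles poorly.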
