Paper • Website • Dataset
English version of the benchmark coming soon.
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions.
VitaBench 2.0 extends VitaBench from one-shot tasks to long-term, multi-session user interactions, where an agent must infer, use, and update user preferences across fragmented conversations and behaviors that span days, weeks, or months. While VitaBench 1.0 measures whether an agent can complete a single complex life-serving request, VitaBench 2.0 asks a further question: can an agent understand the user from daily interactions, anticipate their evolving needs, and act on their behalf over time?
Each evaluation in VitaBench 2.0 simulates a continuing relationship between an agent and a user across multiple sessions in daily scenarios. Across these sessions, user preferences drift, prior commitments must be honored, and earlier context must be retrieved or reconstructed to act correctly in the present. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across three representative memory architectures:
- Full Context: the entire interaction history is appended to the prompt, an upper bound on what the model can possibly leverage.
- Agentic Memory: the agent autonomously decides what to write to and read from a structured memory store.
- RAG Memory: past interactions are chunked, embedded, and retrieved on demand.
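
The three architectures above can be pictured as implementations of one small interface. The following is a minimal sketch, not VitaBench's actual memory interface: the class names, the `write`/`read` methods, the `should_store` callable (standing in for an LLM decision), and the word-overlap retrieval (standing in for embedding similarity) are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class Memory(ABC):
    """Hypothetical interface: store past interactions, return context for a query."""

    @abstractmethod
    def write(self, interaction: str) -> None: ...

    @abstractmethod
    def read(self, query: str) -> str: ...


class FullContextMemory(Memory):
    """Keeps everything and returns the whole history regardless of the query."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def write(self, interaction: str) -> None:
        self.history.append(interaction)

    def read(self, query: str) -> str:
        return "\n".join(self.history)


class AgenticMemory(Memory):
    """Stores only what a decision function chooses to keep.

    In a real agent the model would make this decision; here `should_store`
    is a plain callable so the sketch runs without any LLM.
    """

    def __init__(self, should_store) -> None:
        self.should_store = should_store
        self.notes: list[str] = []

    def write(self, interaction: str) -> None:
        if self.should_store(interaction):
            self.notes.append(interaction)

    def read(self, query: str) -> str:
        return "\n".join(self.notes)


class RagMemory(Memory):
    """Returns the top-k stored chunks ranked by word overlap with the query."""

    def __init__(self, top_k: int = 1) -> None:
        self.top_k = top_k
        self.chunks: list[str] = []

    def write(self, interaction: str) -> None:
        self.chunks.append(interaction)

    def read(self, query: str) -> str:
        query_words = set(query.lower().split())

        def overlap(chunk: str) -> int:
            return len(query_words & set(chunk.lower().split()))

        ranked = sorted(self.chunks, key=overlap, reverse=True)
        return "\n".join(ranked[: self.top_k])
```

Because all three classes share the same two methods, an evaluation loop can swap one for another without changing the agent code, which is the kind of controlled comparison the benchmark describes.
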
Our results show that even state-of-the-art models reach only about 50% Avg@4 under Full Context (0.503 for the best-performing model) and degrade further under realistic memory settings, indicating that long-horizon personalization and proactivity remain open challenges for current LLM agents.

    git clone https://github.com/meituan-longcat/VitaBench-2.0.git
    cd VitaBench-2.0
    pip install -e .

This installs the `vita` CLI.
VitaBench 2.0 tasks are hosted on Hugging Face: meituan-longcat/VitaBench-2.0.

    pip install -U "huggingface_hub[cli]"
    huggingface-cli download meituan-longcat/VitaBench-2.0 \
        --repo-type dataset \
        --local-dir data/vita/domains/personalization

After downloading, you should have `data/vita/domains/personalization/tasks.json` (56 users, 771 subtasks).

    cp src/vita/models.yaml.example src/vita/models.yaml
    export OPENAI_API_KEY=sk-...

`src/vita/models.yaml` supports any OpenAI-compatible endpoint: change `default.base_url` to point at Azure, vLLM, Together, llama.cpp, etc. The YAML supports `${VAR}` placeholders, which are expanded from your shell environment.
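
To make the `${VAR}` behaviour concrete, here is a minimal sketch of placeholder expansion. It is not the project's actual loader: the function name `expand_placeholders`, the `EXAMPLE_BASE_URL` variable, and the choice to leave unset variables untouched are assumptions made for illustration.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env=None) -> str:
    """Replace each ${VAR} in text with its value from env (default: os.environ).

    Unset variables are left as-is so that a missing key is easy to spot.
    """
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), text)


# Hypothetical config fragment; only OPENAI_API_KEY is supplied here.
raw_yaml = "default:\n  base_url: ${EXAMPLE_BASE_URL}\n  api_key: ${OPENAI_API_KEY}\n"
print(expand_placeholders(raw_yaml, {"OPENAI_API_KEY": "sk-test"}))
```

Expanding the raw text before parsing keeps secrets out of the committed YAML file while still letting one file describe several endpoints.
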
For RAG / embeddings:

| Env var | Default | Purpose |
|---|---|---|
| `VITA_EMBEDDING_URL` | `models.yaml` `default.base_url` | Embedding endpoint |
| `VITA_EMBEDDING_KEY` | `models.yaml` `default.api_key` | Embedding API key |
| `VITA_EMBEDDING_MODEL` | `text-embedding-3-large` | Embedding model name |
| `VITA_EMBEDDING_MAX_CONCURRENCY` | `64` | Per-event-loop semaphore size |
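
The fallbacks in the table, and the role of the concurrency limit, can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: `resolve_embedding_settings`, `embed_all`, and `fake_embed` are hypothetical names, and the embedding call is replaced by a short sleep so the example runs offline.

```python
import asyncio
import os


def resolve_embedding_settings(default_base_url: str, default_api_key: str, env=None) -> dict:
    """Apply the table's defaults: env vars win, otherwise fall back."""
    env = os.environ if env is None else env
    return {
        "url": env.get("VITA_EMBEDDING_URL", default_base_url),
        "key": env.get("VITA_EMBEDDING_KEY", default_api_key),
        "model": env.get("VITA_EMBEDDING_MODEL", "text-embedding-3-large"),
        "max_concurrency": int(env.get("VITA_EMBEDDING_MAX_CONCURRENCY", "64")),
    }


async def embed_all(texts, max_concurrency: int):
    """Run a stand-in embedding call per text, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)  # created inside the running loop
    in_flight = 0
    peak = 0

    async def fake_embed(text: str) -> list:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for the network round trip
            in_flight -= 1
            return [float(len(text))]

    vectors = await asyncio.gather(*(fake_embed(t) for t in texts))
    return vectors, peak
```

Creating the semaphore inside the coroutine ties it to the event loop that is actually running, which is one plausible reading of "per-event-loop" in the table.
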

    vita run \
        --domain personalization \
        --memory-type rewrite \
        --agent-llm gpt-4.1 \
        --user-llm gpt-4.1 \
        --evaluator-llm gpt-4.1 \
        --num-tasks 1 --max-steps 50

`--save-to <name>.json` writes results under `data/simulations/`.
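
When comparing several memory backends it can help to generate the command lines programmatically. The sketch below only builds and prints commands using the flags shown above; it does not invoke `vita`. The helper name `build_run_command` and the choice of `<memory_type>.json` as the output name are assumptions made for illustration.

```python
import shlex

MEMORY_TYPES = ["full_context", "rewrite", "rag"]


def build_run_command(memory_type: str, model: str = "gpt-4.1") -> list:
    """Assemble the argument list for one run with the given memory backend."""
    return [
        "vita", "run",
        "--domain", "personalization",
        "--memory-type", memory_type,
        "--agent-llm", model,
        "--user-llm", model,
        "--evaluator-llm", model,
        "--num-tasks", "1",
        "--max-steps", "50",
        "--save-to", f"{memory_type}.json",  # illustrative output name
    ]


for memory_type in MEMORY_TYPES:
    # shlex.join quotes arguments safely for copy-pasting into a shell.
    print(shlex.join(build_run_command(memory_type)))
```

Keeping the command as a list rather than a string means it could later be handed to `subprocess.run` unchanged.
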

    bash scripts/run_memory_benchmark.sh
    # or a subset:
    bash scripts/run_memory_benchmark.sh full_context rewrite rag

| `--memory-type` | Behaviour |
|---|---|
| `null` | No memory (baseline) |
| `groundtruth` | Injects the canonical preference memory directly (upper bound) |
| `full_context` | Dumps every prior interaction as context |
| `rewrite` | LLM rewrites a single consolidated memory string each update |
| `rag` | Async vector retrieval (`text-embedding-3-large` by default) |
| `rag_cache` | RAG with a precomputed embedding cache (see `scripts/precompute_rag_cache.py`) |

Per-backend defaults live in `src/vita/memory.yaml`; constructor kwargs override.
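
The "defaults, then kwargs" layering can be sketched in a few lines. This is an assumption-laden illustration: the `DEFAULTS` dictionary stands in for whatever `src/vita/memory.yaml` actually contains, and the keys `top_k`, `chunk_size`, and `max_memory_chars` are hypothetical.

```python
# Hypothetical per-backend defaults; the real file's keys may differ.
DEFAULTS = {
    "rag": {"top_k": 5, "chunk_size": 512},
    "rewrite": {"max_memory_chars": 4000},
}


def resolve_config(memory_type: str, **overrides) -> dict:
    """Start from the backend's defaults, then let explicit kwargs win."""
    config = dict(DEFAULTS.get(memory_type, {}))  # copy so DEFAULTS stays untouched
    config.update(overrides)
    return config


print(resolve_config("rag", top_k=10))
```

Copying the defaults before updating avoids one run's overrides leaking into the next.
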
Performance of non-thinking and thinking models under three memory settings. The leaderboard is sorted by Avg@4 under Full Context. Best results in each column are in bold.

Non-thinking models:

| Model | Full Context Avg@4 | Full Context Pass@4 | Full Context Pass^4 | Agentic Memory Avg@4 | Agentic Memory Pass@4 | Agentic Memory Pass^4 | RAG Memory Avg@4 | RAG Memory Pass@4 | RAG Memory Pass^4 |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | 0.067 | 0.180 | 0.006 | 0.084 | 0.229 | 0.008 | 0.094 | 0.227 | 0.011 |
| GPT-3.5-Turbo | 0.140 | 0.314 | 0.019 | 0.231 | 0.467 | 0.056 | 0.205 | 0.409 | 0.059 |
| LongCat-Flash-Chat | 0.298 | 0.510 | 0.123 | 0.302 | 0.537 | 0.105 | 0.290 | 0.471 | 0.136 |
| GLM-4.5 | 0.307 | 0.529 | 0.127 | 0.330 | 0.569 | 0.112 | 0.316 | 0.523 | 0.152 |
| Doubao-Seed-1.6 | 0.326 | 0.512 | 0.171 | 0.340 | 0.576 | 0.129 | 0.351 | 0.543 | 0.174 |
| GLM-4.6 | 0.342 | 0.612 | 0.113 | 0.336 | 0.623 | 0.084 | 0.317 | 0.555 | 0.123 |
| Kimi-K2.6 | 0.378 | 0.632 | 0.147 | 0.397 | 0.674 | 0.145 | 0.383 | 0.621 | 0.163 |
| GLM-5.1 | 0.420 | 0.654 | 0.204 | 0.423 | 0.664 | 0.182 | 0.383 | 0.585 | 0.200 |
| Doubao-Seed-2.0-pro | 0.428 | 0.649 | 0.218 | 0.426 | 0.665 | 0.198 | 0.406 | 0.625 | 0.208 |
| DeepSeek-V4-Pro | 0.456 | 0.652 | 0.267 | 0.427 | 0.658 | 0.207 | 0.424 | 0.618 | 0.247 |

Thinking models:

| Model | Full Context Avg@4 | Full Context Pass@4 | Full Context Pass^4 | Agentic Memory Avg@4 | Agentic Memory Pass@4 | Agentic Memory Pass^4 | RAG Memory Avg@4 | RAG Memory Pass@4 | RAG Memory Pass^4 |
|---|---|---|---|---|---|---|---|---|---|
| o4-mini | 0.210 | 0.433 | 0.047 | 0.270 | 0.533 | 0.073 | 0.261 | 0.452 | 0.091 |
| Gemini-2.5-Flash | 0.282 | 0.556 | 0.063 | 0.312 | 0.567 | 0.098 | 0.309 | 0.544 | 0.107 |
| Qwen3-Max | 0.284 | 0.499 | 0.105 | 0.324 | 0.599 | 0.091 | 0.315 | 0.519 | 0.134 |
| Kimi-K2.6 | 0.293 | 0.533 | 0.099 | 0.280 | 0.508 | 0.088 | 0.303 | 0.511 | 0.118 |
| Gemini-2.5-Pro | 0.331 | 0.605 | 0.109 | 0.378 | 0.638 | 0.138 | 0.320 | 0.579 | 0.109 |
| MiniMax-M2.7 | 0.345 | 0.584 | 0.145 | 0.351 | 0.609 | 0.124 | 0.314 | 0.518 | 0.143 |
| GLM-4.6 | 0.359 | 0.612 | 0.116 | 0.351 | 0.625 | 0.107 | 0.336 | 0.574 | 0.135 |
| GLM-4.5 | 0.364 | 0.623 | 0.156 | 0.311 | 0.596 | 0.106 | 0.336 | 0.555 | 0.147 |
| Doubao-Seed-1.6 | 0.373 | 0.599 | 0.176 | 0.383 | 0.646 | 0.123 | 0.375 | 0.591 | 0.179 |
| GLM-5.1 | 0.394 | 0.587 | 0.213 | 0.352 | 0.556 | 0.150 | 0.328 | 0.485 | 0.185 |
| DeepSeek-R1-0528 | 0.396 | 0.691 | 0.131 | 0.412 | 0.712 | 0.118 | 0.390 | 0.643 | 0.153 |
| o3 | 0.403 | 0.653 | 0.169 | 0.401 | 0.669 | 0.154 | 0.362 | 0.587 | 0.158 |
| Claude-4.5-Sonnet | 0.417 | 0.658 | 0.197 | 0.397 | 0.642 | 0.178 | 0.374 | 0.573 | 0.186 |
| GPT-5 | 0.441 | 0.658 | 0.226 | 0.421 | 0.647 | 0.204 | 0.410 | 0.591 | 0.236 |
| DeepSeek-V4-Pro | 0.472 | 0.649 | 0.295 | 0.449 | 0.656 | 0.255 | 0.430 | 0.584 | 0.271 |
| Doubao-Seed-2.0-pro | 0.474 | 0.683 | 0.270 | 0.428 | 0.650 | 0.225 | 0.339 | 0.496 | 0.205 |
| Claude-Opus-4.6 | 0.503 | 0.664 | 0.337 | 0.454 | 0.645 | 0.259 | 0.430 | 0.566 | 0.299 |

- Avg@4: mean success rate over 4 independent rollouts per task (single-attempt success).
- Pass@4: fraction of tasks solved in at least one of 4 rollouts (best-of-4).
- Pass^4: fraction of tasks solved in all 4 rollouts (consistency).
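
The three metrics defined above can be computed from per-rollout success flags as in the following minimal sketch. The function name `summarize_rollouts` and the input layout (task id mapped to a list of booleans) are assumptions for illustration, not the benchmark's own evaluation code.

```python
def summarize_rollouts(results: dict) -> dict:
    """results maps task id -> list of k success flags, one per rollout."""
    n_tasks = len(results)
    flags_per_task = list(results.values())
    # Avg@k: per-task success rate, averaged over tasks.
    avg = sum(sum(flags) / len(flags) for flags in flags_per_task) / n_tasks
    # Pass@k: a task counts if any rollout succeeded.
    pass_at = sum(any(flags) for flags in flags_per_task) / n_tasks
    # Pass^k: a task counts only if every rollout succeeded.
    pass_hat = sum(all(flags) for flags in flags_per_task) / n_tasks
    return {"avg@k": avg, "pass@k": pass_at, "pass^k": pass_hat}


example = {
    "t1": [True, True, True, True],
    "t2": [True, False, False, False],
    "t3": [False, False, False, False],
    "t4": [True, True, False, False],
}
print(summarize_rollouts(example))
# {'avg@k': 0.4375, 'pass@k': 0.75, 'pass^k': 0.25}
```

For any set of rollouts, Pass^4 ≤ Avg@4 ≤ Pass@4, since a task solved in all rollouts contributes 1 to every metric and a task solved in none contributes 0 to every metric; the gap between Pass@4 and Pass^4 therefore reflects how inconsistent a model is across attempts.
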
If you use VitaBench 2.0 in your research, please cite:

    @article{chen2026vitabench,
      title={VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions},
      author={Chen, Yuxin and Zhang, Yi and Cai, Zhengzhou and Shi, Yaorui and Yao, Zhiyuan and Cui, Chenhang and Zheng, Jingnan and Huo, Yaqi and Su, Xi and Gu, Qi and others},
      journal={arXiv preprint arXiv:2605.27141},
      year={2026}
    }

This project is licensed under the MIT License.
