VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

🌍 An English version of the benchmark is coming soon.

📖 Introduction

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions.

VitaBench 2.0 extends VitaBench from one-shot tasks to long-term, multi-session user interactions, where an agent must infer, use, and update user preferences across fragmented conversations and behaviors that span days, weeks, or months. While VitaBench 1.0 measures whether an agent can complete a single complex life-serving request, VitaBench 2.0 asks a further question: can an agent understand the user from daily interactions, anticipate their evolving needs, and act on their behalf over time?

Each evaluation in VitaBench 2.0 simulates a continuing relationship between an agent and a user across multiple sessions in daily scenarios. Across these sessions, user preferences drift, prior commitments must be honored, and earlier context must be retrieved or reconstructed to act correctly in the present. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across three representative memory architectures:

Full Context: the entire interaction history is appended to the prompt, which gives an upper bound on what the model can leverage.
Agentic Memory: the agent autonomously decides what to write to and read from a structured memory store.
RAG Memory: past interactions are chunked, embedded, and retrieved on demand.
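
To make the contrast between these architectures concrete, here is a minimal Python sketch of what a pluggable memory interface could look like. It is not the interface shipped in src/vita: the class names (`Memory`, `FullContextMemory`, `ToyRAGMemory`) and the `write`/`read` methods are hypothetical, each interaction is treated as a single chunk, and a bag-of-words cosine similarity stands in for a real embedding model.

```python
import math
import re
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Tuple


class Memory(ABC):
    """Hypothetical interface: store past interactions, return context for a query."""

    @abstractmethod
    def write(self, interaction: str) -> None: ...

    @abstractmethod
    def read(self, query: str) -> str: ...


class FullContextMemory(Memory):
    """Returns the whole history, in order, regardless of the query."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def write(self, interaction: str) -> None:
        self.history.append(interaction)

    def read(self, query: str) -> str:
        return "\n".join(self.history)


def _bow(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyRAGMemory(Memory):
    """Stores each interaction as one chunk and returns the top-k most similar."""

    def __init__(self, top_k: int = 1) -> None:
        self.top_k = top_k
        self.chunks: List[Tuple[str, Counter]] = []

    def write(self, interaction: str) -> None:
        self.chunks.append((interaction, _bow(interaction)))

    def read(self, query: str) -> str:
        q = _bow(query)
        ranked = sorted(self.chunks, key=lambda c: _cosine(q, c[1]), reverse=True)
        return "\n".join(text for text, _ in ranked[: self.top_k])
```

Because both classes share the same two methods, an evaluation loop can swap one for the other without changing the agent code, which is the kind of controlled comparison described above.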

Our results show that even state-of-the-art models reach only about 50% Avg@4 under Full Context (the best score on the leaderboard is 0.503) and degrade further under realistic memory settings, indicating that long-horizon personalization and proactivity remain open challenges for current LLM agents.

🛠️ Quick Start

1. Install

git clone https://github.com/meituan-longcat/VitaBench-2.0.git
cd VitaBench-2.0
pip install -e .

This installs the vita CLI.

2. Download the dataset

VitaBench 2.0 tasks are hosted on Hugging Face: meituan-longcat/VitaBench-2.0.

pip install -U "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/VitaBench-2.0 \
  --repo-type dataset \
  --local-dir data/vita/domains/personalization

After downloading, you should have data/vita/domains/personalization/tasks.json (56 users, 771 subtasks).
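
A quick way to confirm the download landed in the right place is to parse the file from Python. The sketch below deliberately assumes nothing about the JSON schema, which this README does not describe, so it only reports the top-level container; the helper name `describe_tasks_file` is made up for illustration.

```python
import json
from pathlib import Path


def describe_tasks_file(path: str) -> str:
    """Parse a JSON file and report its top-level type and size."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"{p} not found; re-run the download step")
    with p.open(encoding="utf-8") as f:
        data = json.load(f)
    # Only the top-level container is inspected; no field names are assumed.
    size = len(data) if isinstance(data, (list, dict)) else 1
    return f"{p.name}: top-level {type(data).__name__} with {size} entries"
```

Calling `describe_tasks_file("data/vita/domains/personalization/tasks.json")` from the repository root should return a one-line summary if the file is present and valid JSON.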

3. Configure the LLM

cp src/vita/models.yaml.example src/vita/models.yaml
export OPENAI_API_KEY=sk-...

src/vita/models.yaml supports any OpenAI-compatible endpoint: change default.base_url to point at Azure, vLLM, Together, llama.cpp, or another compatible server. The YAML supports ${VAR} placeholders, which are expanded from your shell environment.
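
How vita performs the ${VAR} expansion internally is not shown here. As an illustration of the idea only, the following sketch substitutes placeholders from an environment mapping and leaves unknown variables untouched; the function name `expand_placeholders` and the sample config line are assumptions, not part of the project.

```python
import os
import re
from typing import Mapping, Optional

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR} with env[VAR]; unknown variables are left untouched."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), text)


# Hypothetical config line; the real models.yaml.example may look different.
line = "api_key: ${OPENAI_API_KEY}"
print(expand_placeholders(line, {"OPENAI_API_KEY": "sk-test"}))  # api_key: sk-test
```

Leaving unresolved placeholders in place, rather than replacing them with an empty string, makes a missing export easy to spot in the resulting config.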

For RAG / embeddings:

Env var	Default	Purpose
`VITA_EMBEDDING_URL`	`models.yaml default.base_url`	Embedding endpoint
`VITA_EMBEDDING_KEY`	`models.yaml default.api_key`	Embedding API key
`VITA_EMBEDDING_MODEL`	`text-embedding-3-large`	Embedding model name
`VITA_EMBEDDING_MAX_CONCURRENCY`	`64`	Per-event-loop semaphore size
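
The fallback behaviour in the table can be read as "environment variable first, documented default second". A small sketch of that resolution order follows; the function name `resolve_embedding_config` and the returned dictionary keys are hypothetical, and `models_default` stands in for the `default` section of models.yaml.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_embedding_config(
    models_default: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, object]:
    """Apply the fallbacks from the table: env var first, then the documented default."""
    env = os.environ if env is None else env
    return {
        "url": env.get("VITA_EMBEDDING_URL", models_default["base_url"]),
        "key": env.get("VITA_EMBEDDING_KEY", models_default["api_key"]),
        "model": env.get("VITA_EMBEDDING_MODEL", "text-embedding-3-large"),
        # The table lists 64 as the default semaphore size.
        "max_concurrency": int(env.get("VITA_EMBEDDING_MAX_CONCURRENCY", "64")),
    }
```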

4. Run an evaluation

vita run \
  --domain personalization \
  --memory-type rewrite \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  --evaluator-llm gpt-4.1 \
  --num-tasks 1 --max-steps 50

--save-to <name>.json writes results under data/simulations/.

5. Run all memory backends

bash scripts/run_memory_benchmark.sh
# or a subset:
bash scripts/run_memory_benchmark.sh full_context rewrite rag

Memory backends

`--memory-type`	Behaviour
`null`	No memory (baseline)
`groundtruth`	Injects the canonical preference memory directly (upper bound)
`full_context`	Dumps every prior interaction as context
`rewrite`	LLM rewrites a single consolidated memory string each update
`rag`	Async vector retrieval (`text-embedding-3-large` by default)
`rag_cache`	RAG with a precomputed embedding cache (see `scripts/precompute_rag_cache.py`)
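
For readers unfamiliar with the `rewrite` strategy, the sketch below shows the general pattern of keeping one consolidated memory string and asking an LLM to rewrite it after every interaction. It is an illustration under stated assumptions, not the project's implementation: the class `RewriteMemory`, its methods, the prompt wording, and the offline stub `fake_llm` are all invented here.

```python
from typing import Callable


class RewriteMemory:
    """Keeps one consolidated memory string and asks an LLM to rewrite it on each update."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm    # any function mapping a prompt to a completion
        self.memory = ""  # the single consolidated memory string

    def update(self, interaction: str) -> None:
        prompt = (
            "Current memory:\n" + self.memory + "\n\n"
            "New interaction:\n" + interaction + "\n\n"
            "Rewrite the memory so it reflects the user's current preferences."
        )
        self.memory = self.llm(prompt)

    def read(self) -> str:
        return self.memory


def fake_llm(prompt: str) -> str:
    # Stub so the example runs offline: keep only the newest interaction text.
    return prompt.split("New interaction:\n")[1].split("\n\n")[0]
```

With the stub, `RewriteMemory(fake_llm)` simply remembers the latest interaction; a real model would be expected to merge old and new information and resolve conflicts between them.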

Per-backend defaults live in src/vita/memory.yaml; constructor kwargs override.
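
The override rule can be pictured as a dictionary merge in which explicit keyword arguments win over file defaults. In this sketch the option names (`top_k`, `max_chars`) and the `BACKEND_DEFAULTS` table are hypothetical placeholders for whatever memory.yaml actually contains.

```python
from typing import Any, Dict

# Hypothetical defaults, standing in for values loaded from memory.yaml.
BACKEND_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "rag": {"top_k": 5, "embedding_model": "text-embedding-3-large"},
    "rewrite": {"max_chars": 4000},
}


def build_backend_config(memory_type: str, **kwargs: Any) -> Dict[str, Any]:
    """Start from the per-backend defaults, then let explicit kwargs win."""
    config = dict(BACKEND_DEFAULTS.get(memory_type, {}))  # copy, so defaults stay intact
    config.update(kwargs)
    return config
```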

🏆 Leaderboard

Performance of non-thinking and thinking models under three memory settings. Each table is sorted in ascending order of Avg@4 under Full Context, so the strongest model appears last. Best results in each column are in bold.

Non-thinking Models

Model	Full Context Avg@4	Full Context Pass@4	Full Context Pass^4	Agentic Memory Avg@4	Agentic Memory Pass@4	Agentic Memory Pass^4	RAG Memory Avg@4	RAG Memory Pass@4	RAG Memory Pass^4
GPT-4o-mini	0.067	0.180	0.006	0.084	0.229	0.008	0.094	0.227	0.011
GPT-3.5-Turbo	0.140	0.314	0.019	0.231	0.467	0.056	0.205	0.409	0.059
LongCat-Flash-Chat	0.298	0.510	0.123	0.302	0.537	0.105	0.290	0.471	0.136
GLM-4.5	0.307	0.529	0.127	0.330	0.569	0.112	0.316	0.523	0.152
Doubao-Seed-1.6	0.326	0.512	0.171	0.340	0.576	0.129	0.351	0.543	0.174
GLM-4.6	0.342	0.612	0.113	0.336	0.623	0.084	0.317	0.555	0.123
Kimi-K2.6	0.378	0.632	0.147	0.397	0.674	0.145	0.383	0.621	0.163
GLM-5.1	0.420	0.654	0.204	0.423	0.664	0.182	0.383	0.585	0.200
Doubao-Seed-2.0-pro	0.428	0.649	0.218	0.426	0.665	0.198	0.406	0.625	0.208
DeepSeek-V4-Pro	0.456	0.652	0.267	0.427	0.658	0.207	0.424	0.618	0.247

Thinking Models

Model	Full Context Avg@4	Full Context Pass@4	Full Context Pass^4	Agentic Memory Avg@4	Agentic Memory Pass@4	Agentic Memory Pass^4	RAG Memory Avg@4	RAG Memory Pass@4	RAG Memory Pass^4
o4-mini	0.210	0.433	0.047	0.270	0.533	0.073	0.261	0.452	0.091
Gemini-2.5-Flash	0.282	0.556	0.063	0.312	0.567	0.098	0.309	0.544	0.107
Qwen3-Max	0.284	0.499	0.105	0.324	0.599	0.091	0.315	0.519	0.134
Kimi-K2.6	0.293	0.533	0.099	0.280	0.508	0.088	0.303	0.511	0.118
Gemini-2.5-Pro	0.331	0.605	0.109	0.378	0.638	0.138	0.320	0.579	0.109
MiniMax-M2.7	0.345	0.584	0.145	0.351	0.609	0.124	0.314	0.518	0.143
GLM-4.6	0.359	0.612	0.116	0.351	0.625	0.107	0.336	0.574	0.135
GLM-4.5	0.364	0.623	0.156	0.311	0.596	0.106	0.336	0.555	0.147
Doubao-Seed-1.6	0.373	0.599	0.176	0.383	0.646	0.123	0.375	0.591	0.179
GLM-5.1	0.394	0.587	0.213	0.352	0.556	0.150	0.328	0.485	0.185
DeepSeek-R1-0528	0.396	0.691	0.131	0.412	0.712	0.118	0.390	0.643	0.153
o3	0.403	0.653	0.169	0.401	0.669	0.154	0.362	0.587	0.158
Claude-4.5-Sonnet	0.417	0.658	0.197	0.397	0.642	0.178	0.374	0.573	0.186
GPT-5	0.441	0.658	0.226	0.421	0.647	0.204	0.410	0.591	0.236
DeepSeek-V4-Pro	0.472	0.649	0.295	0.449	0.656	0.255	0.430	0.584	0.271
Doubao-Seed-2.0-pro	0.474	0.683	0.270	0.428	0.650	0.225	0.339	0.496	0.205
Claude-Opus-4.6	0.503	0.664	0.337	0.454	0.645	0.259	0.430	0.566	0.299

Avg@4: mean success rate over 4 independent rollouts per task (single-attempt success).
Pass@4: fraction of tasks solved in at least one of 4 rollouts (best-of-4).
Pass^4: fraction of tasks solved in all 4 rollouts (consistency).
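
The three metrics follow directly from per-rollout success flags. The sketch below computes them as defined above; the function name `rollout_metrics`, the input layout, and the demo outcomes are made up for illustration and do not reflect the format of the files written by `vita run`.

```python
from typing import Dict, List


def rollout_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps task id -> success flags, one per rollout (same k for every task)."""
    n = len(results)
    # Avg@k: per-task success rate, averaged over tasks.
    avg = sum(sum(r) / len(r) for r in results.values()) / n
    # Pass@k: a task counts if at least one rollout succeeded.
    pass_at_k = sum(any(r) for r in results.values()) / n
    # Pass^k: a task counts only if every rollout succeeded.
    pass_hat_k = sum(all(r) for r in results.values()) / n
    return {"avg@k": avg, "pass@k": pass_at_k, "pass^k": pass_hat_k}


# Made-up outcomes for three tasks with 4 rollouts each.
demo = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, False, False],
    "task_c": [False, False, False, False],
}
print(rollout_metrics(demo))
```

For the demo data, Avg@4 is 5/12 (about 0.417), Pass@4 is 2/3, and Pass^4 is 1/3. By construction Pass^4 ≤ Avg@4 ≤ Pass@4 for any set of results, which matches the ordering seen in every leaderboard row.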

📚 Citation

If you use VitaBench 2.0 in your research, please cite:

@article{chen2026vitabench,
  title={VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions},
  author={Chen, Yuxin and Zhang, Yi and Cai, Zhengzhou and Shi, Yaorui and Yao, Zhiyuan and Cui, Chenhang and Zheng, Jingnan and Huo, Yaqi and Su, Xi and Gu, Qi and others},
  journal={arXiv preprint arXiv:2605.27141},
  year={2026}
}

📜 License

This project is licensed under the MIT License.
