meituan-longcat/VitaBench-2.0

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

📃 Paper • 🌐 Website • 🤗 Dataset

🌍 An English version of the benchmark is coming soon.

📖 Introduction

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. We introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions.

VitaBench 2.0 extends VitaBench from one-shot tasks to long-term, multi-session user interactions, where an agent must infer, utilize, and update user preferences across fragmented conversations and behaviors that span days, weeks, or months. While VitaBench 1.0 measures whether an agent can complete a single complex life-serving request, VitaBench 2.0 asks a further question: can an agent understand the user from daily interactions, anticipate their evolving needs, and act on their behalf over time?

[Figure: VitaBench 2.0 Overview]

Each evaluation in VitaBench 2.0 simulates a continuing relationship between an agent and a user across multiple sessions in daily scenarios. Across these sessions, user preferences drift, prior commitments must be honored, and earlier context must be retrieved or reconstructed to act correctly in the present. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across three representative memory architectures:

  • Full Context: the entire interaction history is appended to the prompt, which gives an upper bound on what the model can possibly leverage.
  • Agentic Memory: the agent autonomously decides what to write to and read from a structured memory store.
  • RAG Memory: past interactions are chunked, embedded, and retrieved on demand.
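
One way to picture these architectures is as implementations of a single small interface. The sketch below is illustrative only: the `Memory` base class, the class names, and the word-overlap retriever are assumptions made for this example and are not the repository's actual API. A real RAG backend would rank chunks by embedding similarity, and an agentic backend would call an LLM to decide what to store.

```python
from abc import ABC, abstractmethod


class Memory(ABC):
    """Hypothetical interface: record past turns, then build context for a query."""

    @abstractmethod
    def write(self, text: str) -> None: ...

    @abstractmethod
    def read(self, query: str) -> str: ...


class FullContextMemory(Memory):
    """Returns the entire history verbatim, regardless of the query."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def write(self, text: str) -> None:
        self.history.append(text)

    def read(self, query: str) -> str:
        return "\n".join(self.history)


class ToyRetrievalMemory(Memory):
    """Stand-in for RAG: ranks stored chunks by word overlap with the query."""

    def __init__(self, top_k: int = 1) -> None:
        self.chunks: list[str] = []
        self.top_k = top_k

    def write(self, text: str) -> None:
        self.chunks.append(text)

    def read(self, query: str) -> str:
        query_words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
        return "\n".join(ranked[: self.top_k])
```

Keeping `write` and `read` as the only entry points is what would make a controlled comparison possible: the agent loop stays fixed while the memory object is swapped.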

Our results show that even state-of-the-art models reach only about 50% Avg@4 under Full Context and degrade further under realistic memory settings, indicating that long-horizon personalization and proactivity remain open challenges for current LLM agents.

πŸ› οΈ Quick Start

1. Install

git clone https://github.com/meituan-longcat/VitaBench-2.0.git
cd VitaBench-2.0
pip install -e .

This installs the vita CLI.

2. Download the dataset

VitaBench 2.0 tasks are hosted on Hugging Face: meituan-longcat/VitaBench-2.0.

pip install -U "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/VitaBench-2.0 \
  --repo-type dataset \
  --local-dir data/vita/domains/personalization

After downloading, you should have data/vita/domains/personalization/tasks.json (56 users, 771 subtasks).
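
A quick sanity check that the file downloaded and parses can be sketched as follows. The internal layout of tasks.json is not described here, so this hypothetical helper (`describe_json` is a name chosen for this example) only reports the top-level container and its size; it does not assume how the 56 users or 771 subtasks are nested.

```python
import json
from pathlib import Path


def describe_json(path: str) -> str:
    """Report the top-level container type and size of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, (list, dict)):
        return f"{type(data).__name__} with {len(data)} top-level entries"
    return type(data).__name__


if __name__ == "__main__":
    tasks = Path("data/vita/domains/personalization/tasks.json")
    if tasks.exists():
        print(describe_json(str(tasks)))
    else:
        print(f"{tasks} not found; re-run the download step")
```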

3. Configure the LLM

cp src/vita/models.yaml.example src/vita/models.yaml
export OPENAI_API_KEY=sk-...

src/vita/models.yaml supports any OpenAI-compatible endpoint: change default.base_url to point at Azure, vLLM, Together, llama.cpp, etc. The YAML supports ${VAR} placeholders, which are expanded from your shell environment.
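
To make the placeholder behaviour concrete, here is a minimal sketch of how ${VAR} expansion could work. It is not the repository's loader: `expand_placeholders` is a hypothetical name, and leaving unset variables untouched is an assumption (the actual loader might raise an error or substitute an empty string instead).

```python
import os
import re

# Matches ${NAME} where NAME is a shell-style identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(text: str, env=None) -> str:
    """Replace each ${VAR} with its value from env; leave unknown names as-is."""
    env = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(0)), text)


print(expand_placeholders("api_key: ${OPENAI_API_KEY}", {"OPENAI_API_KEY": "sk-test"}))
```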

For RAG / embeddings:

| Env var | Default | Purpose |
| --- | --- | --- |
| VITA_EMBEDDING_URL | models.yaml default.base_url | Embedding endpoint |
| VITA_EMBEDDING_KEY | models.yaml default.api_key | Embedding API key |
| VITA_EMBEDDING_MODEL | text-embedding-3-large | Embedding model name |
| VITA_EMBEDDING_MAX_CONCURRENCY | 64 | Per-event-loop semaphore size |
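
The fallback rules in the table above can be sketched as a small resolver. This is an illustration under assumptions, not the project's code: `resolve_embedding_config` and its dictionary keys are hypothetical, and the two `default_*` arguments stand in for the values read from models.yaml.

```python
import os


def resolve_embedding_config(default_base_url: str, default_api_key: str, env=None) -> dict:
    """Apply the documented env-var overrides on top of the models.yaml defaults."""
    env = os.environ if env is None else env
    return {
        "url": env.get("VITA_EMBEDDING_URL", default_base_url),
        "key": env.get("VITA_EMBEDDING_KEY", default_api_key),
        "model": env.get("VITA_EMBEDDING_MODEL", "text-embedding-3-large"),
        # The semaphore size must be an integer, so parse it explicitly.
        "max_concurrency": int(env.get("VITA_EMBEDDING_MAX_CONCURRENCY", "64")),
    }
```

With no variables set, every field falls back to its default; setting a single variable such as VITA_EMBEDDING_MODEL changes only that field.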

4. Run an evaluation

vita run \
  --domain personalization \
  --memory-type rewrite \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  --evaluator-llm gpt-4.1 \
  --num-tasks 1 --max-steps 50

--save-to <name>.json writes results under data/simulations/.

5. Run all memory backends

bash scripts/run_memory_benchmark.sh
# or a subset:
bash scripts/run_memory_benchmark.sh full_context rewrite rag
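
If you prefer to drive the sweep yourself rather than use the bundled script, the commands can be generated programmatically. The sketch below only prints the command lines; it uses the flags shown in step 4, while the helper name `build_vita_command` and the choice to name each output file after its backend are assumptions for this example. What the bundled script does internally is not shown here.

```python
import shlex


def build_vita_command(memory_type: str, model: str = "gpt-4.1") -> list[str]:
    """Assemble a `vita run` argument list for one memory backend."""
    return [
        "vita", "run",
        "--domain", "personalization",
        "--memory-type", memory_type,
        "--agent-llm", model,
        "--user-llm", model,
        "--evaluator-llm", model,
        "--num-tasks", "1",
        "--max-steps", "50",
        "--save-to", f"{memory_type}.json",  # assumed naming scheme
    ]


for backend in ("full_context", "rewrite", "rag"):
    # Print instead of executing, so the commands can be reviewed first.
    print(shlex.join(build_vita_command(backend)))
```

Passing each list to `subprocess.run(..., check=True)` would execute the runs sequentially.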

Memory backends

| --memory-type | Behaviour |
| --- | --- |
| null | No memory (baseline) |
| groundtruth | Injects the canonical preference memory directly (upper bound) |
| full_context | Dumps every prior interaction as context |
| rewrite | LLM rewrites a single consolidated memory string each update |
| rag | Async vector retrieval (text-embedding-3-large by default) |
| rag_cache | RAG with a precomputed embedding cache (see scripts/precompute_rag_cache.py) |

Per-backend defaults live in src/vita/memory.yaml; keyword arguments passed to a backend's constructor override them.
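
The override rule amounts to a dictionary merge in which explicitly passed values win. A minimal sketch, with the caveat that `merge_backend_config` and the option names `top_k` and `chunk_size` are hypothetical and may not match the keys actually present in memory.yaml:

```python
def merge_backend_config(defaults: dict, **overrides) -> dict:
    """Return a new config in which keyword overrides take precedence over defaults."""
    return {**defaults, **overrides}


yaml_defaults = {"top_k": 5, "chunk_size": 512}  # hypothetical values
print(merge_backend_config(yaml_defaults, top_k=10))
```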

πŸ† Leaderboard

Performance of non-thinking and thinking models under three memory settings. Each table is sorted in ascending order of Avg@4 under Full Context. Best results in each column are in bold.

Non-thinking Models

| Model | Full Context Avg@4 | Full Context Pass@4 | Full Context Pass^4 | Agentic Memory Avg@4 | Agentic Memory Pass@4 | Agentic Memory Pass^4 | RAG Memory Avg@4 | RAG Memory Pass@4 | RAG Memory Pass^4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-mini | 0.067 | 0.180 | 0.006 | 0.084 | 0.229 | 0.008 | 0.094 | 0.227 | 0.011 |
| GPT-3.5-Turbo | 0.140 | 0.314 | 0.019 | 0.231 | 0.467 | 0.056 | 0.205 | 0.409 | 0.059 |
| LongCat-Flash-Chat | 0.298 | 0.510 | 0.123 | 0.302 | 0.537 | 0.105 | 0.290 | 0.471 | 0.136 |
| GLM-4.5 | 0.307 | 0.529 | 0.127 | 0.330 | 0.569 | 0.112 | 0.316 | 0.523 | 0.152 |
| Doubao-Seed-1.6 | 0.326 | 0.512 | 0.171 | 0.340 | 0.576 | 0.129 | 0.351 | 0.543 | 0.174 |
| GLM-4.6 | 0.342 | 0.612 | 0.113 | 0.336 | 0.623 | 0.084 | 0.317 | 0.555 | 0.123 |
| Kimi-K2.6 | 0.378 | 0.632 | 0.147 | 0.397 | 0.674 | 0.145 | 0.383 | 0.621 | 0.163 |
| GLM-5.1 | 0.420 | 0.654 | 0.204 | 0.423 | 0.664 | 0.182 | 0.383 | 0.585 | 0.200 |
| Doubao-Seed-2.0-pro | 0.428 | 0.649 | 0.218 | 0.426 | 0.665 | 0.198 | 0.406 | 0.625 | 0.208 |
| DeepSeek-V4-Pro | 0.456 | 0.652 | 0.267 | 0.427 | 0.658 | 0.207 | 0.424 | 0.618 | 0.247 |

Thinking Models

| Model | Full Context Avg@4 | Full Context Pass@4 | Full Context Pass^4 | Agentic Memory Avg@4 | Agentic Memory Pass@4 | Agentic Memory Pass^4 | RAG Memory Avg@4 | RAG Memory Pass@4 | RAG Memory Pass^4 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| o4-mini | 0.210 | 0.433 | 0.047 | 0.270 | 0.533 | 0.073 | 0.261 | 0.452 | 0.091 |
| Gemini-2.5-Flash | 0.282 | 0.556 | 0.063 | 0.312 | 0.567 | 0.098 | 0.309 | 0.544 | 0.107 |
| Qwen3-Max | 0.284 | 0.499 | 0.105 | 0.324 | 0.599 | 0.091 | 0.315 | 0.519 | 0.134 |
| Kimi-K2.6 | 0.293 | 0.533 | 0.099 | 0.280 | 0.508 | 0.088 | 0.303 | 0.511 | 0.118 |
| Gemini-2.5-Pro | 0.331 | 0.605 | 0.109 | 0.378 | 0.638 | 0.138 | 0.320 | 0.579 | 0.109 |
| MiniMax-M2.7 | 0.345 | 0.584 | 0.145 | 0.351 | 0.609 | 0.124 | 0.314 | 0.518 | 0.143 |
| GLM-4.6 | 0.359 | 0.612 | 0.116 | 0.351 | 0.625 | 0.107 | 0.336 | 0.574 | 0.135 |
| GLM-4.5 | 0.364 | 0.623 | 0.156 | 0.311 | 0.596 | 0.106 | 0.336 | 0.555 | 0.147 |
| Doubao-Seed-1.6 | 0.373 | 0.599 | 0.176 | 0.383 | 0.646 | 0.123 | 0.375 | 0.591 | 0.179 |
| GLM-5.1 | 0.394 | 0.587 | 0.213 | 0.352 | 0.556 | 0.150 | 0.328 | 0.485 | 0.185 |
| DeepSeek-R1-0528 | 0.396 | 0.691 | 0.131 | 0.412 | 0.712 | 0.118 | 0.390 | 0.643 | 0.153 |
| o3 | 0.403 | 0.653 | 0.169 | 0.401 | 0.669 | 0.154 | 0.362 | 0.587 | 0.158 |
| Claude-4.5-Sonnet | 0.417 | 0.658 | 0.197 | 0.397 | 0.642 | 0.178 | 0.374 | 0.573 | 0.186 |
| GPT-5 | 0.441 | 0.658 | 0.226 | 0.421 | 0.647 | 0.204 | 0.410 | 0.591 | 0.236 |
| DeepSeek-V4-Pro | 0.472 | 0.649 | 0.295 | 0.449 | 0.656 | 0.255 | 0.430 | 0.584 | 0.271 |
| Doubao-Seed-2.0-pro | 0.474 | 0.683 | 0.270 | 0.428 | 0.650 | 0.225 | 0.339 | 0.496 | 0.205 |
| Claude-Opus-4.6 | 0.503 | 0.664 | 0.337 | 0.454 | 0.645 | 0.259 | 0.430 | 0.566 | 0.299 |

Avg@4: mean success rate over 4 independent rollouts per task (single-attempt success). Pass@4: fraction of tasks solved in at least one of 4 rollouts (best-of-4). Pass^4: fraction of tasks solved in all 4 rollouts (consistency).
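
These three definitions translate directly into code. The sketch below assumes each task's results are available as a list of booleans, one per rollout; the function names and that data layout are choices made for this example, not the evaluator's actual output format.

```python
def avg_at_k(outcomes: list[list[bool]]) -> float:
    """Mean over tasks of the per-task success rate."""
    per_task = [sum(task) / len(task) for task in outcomes]
    return sum(per_task) / len(per_task)


def pass_at_k(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks solved in at least one rollout."""
    return sum(any(task) for task in outcomes) / len(outcomes)


def pass_hat_k(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks solved in every rollout."""
    return sum(all(task) for task in outcomes) / len(outcomes)


# Three tasks, four rollouts each: always solved, solved once, never solved.
rollouts = [
    [True, True, True, True],
    [True, False, False, False],
    [False, False, False, False],
]
print(avg_at_k(rollouts), pass_at_k(rollouts), pass_hat_k(rollouts))
```

For any set of rollouts, Pass^4 ≤ Avg@4 ≤ Pass@4, because on each task "all succeed" implies a mean of 1 and a nonzero mean implies "at least one succeeds". The gap between Pass@4 and Pass^4 is therefore a direct measure of run-to-run inconsistency.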

📚 Citation

If you use VitaBench 2.0 in your research, please cite:

@article{chen2026vitabench,
  title={VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions},
  author={Chen, Yuxin and Zhang, Yi and Cai, Zhengzhou and Shi, Yaorui and Yao, Zhiyuan and Cui, Chenhang and Zheng, Jingnan and Huo, Yaqi and Su, Xi and Gu, Qi and others},
  journal={arXiv preprint arXiv:2605.27141},
  year={2026}
}

📜 License

This project is licensed under the MIT License.
