AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

📄 English Summary

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has emerged as a critical frontier. Progress, however, is hindered by the lack of a gold-standard evaluation benchmark. Existing benchmarks either overlook the management of personalized information that personalization depends on, or rely heavily on synthetic dialogues, which exhibit a significant distribution gap from real-world conversations. To address this, the paper introduces AlpsBench, an LLM personalization benchmark built from real human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that capture both explicit and implicit personalization signals.
