📄 English Summary
Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
Recent incidents have highlighted alarming cases of negative psychological outcomes resulting from human-AI interactions, including mental health crises and user harm. As large language models (LLMs) serve as sources of guidance, emotional support, and even informal therapy, the risks associated with these interactions are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges, as organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, a Multi-Trait Subspace Steering (MultiTraitsss) framework has been developed, leveraging established crisis-associated traits and novel subspace strategies to better understand and mitigate potential harms in human-AI interactions.