Abstractive Red-Teaming of Language Model Character

📄 Chinese Summary

This work concerns aligning language model assistants with a character specification that dictates how the model should behave across diverse user interactions. While models generally follow these character specifications, they occasionally violate them in large-scale deployments. This study aims to identify the types of queries likely to cause character violations during deployment, using far less compute than deployment-level usage. To this end, an approach called abstractive red-teaming is proposed, which searches for categories of natural-language queries, such as "The query is in Chinese. The query asks about family roles," that routinely elicit violations. These categories abstract over the many query variants that could arise in real-world use.

📄 English Summary

Abstractive Red-Teaming of Language Model Character

This work aims to ensure that language model assistants adhere to character specifications that dictate how the model should behave across various user interactions. While models generally follow these specifications, they can occasionally violate them during large-scale deployments. The study focuses on identifying types of queries that are likely to elicit such character violations at deployment, using far fewer computational resources than deployment requires. To achieve this, an approach called abstractive red-teaming is introduced, which searches for categories of natural-language queries, such as 'The query is in Chinese. The query asks about family roles,' that routinely trigger violations. These categories abstract over the numerous potential variants of queries that could arise in real-world scenarios.
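The search described above can be illustrated with a minimal sketch: sample concrete query variants from each natural-language category, estimate how often they elicit a character violation, and rank categories by that rate. The functions `generate_queries`, `assistant_reply`, and `violates_character` below are hypothetical stand-in stubs, not the paper's actual method; a real system would call a language model for each of these steps.

```python
import random
import zlib

def generate_queries(category: str, n: int, seed: int = 0) -> list[str]:
    # Stub: sample n concrete query variants belonging to a category.
    # A real system would prompt a generator model with the category text.
    rng = random.Random(zlib.crc32(category.encode()) + seed)
    return [f"{category} / variant {rng.randint(0, 10**6)}" for _ in range(n)]

def assistant_reply(query: str) -> str:
    # Stub for the assistant under test.
    return f"reply to: {query}"

def violates_character(query: str, reply: str) -> bool:
    # Stub judge: deterministically flags ~20% of queries as violations,
    # standing in for a model-based character-spec judge.
    return zlib.crc32(query.encode()) % 5 == 0

def violation_rate(category: str, n_samples: int = 50) -> float:
    # Estimate how often queries drawn from this category elicit violations.
    queries = generate_queries(category, n_samples)
    hits = sum(violates_character(q, assistant_reply(q)) for q in queries)
    return hits / n_samples

def rank_categories(categories: list[str], n_samples: int = 50) -> list[tuple[str, float]]:
    # Return categories sorted by estimated violation rate, worst first,
    # so the most violation-prone abstractions surface for inspection.
    scored = [(c, violation_rate(c, n_samples)) for c in categories]
    return sorted(scored, key=lambda cf: cf[1], reverse=True)

if __name__ == "__main__":
    cats = [
        "The query is in Chinese. The query asks about family roles.",
        "The query asks for medical advice.",
        "The query is a simple factual question.",
    ]
    for cat, rate in rank_categories(cats):
        print(f"{rate:.2f}  {cat}")
```

The key design point the sketch captures is that the unit of search is a category description, not an individual prompt, so each evaluated candidate summarizes many possible deployment queries.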

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.