Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
📄 Summary
Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric, exposing vulnerabilities in non-English contexts, particularly for low-resource languages. The paper proposes and evaluates a novel application of knowledge distillation (KD) to multilingual jailbreak prevention. Using approximately 28,000 multilingual jailbreak prompts from XSafety, the refusal behavior of a proprietary teacher model (OpenAI o1-mini) is distilled into three open-source student models (Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B) via black-box, response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals...
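To make the black-box, response-based setup concrete, the sketch below fine-tunes a student model with LoRA adapters on (prompt, teacher-refusal) pairs, which is the standard way such distillation is done when only the teacher's text output is available. The student model name comes from the summary; the LoRA hyperparameters, the chat-formatting step, and the assumption that teacher refusals were collected from the o1-mini API beforehand are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of black-box, response-based KD via PEFT (LoRA), assuming
# teacher refusals were already collected as text from the o1-mini API.
# Hyperparameters and loss masking are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

STUDENT = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the three students

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
model = AutoModelForCausalLM.from_pretrained(STUDENT, torch_dtype=torch.bfloat16)

# Parameter-efficient fine-tuning: only low-rank adapter weights are trained.
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def kd_step(prompt: str, teacher_response: str) -> torch.Tensor:
    """One distillation step on a (jailbreak prompt, teacher refusal) pair."""
    chat = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher_response}]
    ids = tokenizer.apply_chat_template(chat, return_tensors="pt")
    # Black-box setting: the only teacher signal is its text output, so the
    # loss is plain causal-LM cross-entropy on the teacher's tokens; no
    # teacher logits or hidden states are used (for simplicity, the prompt
    # tokens are not masked out here).
    out = model(input_ids=ids, labels=ids)
    return out.loss
```

In practice the prompt tokens would usually be masked out of the loss so the adapters are trained only on the refusal text, and `kd_step` would run inside an ordinary optimizer loop over the ~28,000 XSafety prompts.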