ChiEngMixBench: Evaluating Chinese-English Code-Mixed Generation in Large Language Models

📄 Abstract (translated from Chinese)

Code-mixing is increasingly common in human-computer interaction, yet existing research often reduces it to a translation or convertibility problem, making it difficult to assess whether a model's switching behavior fits the context and matches human conventions. ChiEngMixBench is the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built on a general framework intended to capture the complexity and naturalness of code-mixing. The benchmark collects and annotates a large volume of real-world Chinese-English code-mixed conversational data spanning multiple domains and contexts to ensure comprehensive and representative evaluation; data collection strictly follows privacy-protection and ethical guidelines. ChiEngMixBench focuses in particular on when, where, and how models choose to mix codes in different contexts, and introduces multi-dimensional evaluation metrics covering linguistic fluency, naturalness of code-mixing, semantic consistency, and understanding of human intent. These metrics are quantified through a combination of automatic and human evaluation to provide a more comprehensive and fine-grained analysis. In addition, the benchmark includes several challenging tasks, such as code-mixed generation under specific topical, emotional, or contextual constraints, and simulated code-mixed dialogue between users of different language proficiencies; these tasks reveal more deeply the strengths and weaknesses of large language models in handling code-mixing. ChiEngMixBench aims to advance code-mixed generation in large language models so that they can better understand and produce code-mixed text that fits human conventions and context, improving the naturalness and effectiveness of human-computer interaction. The benchmark also provides a valuable dataset and evaluation tools for future research, deepening the understanding of code-mixing in computational linguistics and artificial intelligence.
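The multi-dimensional metrics described above could, for example, be combined into a single quality score via weighted aggregation of per-dimension scores. The dimension names, score ranges, and equal-weight default below are illustrative assumptions, not the benchmark's actual formula:

```python
def aggregate_score(scores, weights=None):
    """Combine per-dimension scores (each assumed to lie in [0, 1])
    into a single weighted average.

    `scores` maps dimension name -> score; `weights` maps dimension
    name -> relative weight (defaults to equal weighting). Both the
    dimension names and the weighting scheme are hypothetical.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in scores}
    total_weight = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total_weight


# Illustrative per-dimension scores for one generated response.
example = {
    "fluency": 0.9,
    "naturalness": 0.7,
    "semantic_consistency": 0.8,
    "intent_understanding": 0.6,
}
overall = aggregate_score(example)  # equal-weight mean of the four scores
```

In practice, automatic scorers and human raters would each fill in such a dictionary, and the weights would be tuned or validated against human preference judgments.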

📄 English Summary

ChiEngMixBench: Evaluating Large Language Models on Spontaneous and Natural Chinese-English Code-Mixed Generation

Code-mixing is increasingly prevalent in human-LLM interactions, yet existing work often reduces it to a translation or convertibility problem, making it difficult to assess whether a model's switching behavior is context-appropriate and aligned with human conventions. ChiEngMixBench is the first benchmark designed to evaluate code-mixing ability in authentic community contexts, built upon a general framework aimed at capturing the complexity and naturalness of code-mixing. The benchmark is constructed by collecting and annotating a large volume of real-world Chinese-English code-mixed conversational data, spanning diverse domains and contexts to ensure comprehensive and representative evaluation. The data collection process strictly adheres to privacy protection and ethical guidelines. ChiEngMixBench focuses specifically on when, where, and how a model chooses to mix codes in different contexts, introducing multi-dimensional evaluation metrics including linguistic fluency, naturalness of code-mixing, semantic consistency, and understanding of human intent. These metrics are quantified through a combination of automatic and human evaluation to provide a more comprehensive and nuanced analysis. Furthermore, the benchmark includes several challenging tasks, such as code-mixed generation under specific topical, emotional, or contextual constraints, and simulation of code-mixed conversations between users with different language proficiencies. These tasks aim to reveal more deeply the strengths and weaknesses of large language models in handling code-mixing. The goal of ChiEngMixBench is to advance large language models in code-mixed generation, enabling them to better understand and generate code-mixed text that aligns with human conventions and context, thereby enhancing the naturalness and effectiveness of human-computer interaction. The benchmark also provides a valuable dataset and evaluation tools for future research, fostering a deeper understanding of code-mixing phenomena in computational linguistics and artificial intelligence.
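As an illustration of the "when and where" aspect of switching, a minimal heuristic can segment a code-mixed string into language runs and count switch points. This sketch is not part of the benchmark; the script-based tagging below is an assumption that only works for Chinese-English pairs, where the two languages use disjoint scripts:

```python
import re

def segment_by_language(text):
    """Split a code-mixed string into runs tagged 'zh' or 'en'.

    A minimal heuristic: CJK Unified Ideographs are tagged 'zh',
    Latin-letter words 'en'. Whitespace and punctuation are dropped,
    and adjacent runs of the same language are merged.
    """
    runs = []
    for match in re.finditer(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z'-]*", text):
        token = match.group()
        lang = "zh" if re.match(r"[\u4e00-\u9fff]", token) else "en"
        if runs and runs[-1][0] == lang:
            runs[-1] = (lang, runs[-1][1] + token)  # merge same-language runs
        else:
            runs.append((lang, token))
    return runs

def switch_points(text):
    """Count language switches -- a crude proxy for mixing density."""
    runs = segment_by_language(text)
    return max(len(runs) - 1, 0)
```

For example, `switch_points("我今天有个deadline要赶")` yields 2 (zh → en → zh). A real evaluation of switch placement would go further, checking whether switches respect syntactic boundaries and match human annotators' preferences rather than merely counting them.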
