📄 Chinese Abstract
Recent advances in large language models (LLMs) have substantially improved their reasoning capabilities. However, evaluating these models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of reasoning tasks to machine-translation errors and has focused more on general language tasks than on reasoning benchmarks. To address this problem, UrduBench constructs a benchmark dedicated to Urdu reasoning tasks by introducing a contextually ensembled translation method combined with a human-in-the-loop mechanism. The benchmark aims to overcome the limitations of conventional machine translation, such as semantic drift and loss of context, by integrating the outputs of multiple translation models and refining them through human proofreading, thereby ensuring translation accuracy and fidelity to the reasoning logic of the original text.
📄 English Summary
UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop
Recent advancements in large language models (LLMs) have led to remarkable reasoning capabilities; however, evaluating these models in low-resource languages remains challenging due to the absence of standardized benchmarks. Specifically, Urdu reasoning evaluation has been hampered by the sensitivity of reasoning tasks to machine-translation errors and by an overemphasis on general language tasks rather than dedicated reasoning benchmarks. Addressing this gap, UrduBench introduces a novel approach to Urdu reasoning evaluation that leverages contextually ensembled translations augmented with a human-in-the-loop mechanism. This benchmark aims to overcome the inherent limitations of conventional machine translation, such as semantic drift and loss of contextual nuance, by integrating outputs from multiple translation models and subsequently refining them through expert human review. The construction of UrduBench involves carefully selecting reasoning samples from existing English datasets that align with Urdu cultural and linguistic contexts. A multi-model translation strategy generates initial translations, which are then thoroughly reviewed, revised, and annotated by native Urdu-speaking experts. This rigorous human-in-the-loop process ensures that the translated reasoning tasks are accurate, high-quality, and faithful to their original logical structures. UrduBench thus provides a reliable tool for assessing the true reasoning capabilities of LLMs on complex Urdu tasks, fostering progress in Urdu natural language processing and offering a new paradigm for evaluating AI models in low-resource languages.
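The summary describes the construction pipeline only in prose, so a compact sketch may help make the ensemble-plus-review flow concrete. The sketch below is an illustrative assumption, not the authors' implementation: the translator stubs (`mt_a`, `mt_b`), the `Candidate` record, and the `human_review` placeholder are all hypothetical names introduced here, and the real pipeline would call actual MT systems and route every draft to native Urdu-speaking annotators.

```python
# Minimal sketch of a contextually ensembled translation pipeline with a
# human-in-the-loop step, as described in the summary above. Everything
# here (stub translators, Candidate record, review placeholder) is an
# illustrative assumption, not the authors' code.
from dataclasses import dataclass
from typing import Callable, Dict, List

# A translator takes (source_text, context) and returns an Urdu draft.
Translator = Callable[[str, str], str]

@dataclass
class Candidate:
    model: str        # which MT system produced this draft
    text: str         # the Urdu draft itself
    approved: bool = False
    notes: str = ""

def ensemble_translate(sample: str, context: str,
                       translators: Dict[str, Translator]) -> List[Candidate]:
    """Collect one context-aware draft from every model in the ensemble."""
    return [Candidate(name, fn(sample, context))
            for name, fn in translators.items()]

def human_review(candidates: List[Candidate]) -> Candidate:
    """Placeholder for the human-in-the-loop step: in the real pipeline a
    native Urdu-speaking expert compares all drafts, picks or merges one,
    revises it, and records annotations. Here we simply take the first
    draft as a stand-in for that choice."""
    chosen = candidates[0]
    chosen.approved = True
    chosen.notes = "reviewed: terminology and reasoning structure preserved"
    return chosen

if __name__ == "__main__":
    # Stub translators standing in for real MT systems.
    translators: Dict[str, Translator] = {
        "mt_a": lambda s, c: f"[mt_a Urdu draft of: {s} | context: {c}]",
        "mt_b": lambda s, c: f"[mt_b Urdu draft of: {s} | context: {c}]",
    }
    sample = "If all roses are flowers and some flowers fade quickly, ..."
    context = "logical-reasoning item; preserve quantifiers and negation"
    drafts = ensemble_translate(sample, context, translators)
    final = human_review(drafts)
    print(final)
```

One reason to ensemble per item rather than commit to a single MT system, as the summary suggests, is that divergent drafts flag exactly the passages where a lone model might silently alter the reasoning, which is where the human reviewer's attention is most valuable.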