KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

Source: KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

Published: March 17, 2026

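📄 Summary (translated from Chinese)

Kazakh is a Turkic language written in Arabic, Cyrillic, and Latin scripts, which makes it distinctive for optical character recognition (OCR). Research on OCR for low-resource Kazakh scripts is very scarce, and there are currently no OCR benchmarks or images for the Arabic and Latin scripts. A synthetic OCR dataset of 7,219 images was constructed, covering font, color, and noise variations across the three scripts to simulate real-world OCR tasks. Three multimodal large language models (MLLMs) were evaluated on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models failed at OCR on the Latin and Arabic scripts, and none recognized Arabic-script text as Kazakh, misclassifying it as something else.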

🏷️ Related Tags

#Kazakh #OpticalCharacterRecognition #MultimodalModels #SyntheticDataset #LowResourceLanguages

📄 English Summary

KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR

Kazakh is a Turkic language written in Arabic, Cyrillic, and Latin scripts, which poses distinctive challenges for optical character recognition (OCR). Research on OCR for low-resource Kazakh scripts is limited, and no OCR benchmarks or images currently exist for the Arabic and Latin scripts. A synthetic OCR dataset of 7,219 images was constructed, varying font, color, and noise across all three scripts to simulate real OCR tasks. Three multimodal large language models (MLLMs) were evaluated on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All three models failed at OCR on the Latin and Arabic scripts and misclassified Arabic-script Kazakh text as non-Kazakh.
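To make the three-script setting concrete, here is a minimal Python sketch that guesses which script a piece of text is written in by looking at Unicode character names. It is an illustration only, not the benchmark's evaluation code: the function name `dominant_script` and the sample spellings of "Kazakhstan" are assumptions introduced here. Script detection is also a much easier problem than the language identification task described above, since Arabic script alone does not tell you the text is Kazakh.

```python
import unicodedata
from collections import Counter

# The three scripts the summary says are used for Kazakh.
SCRIPTS = ("ARABIC", "CYRILLIC", "LATIN")


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters of `text`.

    Hypothetical helper for illustration: it relies on Unicode character
    names, which begin with the script name for these three scripts.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] += 1
                break
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


# Sample spellings of "Kazakhstan" in each script (illustrative inputs).
for sample in ("Қазақстан", "Qazaqstan", "قازاقستان"):
    print(sample, dominant_script(sample))
```

A check like this could serve as a sanity filter on model output (for example, flagging a transcription that comes back in the wrong script), but it says nothing about whether the recognized characters are correct.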
