KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
📄 Summary
Kazakh is a Turkic language written in three scripts (Arabic, Cyrillic, and Latin), which makes it an unusual case for optical character recognition (OCR). Research on OCR for low-resource Kazakh scripts is scarce, and no OCR benchmarks or image datasets currently exist for the Arabic and Latin scripts. To address this, a synthetic OCR dataset of 7,219 images was constructed, covering all three scripts with variations in font, color, and noise intended to simulate real OCR conditions. Three multimodal large language models (MLLMs), Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct, were evaluated on a subset of the benchmark for both OCR and language identification. All three failed at OCR on the Latin and Arabic scripts, and none recognized Arabic-script text as Kazakh, misclassifying it as something other than Kazakh.
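The summary does not say how OCR output was scored. As an illustration only, the sketch below assumes a character error rate (CER), a common OCR metric computed as the Levenshtein edit distance between the reference text and the model output, divided by the reference length. The function names `levenshtein` and `cer` and the sample strings are hypothetical and are not taken from the benchmark.

```python
import unicodedata


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into the empty string takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r -> h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Assumption: both strings are NFC-normalized first, so that composed
    and decomposed forms of the same character compare as equal.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0
        # only when the hypothesis is also empty.
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: the word "Kazakh" in Latin-script Kazakh, with a
# model output that gets two of the five characters wrong.
print(cer("qazaq", "kazak"))
```

A CER of 0 means a perfect transcription. Values can exceed 1 when the output is much longer than the reference, which is worth keeping in mind when a model hallucinates extra text.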