Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving

📄 Summary

Multiple-choice question answering (MCQA) benchmarks have become a standard for evaluating vision-language model (VLM) performance on driving tasks. However, synthetically generated MCQAs are highly vulnerable to hidden textual cues, allowing models to exploit linguistic patterns instead of visual context. Results indicate that a VLM fine-tuned on such data can, even without any visual input, reach accuracy comparable to its accuracy on human-validated benchmarks. The proposed method decouples the correct answer from linguistic artifacts and applies a curriculum learning strategy, compelling the model to rely on visual information rather than linguistic features; it reduces blind (image-free) accuracy from +66.9% above random chance to +2.9%, eliminating most exploitable textual shortcuts.
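
To make the failure mode concrete, below is a minimal Python sketch of two ideas the summary implies: a "blind accuracy" probe that withholds the image (any accuracy above chance must come from textual cues) and option shuffling, one simple way to decouple the correct answer from positional artifacts. This is an illustrative sketch, not the paper's code: `query_vlm` is a hypothetical stand-in (a random guesser here, so the script runs end to end), and the paper's actual decoupling and curriculum-learning procedure may differ.

```python
# Illustrative sketch only: "blind" MCQA evaluation plus option shuffling.
# `query_vlm` is a hypothetical placeholder, not the paper's pipeline.
import random


def query_vlm(question, options, image=None):
    # Stand-in for a real VLM client; passing image=None simulates blind,
    # text-only input. A random guesser keeps this sketch runnable.
    return random.randrange(len(options))


def shuffle_options(item, rng):
    """Copy an MCQA item with options permuted and the gold index remapped,
    removing any fixed-position cue to the correct answer."""
    perm = list(range(len(item["options"])))
    rng.shuffle(perm)
    return {
        "question": item["question"],
        "options": [item["options"][i] for i in perm],
        "answer": perm.index(item["answer"]),
    }


def blind_accuracy(dataset):
    # Accuracy with the image withheld; above-chance scores indicate
    # exploitable textual shortcuts in the questions or options.
    hits = sum(
        query_vlm(it["question"], it["options"], image=None) == it["answer"]
        for it in dataset
    )
    return hits / len(dataset)


def chance_level(dataset):
    # Expected accuracy of uniform random guessing, averaged over items.
    return sum(1 / len(it["options"]) for it in dataset) / len(dataset)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [
        {
            "question": "What should the ego vehicle do at the red light?",
            "options": ["Stop", "Accelerate", "Turn left", "Honk"],
            "answer": 0,
        }
    ] * 100
    data = [shuffle_options(it, rng) for it in data]
    gap = blind_accuracy(data) - chance_level(data)
    print(f"blind accuracy above chance: {gap:+.1%}")
```

With a real VLM client plugged into `query_vlm`, a gap near zero, like the +2.9% the paper reports after debiasing, suggests the remaining accuracy genuinely depends on the image.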
