How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

📄 English Summary

How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. These systems demonstrate significant advantages in adaptability to diverse question types and flexibility in output formats, yet they also introduce new challenges related to output uncertainty, which arises from the inherently probabilistic nature of LLMs. Output uncertainty is an unavoidable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students' learning processes.
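To make the notion of output uncertainty concrete: because an LLM grader is probabilistic, resampling the same grading prompt can yield different grades, and one simple (generic) uncertainty signal is the entropy of the resampled grade distribution. The sketch below is purely illustrative; the function name and the sample data are hypothetical and are not the specific metrics benchmarked in the paper.

```python
from collections import Counter
import math

def predictive_entropy(grades):
    """Shannon entropy (in bits) of the empirical grade distribution.

    `grades` holds repeated outputs from the same grading prompt.
    Higher entropy means the sampled grades disagree more, i.e. the
    automatic assessment is less certain.
    """
    counts = Counter(grades)
    n = len(grades)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Ten resampled grades for one student answer (hypothetical data).
stable = ["B"] * 10
unstable = ["A", "B", "B", "C", "A", "B", "C", "B", "A", "C"]

print(predictive_entropy(stable))    # 0.0: every sample agrees
print(predictive_entropy(unstable))  # about 1.57 bits: grade flips across samples
```

A downstream system could act on such a score, for example by routing high-entropy cases to a human grader instead of triggering automatic feedback.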
