PaperAudit-Bench: A Benchmark for Error Detection in Research Papers in Automated Peer Review

📄 Abstract

Large language models can generate fluent peer-review comments, but when substantive problems are subtle and scattered across a paper, their assessments often lack sufficient critical rigor. To address this, PaperAudit-Bench is proposed, comprising two core components. The first is PaperAudit-Dataset, a carefully constructed error dataset designed to cover two main categories of errors: localized issues identifiable within a single section of a paper, such as grammatical errors, inconsistent data presentation, or unclear method descriptions; and global issues that require cross-section reasoning to uncover, such as experimental results that contradict the hypotheses stated in the introduction, logical breaks between methodology and conclusions, or contradictory definitions of key concepts across sections. The second is an evaluation framework and metric suite for quantifying and comparing the performance of error-detection models on this dataset.

📄 English Summary

PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review

Large language models (LLMs) can produce fluent peer reviews; however, their assessments often lack critical rigor when substantive issues are subtle and distributed throughout a research paper. To address this limitation, PaperAudit-Bench is introduced, comprising two key components.

First, PaperAudit-Dataset is a meticulously constructed error dataset covering two primary categories of errors: localized issues identifiable within individual sections, such as grammatical errors, inconsistent data presentation, or unclear methodological descriptions; and global issues that require reasoning across sections to detect, such as discrepancies between experimental results and the hypotheses presented in the introduction, logical breaks between methodology and conclusions, or inconsistent definitions of key concepts across sections. The dataset is designed to provide a controlled evaluation environment for automated peer-review systems, enabling precise measurement of performance on error-detection tasks of varying complexity, from simple punctuation mistakes to intricate logical flaws.

Second, PaperAudit-Bench includes a comprehensive set of evaluation metrics and a framework for quantifying and comparing the performance of error-detection models on PaperAudit-Dataset. The framework goes beyond accuracy and recall, measuring models' ability to identify error types, pinpoint their locations, and provide constructive feedback. With PaperAudit-Bench, researchers can systematically evaluate and improve automated peer-review tools, driving them toward greater criticality and rigor and, ultimately, toward higher-quality and more efficient academic paper review.
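The summary's layered scoring (did the model flag an error at all, did it name the right type, did it point to the right place) can be made concrete with a short sketch. The Python below is a minimal illustration, not the benchmark's actual implementation: the record fields (`paper_id`, `error_type`, `scope`, `sections`) and the matching rules are assumptions, since the summary does not specify PaperAudit-Dataset's schema or the exact metric definitions.

```python
from dataclasses import dataclass

# Hypothetical record formats: the summary does not publish the actual
# PaperAudit-Dataset schema, so every field name here is an assumption.
@dataclass(frozen=True)
class ErrorInstance:
    paper_id: str
    error_type: str            # e.g. "grammar", "data_inconsistency", "hypothesis_mismatch"
    scope: str                 # "local" (single section) or "global" (cross-section)
    sections: tuple[str, ...]  # section(s) the error spans, e.g. ("introduction", "results")

@dataclass(frozen=True)
class Prediction:
    paper_id: str
    error_type: str
    sections: tuple[str, ...]

def evaluate(gold: list[ErrorInstance], preds: list[Prediction]) -> dict[str, float]:
    """Score predictions at three granularities, mirroring the summary's
    description: detection, error-type identification, and localization."""
    detected = typed = located = 0
    for g in gold:
        matches = [p for p in preds if p.paper_id == g.paper_id]
        if not matches:
            continue
        detected += 1  # some flaw was flagged in the right paper
        if any(p.error_type == g.error_type for p in matches):
            typed += 1  # the category of the flaw was named correctly
        if any(p.error_type == g.error_type and set(p.sections) & set(g.sections)
               for p in matches):
            located += 1  # the flaw was also pinned to an affected section
    n = max(len(gold), 1)
    return {
        "detection_recall": detected / n,
        "type_recall": typed / n,
        "location_recall": located / n,
    }

# Toy usage: one cross-section gold error, one correct prediction.
gold = [ErrorInstance("p1", "hypothesis_mismatch", "global", ("introduction", "results"))]
preds = [Prediction("p1", "hypothesis_mismatch", ("results",))]
print(evaluate(gold, preds))  # all three recalls are 1.0
```

Keeping detection, type, and location as separate recalls reflects the claim that the framework goes beyond plain accuracy and recall: a model may flag the right paper yet misname the error or point at the wrong section, and a single aggregate score would hide that distinction.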

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others