📄 English Summary
RIFT: Reordered Instruction Following Testbed to Evaluate Instruction Following in Singular Multistep Prompt Structures
Large Language Models (LLMs) are increasingly relied upon for complex workflows, yet their ability to follow a structured sequence of instructions faithfully remains underexplored. Existing benchmarks conflate task complexity with structural ordering, making it difficult to isolate the impact of prompt topology on performance. RIFT, the Reordered Instruction Following Testbed, is introduced to assess instruction following by disentangling structure from content. By reordering and rephrasing instructions to create sequence variants, RIFT systematically investigates how instruction order, inter-instruction dependencies, and instruction complexity affect LLM execution. The testbed comprises multi-step instruction scenarios spanning data processing, logical reasoning, and text generation, in which the order of instructions can be adjusted independently while their semantic content is preserved. Through RIFT, researchers can precisely identify the strengths and limitations of LLMs when handling non-standard instruction orderings, nested instruction structures, or instructions with cyclic dependencies. The testbed offers a standardized methodology for quantifying how robustly different LLM architectures and training strategies cope with complex instruction structures. RIFT's evaluation results help guide further LLM development, enabling models to better comprehend and execute intricate human-provided instruction sequences that may not follow strict logical order, ultimately improving the reliability and efficiency of LLMs in practical applications.
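The core mechanism described above, creating reordered variants of a multi-step prompt while tracking which orderings respect inter-instruction dependencies, can be sketched as follows. This is a minimal illustration, not RIFT's actual implementation; all function names, the example steps, and the dependency encoding are hypothetical.

```python
# Hypothetical sketch of RIFT-style variant generation: given a list of
# instruction steps and a dependency map among them, enumerate every
# reordering and label each as dependency-respecting or not.
# All identifiers here are illustrative, not from the RIFT testbed itself.
from itertools import permutations

def respects_dependencies(order, deps):
    """Return True if every step appears after all of its prerequisites.

    order: tuple of step indices, e.g. (0, 2, 1)
    deps:  dict mapping step index -> set of prerequisite step indices
    """
    position = {step: i for i, step in enumerate(order)}
    return all(position[pre] < position[step]
               for step, pres in deps.items()
               for pre in pres)

def generate_variants(steps, deps):
    """Yield (order, is_valid) for every permutation of the steps."""
    for order in permutations(range(len(steps))):
        yield order, respects_dependencies(order, deps)

# Toy scenario: step 2 ("summarize results") depends on steps 0 and 1,
# so only orderings that place it last preserve the dependency structure.
steps = ["load the data", "filter invalid rows", "summarize results"]
deps = {2: {0, 1}}

variants = list(generate_variants(steps, deps))
valid = [order for order, ok in variants if ok]
# valid contains (0, 1, 2) and (1, 0, 2); the other four permutations
# are the "non-standard" orderings a model would be tested on.
```

Invalid orderings are not discarded: they are exactly the perturbed variants used to probe whether a model can recover the intended execution order from semantics alone.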