📄 English Summary
MASEval: Extending Multi-Agent Evaluation from Models to Systems
The rapid adoption of LLM-based agentic systems has produced a diverse ecosystem of frameworks, including smolagents, LangGraph, AutoGen, CAMEL, and LlamaIndex. However, existing benchmarks are model-centric: they fix the agentic setup and never compare the other system components. Implementation decisions, such as choices of topology, orchestration logic, and error handling, significantly impact performance. MASEval addresses this evaluation gap by providing a framework-agnostic library that treats the entire system, not just the model, as the unit of analysis. A systematic comparison across three benchmarks, three models, and three frameworks shows that the choice of framework matters as much for performance as the choice of model. MASEval enables researchers to explore the influence of every system component.
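As a rough illustration of the experimental design described above (not MASEval's actual API, which the summary does not expose), a system-level comparison sweeps the full Cartesian product of benchmarks, models, and frameworks, scoring each complete system rather than each model in isolation. All names below are placeholders:

```python
from itertools import product

# Placeholder axes; the summary does not name the specific benchmarks or
# models used, and only lists candidate frameworks.
BENCHMARKS = ["benchmark_1", "benchmark_2", "benchmark_3"]
MODELS = ["model_1", "model_2", "model_3"]
FRAMEWORKS = ["framework_1", "framework_2", "framework_3"]


def run_system(benchmark: str, model: str, framework: str) -> float:
    """Stand-in for evaluating one complete agentic system.

    A real harness would build the agent in the given framework, back it
    with the given model, execute the benchmark, and return a score.
    """
    return 0.0  # hypothetical placeholder score


# The system (framework + model + orchestration), not the model alone,
# is the unit of analysis: evaluate every combination, 3 x 3 x 3 = 27 runs.
results = {
    (b, m, f): run_system(b, m, f)
    for b, m, f in product(BENCHMARKS, MODELS, FRAMEWORKS)
}
print(len(results))  # 27 system configurations
```

Keying results by the full (benchmark, model, framework) triple is what lets one hold any two axes fixed and isolate the effect of the third, e.g. comparing frameworks under a fixed model and benchmark.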