📄 English Summary
MASEval: Extending Multi-Agent Evaluation from Models to Systems
The rapid adoption of LLM-based agentic systems has produced a diverse ecosystem of frameworks, including smolagents, LangGraph, AutoGen, CAMEL, and LlamaIndex. However, existing benchmarks are model-centric: they fix the agentic setup and never compare the other system components. Implementation decisions, such as choices of topology, orchestration logic, and error handling, significantly impact performance. MASEval addresses this evaluation gap by providing a framework-agnostic library that treats the entire system, not just the model, as the unit of analysis. A systematic comparison across three benchmarks, three models, and three frameworks shows that the choice of framework matters as much for performance as the choice of model. MASEval enables researchers to explore the influence of every system component.
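As a rough illustration of the experimental design described above (not MASEval's actual API, which the summary does not expose), a system-level comparison sweeps the full Cartesian product of benchmarks, models, and frameworks, scoring each complete system rather than each model in isolation. All names below are placeholders:

```python
from itertools import product

# Placeholder axes; the summary does not name the specific benchmarks or
# models used, and only lists candidate frameworks.
BENCHMARKS = ["benchmark_1", "benchmark_2", "benchmark_3"]
MODELS = ["model_1", "model_2", "model_3"]
FRAMEWORKS = ["framework_1", "framework_2", "framework_3"]


def run_system(benchmark: str, model: str, framework: str) -> float:
    """Stand-in for evaluating one complete agentic system.

    A real harness would build the agent in the given framework, back it
    with the given model, execute the benchmark, and return a score.
    """
    return 0.0  # hypothetical placeholder score


# The system (framework + model + orchestration), not the model alone,
# is the unit of analysis: evaluate every combination, 3 x 3 x 3 = 27 runs.
results = {
    (b, m, f): run_system(b, m, f)
    for b, m, f in product(BENCHMARKS, MODELS, FRAMEWORKS)
}
print(len(results))  # 27 system configurations
```

Keying results by the full (benchmark, model, framework) triple is what lets one hold any two axes fixed and isolate the effect of the third, e.g. comparing frameworks under a fixed model and benchmark.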