TraderBench: How Robust Are AI Agents in Adversarial Capital Markets?

📄 Summary

TraderBench is a newly proposed benchmark designed to address two major challenges in evaluating AI agents in finance. Static benchmarks require costly expert annotation and fail to capture the dynamic decision-making central to real-world trading, while LLM-based judges introduce uncontrolled variance on domain-specific tasks. TraderBench combines expert-verified static tasks (knowledge retrieval, analytical reasoning) with adversarial trading simulations scored purely on realized performance metrics such as Sharpe ratio, returns, and drawdown, thereby eliminating judge variance. The framework features two novel tracks: crypto trading under four progressive market-manipulation transformations, and options derivatives, scored on P&L accuracy, Greeks, and risk management.
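To make the realized-performance metrics concrete, the following is a minimal sketch of how total return, Sharpe ratio, and maximum drawdown might be computed from a simulated equity curve. The summary does not give TraderBench's exact definitions, so the function name `performance_metrics`, the zero risk-free rate, and the `periods_per_year` annualization factor are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def performance_metrics(equity, periods_per_year=365):
    """Return total return, annualized Sharpe ratio, and max drawdown.

    `equity` is a list of portfolio values, one per period.
    Assumptions (not taken from the benchmark): risk-free rate of zero,
    simple per-period returns, sample standard deviation.
    """
    if len(equity) < 3:
        raise ValueError("need at least three equity points")

    # Simple return for each period relative to the previous value.
    returns = [equity[i] / equity[i - 1] - 1.0 for i in range(1, len(equity))]

    total_return = equity[-1] / equity[0] - 1.0

    # Sharpe ratio: mean return over volatility, scaled to a yearly figure.
    vol = stdev(returns)
    sharpe = 0.0 if vol == 0 else mean(returns) / vol * math.sqrt(periods_per_year)

    # Max drawdown: largest fractional drop from any running peak.
    peak = equity[0]
    max_drawdown = 0.0
    for value in equity:
        peak = max(peak, value)
        max_drawdown = max(max_drawdown, (peak - value) / peak)

    return {
        "total_return": total_return,
        "sharpe": sharpe,
        "max_drawdown": max_drawdown,
    }


# Hypothetical equity curve for illustration only.
metrics = performance_metrics([100.0, 110.0, 99.0, 120.0])
print(metrics)
```

Because all three numbers are deterministic functions of the equity curve, two runs that produce the same trades receive the same score, which is the property that removes judge variance from the trading tracks.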
