📄 English Summary
Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain
Large language models (LLMs) make it possible to build systems that improve through self-evolving loops, yet many existing proposals are better characterized as self-play and often plateau quickly. A central failure mode is synthesizing more data without increasing the learnable information available to subsequent iterations. Experiments on a self-play coding task show that sustainable self-evolution requires a self-synthesized data pipeline that increases learnable information across iterations. The study identifies three roles played by self-evolving LLMs: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals. It also identifies three system designs that jointly target learnable-information gain.
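The Proposer/Solver/Verifier division of labor described above can be sketched as a minimal loop. This is a hypothetical illustration, not the paper's implementation: the function names, the toy arithmetic tasks, and the reward scheme are all assumptions standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    prompt: str
    reference: int  # ground truth the Verifier checks against

def proposer(round_idx: int) -> Task:
    # Proposer: synthesizes a new task each iteration. A toy arithmetic
    # problem stands in for task generation by an LLM.
    return Task(prompt=f"{round_idx} + {round_idx}", reference=2 * round_idx)

def solver(task: Task) -> int:
    # Solver: attempts the task (a real system would query an LLM here).
    a, b = task.prompt.split(" + ")
    return int(a) + int(b)

def verifier(task: Task, answer: int) -> float:
    # Verifier: converts the attempt into a scalar training signal.
    return 1.0 if answer == task.reference else 0.0

def self_play(rounds: int) -> List[float]:
    # One self-play loop: each iteration chains the three roles.
    rewards = []
    for i in range(rounds):
        task = proposer(i)
        answer = solver(task)
        rewards.append(verifier(task, answer))
    return rewards

rewards = self_play(3)
```

The paper's point is that such a loop only keeps improving if the Proposer's tasks add learnable information each round; here the Solver trivially solves everything, so the reward signal saturates immediately, which is exactly the plateau the abstract warns about.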