📄 English Summary
From Garbage to Gold: A Data-Architectural Theory of Predictive Robustness
This research examines a paradox in tabular machine learning: modern models achieve state-of-the-art performance on high-dimensional, collinear, and error-prone data, challenging the 'Garbage In, Garbage Out' principle. Predictive robustness is shown to arise not merely from data cleanliness but from the synergy between data architecture and model capacity. By partitioning the 'noise' in predictor space into 'Predictor Error' and 'Structural Uncertainty', it is demonstrated that leveraging a high-dimensional set of error-prone predictors asymptotically overcomes both types of noise, while cleaning a low-dimensional predictor set cannot achieve the same effect.
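The asymptotic claim can be illustrated with a toy simulation (not taken from the paper; the proxy model, noise levels, and function names here are illustrative assumptions): if each of `p` predictors is an independently corrupted copy of the same latent signal, their average washes out Predictor Error as `p` grows, so many "dirty" predictors can outperform a single clean one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000                              # number of samples
z = rng.normal(size=n)                # latent signal driving the outcome
y = z + 0.5 * rng.normal(size=n)      # outcome = signal + irreducible noise

def proxy_corr(p, noise_sd=1.0):
    """Correlation between y and the mean of p error-prone proxies of z.

    Each proxy is z plus independent measurement error; averaging p of
    them shrinks the error variance by a factor of p.
    """
    X = z[:, None] + noise_sd * rng.normal(size=(n, p))
    return float(np.corrcoef(X.mean(axis=1), y)[0, 1])

for p in (1, 10, 100, 1000):
    print(f"p={p:4d}  corr={proxy_corr(p):.3f}")
```

Under these assumptions the printed correlation rises monotonically with `p`, approaching the noise-free ceiling set by the outcome's own irreducible noise; a single "cleaned" predictor can never exceed that same ceiling, but the averaged high-dimensional set reaches it without any cleaning.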