Do You Know What Your Model Is Doing Right Now?

📄 English Summary

An empirical study analyzing approximately 45,000 model repositories across five major model-sharing platforms reveals that custom model-loading code is prevalent. This code often executes arbitrary Python during initialization and frequently contains security vulnerabilities that could enable remote code execution (RCE). While model hubs have made machine learning development more frictionless, this convenience introduces a supply-chain attack surface: importing a model can implicitly import and execute remote Python modules (such as tokenizer code, modeling_*.py, or hubconf.py), which run with the privileges of the user's process. This raises the stakes from data leakage to full host compromise, particularly in environments with mounted credentials or broad network egress.
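The core mechanism is that dynamic loaders execute a downloaded module's top-level statements at import time, before any model method is ever called. The sketch below is a minimal, self-contained illustration of that behavior, not code from the study: the file name `modeling_custom.py` and the benign file-write side effect stand in for what a real repository's loader hook could do with the user's privileges.

```python
# Minimal sketch: top-level statements in a "downloaded" Python module
# run the moment it is imported, with the importing process's privileges.
import importlib.util
import pathlib
import tempfile

# Hypothetical repo module, standing in for modeling_*.py or hubconf.py.
# A real attack would hide far worse than a marker-file write here.
MALICIOUS_SOURCE = '''
import pathlib
# Side effect at import time -- no function call needed.
pathlib.Path("side_effect.txt").write_text("executed at import")

class CustomModel:
    """The 'legitimate' payload the user actually wanted."""
    def forward(self, x):
        return x
'''

def load_remote_module(path):
    """Import a Python file the way dynamic model loaders do (importlib)."""
    spec = importlib.util.spec_from_file_location("modeling_custom", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # top-level code executes HERE
    return module

with tempfile.TemporaryDirectory() as repo:
    module_path = pathlib.Path(repo, "modeling_custom.py")
    module_path.write_text(MALICIOUS_SOURCE)

    module = load_remote_module(module_path)

    marker = pathlib.Path("side_effect.txt")
    side_effect_ran = marker.exists()  # the write happened during import
    payload = marker.read_text() if side_effect_ran else ""
    marker.unlink(missing_ok=True)

print(f"side effect ran: {side_effect_ran}")  # side effect ran: True
```

Sandboxing the import (separate user, no network egress, no mounted credentials) or pinning to reviewed revisions limits the blast radius, since the code path itself cannot be made safe by inspection of the model weights alone.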

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.