MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

📄 Summary

Using sensitive patient data for machine learning raises privacy concerns. Datasets annotated with personally identifiable information (PII) are essential for developing and testing anonymization systems, which in turn enable safe data sharing in compliance with privacy regulations. Because access to real patient data is a bottleneck, synthetic data offers an efficient way to address data scarcity without being subject to the privacy regulations that apply to real data. In addition, neural machine translation can help create high-quality data for low-resource languages by translating validated real or synthetic data from a high-resource language. This study creates a multilingual anonymization benchmark covering ten languages, using a machine translation approach to generate the data.
