Crawling Gulf Cooperation Council Government Documents: The Obstacles I Ran Into

📄 English Summary

Crawling GCC Government Documents: What Blocked Me

Building GCC LexAI required ingesting AI regulation documents from government websites in the UAE and Saudi Arabia. The tech stack worked well, but the websites did not always cooperate. Saudi government sites appear to block non-Saudi traffic outright: requests end in connection timeouts rather than 403 responses or redirects. Attempts from Japan, Malaysia, and the US all gave the same result, which suggests the sites drop non-Saudi IP ranges at the network level. Moving the crawler to another location outside Saudi Arabia does not help, and while routing through proxies in GCC countries is a fix in theory, it is hard to put into practice.
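
The difference between a timeout and a 403 matters because it shows where the block sits: a 403 means the server answered and refused, while a timeout means no answer came back at all. A minimal sketch of how a crawler might tell these cases apart, using only the Python standard library; the function names `classify_failure` and `probe` are hypothetical and are not taken from the GCC LexAI codebase:

```python
import socket
import urllib.error
import urllib.request


def classify_failure(exc):
    """Map an exception raised by urlopen to a coarse label (illustrative)."""
    # HTTPError subclasses URLError, so it has to be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # The server replied, so any block is at the application layer.
        return "http_forbidden" if exc.code == 403 else "http_error"
    # Connect-phase socket errors arrive wrapped as URLError(reason=...);
    # read-phase timeouts can be raised bare, so fall back to exc itself.
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, (socket.timeout, TimeoutError)):
        # No reply at all: consistent with packets dropped at the network level.
        return "timeout"
    if isinstance(reason, ConnectionRefusedError):
        return "refused"
    return "other"


def probe(url, timeout=10.0):
    """Fetch a URL once and report what kind of outcome was observed."""
    try:
        # urlopen follows redirects by default, so a redirect shows up as "ok_...".
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"ok_{resp.status}"
    except Exception as exc:
        return classify_failure(exc)
```

Running `probe` against the same URL from several vantage points and getting `"timeout"` every time would match the pattern described above, though a timeout alone cannot prove that geo-blocking is the cause.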
