How to Convert Any Webpage to Clean Markdown for AI Workflows

📄 English Summary

How to Convert Any Webpage to Clean Markdown for AI Workflows

Feeding web content to an AI model often runs into formatting problems, and pasting raw HTML directly gives inconsistent output quality. In most pages, the content that actually matters makes up only a small fraction of the HTML, so a large share of tokens is wasted. In tests comparing raw HTML with clean Markdown, the share of wasted tokens was 86% for a news article, 74% for a React documentation page, and 84% for a Reddit thread. Markdown is the better choice because it is clearly structured and free of noise while still preserving headings, lists, and code blocks.
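
As a rough illustration of the comparison described above, the sketch below strips a page down to headings, paragraphs, list items, and code blocks, then estimates how much of the raw HTML was overhead. Everything in it is an assumption made for illustration: the helper names (`html_to_markdown`, `estimate_tokens`, `waste_rate`), the list of tags treated as noise, and the regex-based token count, which is only a crude stand-in for a real model tokenizer. It is not the method used to produce the figures quoted here, and real pages will need a more complete converter.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are treated as noise (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable


class MarkdownExtractor(HTMLParser):
    """Tiny HTML-to-Markdown converter: headings, paragraphs, lists, code blocks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # output fragments, joined at the end
        self.skip_depth = 0   # > 0 while inside a noise element
        self.in_pre = False   # inside <pre>: keep whitespace verbatim

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in HEADINGS:
            self.parts.append("\n\n" + HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "pre":
            self.in_pre = True
            self.parts.append("\n\n" + FENCE + "\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth:
            return
        elif tag == "pre":
            self.in_pre = False
            self.parts.append("\n" + FENCE + "\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.in_pre:
            self.parts.append(data)
            return
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if text.strip():
            self.parts.append(text)


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    markdown = "".join(parser.parts)
    return re.sub(r"\n{3,}", "\n\n", markdown).strip()


def estimate_tokens(text: str) -> int:
    # Crude proxy: each word run and each punctuation mark counts as one token.
    return len(re.findall(r"\w+|[^\w\s]", text))


def waste_rate(html: str, markdown: str) -> float:
    """Fraction of the raw HTML's estimated tokens that the Markdown does not need."""
    raw = estimate_tokens(html)
    return 1 - estimate_tokens(markdown) / raw if raw else 0.0


if __name__ == "__main__":
    page = "<nav>Menu</nav><h1>Title</h1><p>Some text.</p><script>var x = 1;</script>"
    md = html_to_markdown(page)
    print(md)
    print(f"estimated waste: {waste_rate(page, md):.0%}")
```

The waste rate here is computed as one minus the ratio of Markdown tokens to raw HTML tokens, which is one plausible reading of the percentages above. Swapping `estimate_tokens` for the tokenizer of the model you actually use would give numbers that are more comparable to real usage.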
