📄 English Summary
Language Model Representations for Efficient Few-Shot Tabular Classification
The web serves as a rich source of structured data in the form of tables, including product catalogs, knowledge bases, and scientific datasets. However, the heterogeneity in the structure and semantics of these tables poses significant challenges in developing a unified method to effectively leverage the information they contain. Large language models (LLMs) are increasingly integral to web infrastructure for tasks such as semantic search. This raises a critical question: can we utilize these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals) without the need for specialized models or extensive retraining? This study investigates that question and proposes a lightweight approach: by reusing the representations of existing LLMs, it achieves efficient table classification from only a few labeled examples.
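The summary above only names the general recipe, so the following is a minimal sketch of what such a pipeline typically looks like, under assumptions that are not stated in the source: serialize each table row to text, embed it with an already-deployed encoder (stubbed here by a deterministic character-trigram hashing embedder so the sketch runs offline), and fit a lightweight nearest-centroid classifier on a handful of labeled rows. The row data, column names, and class labels below are toy examples, not from the paper.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for an LLM embedding call (e.g. a deployed semantic-search
    encoder). Hashes character trigrams into a fixed-size unit vector so
    the sketch runs offline; a real system would call the model instead."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def serialize_row(row: dict) -> str:
    """Flatten a table row into 'column: value' text for the encoder."""
    return "; ".join(f"{k}: {v}" for k, v in row.items())

class NearestCentroid:
    """Few-shot classifier: average the embeddings of the labeled rows
    per class, then assign new rows to the most similar class centroid."""

    def fit(self, rows, labels):
        sums, counts = {}, {}
        for row, y in zip(rows, labels):
            e = embed(serialize_row(row))
            acc = sums.setdefault(y, [0.0] * len(e))
            for j, v in enumerate(e):
                acc[j] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {
            y: [v / counts[y] for v in acc] for y, acc in sums.items()
        }
        return self

    def predict(self, row):
        e = embed(serialize_row(row))
        # Pick the class whose centroid has the highest dot product.
        return max(self.centroids,
                   key=lambda y: sum(a * b for a, b in zip(e, self.centroids[y])))

# A few labeled product-catalog rows (toy data).
train = [
    ({"title": "USB-C charging cable", "price": "9.99"}, "electronics"),
    ({"title": "HDMI to DVI adapter", "price": "12.50"}, "electronics"),
    ({"title": "Cotton crew-neck t-shirt", "price": "14.00"}, "apparel"),
    ({"title": "Wool winter scarf", "price": "19.95"}, "apparel"),
]
clf = NearestCentroid().fit([r for r, _ in train], [y for _, y in train])
print(clf.predict({"title": "USB-C to HDMI adapter cable", "price": "15.00"}))
```

The design point the abstract hints at is that only the tiny centroid table is learned per task; the expensive representation model is shared, already-deployed infrastructure, so no retraining is needed.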