Building a Local-First RAG Research Tool with Nemotron + vLLM + Tool Calling

📄 Chinese Summary

Built a local-first RAG research tool that runs entirely on a single GPU. The tool combines tool calling with RAG, which took some exploration. The stack: the Nemotron Nano 9B v2 Japanese model served on vLLM (FP16, RTX 5090), with a backend of FastAPI, SQLite FTS5, and Jinja2, all integrated into a single app.py file; NVIDIA's official parser plugins handle tool calling and reasoning. Given a question, the system first extracts bilingual keywords (English and Japanese) via the LLM, then runs an FTS5 search over local sources plus a DuckDuckGo web search.
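The serving setup described above (Nemotron on vLLM with NVIDIA's parser plugins) might be launched roughly as follows. The flags are standard vLLM options; the plugin path and the registered parser name `nemotron` are placeholders, not confirmed details from the source:

```shell
# Sketch: serve Nemotron Nano 9B v2 on vLLM in FP16 with tool calling
# and reasoning parsing enabled.
# Assumptions: the plugin file path and the parser name "nemotron" are
# placeholders for NVIDIA's official parser plugins.
vllm serve nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
  --dtype float16 \
  --enable-auto-tool-choice \
  --tool-parser-plugin /path/to/nemotron_toolcall_parser.py \
  --tool-call-parser nemotron \
  --reasoning-parser nemotron
```

With FP16 weights, the 9B model fits comfortably in the 32 GB of an RTX 5090.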

📄 English Summary

Built a Local-First RAG Research Tool with Nemotron + vLLM + Tool Calling

A local-first RAG research tool has been built to run entirely on a single GPU. Combining tool calling with RAG required some exploration. The tech stack includes the Nemotron Nano 9B v2 Japanese model on vLLM (FP16, RTX 5090), with the backend built on FastAPI, SQLite FTS5, and Jinja2, all integrated into a single app.py file. NVIDIA's official parser plugins are used for tool calling and reasoning. When a question is asked, the system first extracts bilingual keywords (EN+JA) via the LLM, then performs an FTS5 search over local sources plus a DuckDuckGo web search.
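The local retrieval step can be sketched as follows: the bilingual keywords extracted by the LLM are OR-ed into a single SQLite FTS5 `MATCH` query and ranked with the built-in `bm25()` function. The table schema, helper name, and sample documents are illustrative assumptions, not the actual app.py code:

```python
# Minimal sketch of FTS5 keyword retrieval (assumes Python's sqlite3 is
# built with the FTS5 extension, as modern builds typically are).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [
        ("Attention survey", "transformer attention mechanisms"),
        ("RAG overview", "retrieval augmented generation with a local index"),
    ],
)

def fts5_search(keywords: list[str], limit: int = 5) -> list[tuple[str, float]]:
    """OR the quoted keywords into one MATCH query; bm25 is lower-is-better."""
    query = " OR ".join(f'"{k}"' for k in keywords)
    return conn.execute(
        "SELECT title, bm25(docs) FROM docs WHERE docs MATCH ? "
        "ORDER BY bm25(docs) LIMIT ?",
        (query, limit),
    ).fetchall()

# These keywords stand in for what the LLM would extract from a question.
hits = fts5_search(["retrieval", "検索"])
print(hits[0][0])  # → RAG overview
```

Quoting each keyword keeps non-ASCII terms like 検索 safe inside FTS5 query syntax, and OR-ing them lets a hit on either language's keyword surface the document.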

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, etc.