Taalas serves Llama 3.1 8B at 17,000 tokens/second

📄 English Summary

Taalas, a new Canadian hardware startup, has announced its first product: a custom hardware implementation of the Llama 3.1 8B model that runs at 17,000 tokens per second. The company calls the hardware 'Silicon Llama' and describes it as aggressively quantized, combining 3-bit and 6-bit parameters. The next generation is expected to use 4-bit parameters, which suggests a long lead time for bringing new models to its hardware. The technology can be tried at chatjimmy.ai; output appears so quickly that a video of the demo looks more like a screenshot.
