Circuit Breaker for LLM Provider Failure

Source: Circuit breaker for LLM provider failure

Published: March 23, 2026

📄 English Summary

Every application powered by large language models (LLMs) relies on external providers such as OpenAI, Anthropic, Google, or self-hosted models. These providers can suffer outages, causing failed requests, rate-limit errors, and ballooning latency. Without a circuit breaker, an application keeps sending requests to a non-responsive API, wasting budget, accumulating timeouts, and degrading the user experience. A circuit breaker detects that the downstream service is failing and stops sending requests for a cooldown period. The point is not to retry harder; it is to fail fast and deliberately, protecting the rest of the system. Backing the breaker's failure state with Redis enables rapid load shedding and automatic recovery, and keeps that state consistent across process restarts.
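To make the mechanism concrete, here is a minimal sketch of the classic three-state breaker (closed → open → half-open) described above. This is an in-process illustration, not the article's actual implementation: the class name, thresholds, and injectable clock are assumptions for the example, and the article stores the equivalent state in Redis so it is shared across workers and survives restarts.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow normally
    OPEN = "open"            # failing fast during the cooldown window
    HALF_OPEN = "half_open"  # cooldown elapsed; allow one probe request


class CircuitBreaker:
    """Illustrative in-process circuit breaker. A production version
    would keep failure_count/opened_at in Redis (e.g. INCR + EXPIRE)
    so every worker sees the same state."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable for testing
        self.state = State.CLOSED
        self.failure_count = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Call before hitting the provider; False means fail fast."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                self.state = State.HALF_OPEN  # try a single probe
                return True
            return False  # shed load: do not touch the provider
        return True

    def record_success(self) -> None:
        """A successful call (including the probe) closes the circuit."""
        self.state = State.CLOSED
        self.failure_count = 0

    def record_failure(self) -> None:
        """A failed probe, or too many consecutive failures, opens it."""
        self.failure_count += 1
        if (self.state is State.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = State.OPEN
            self.opened_at = self.clock()
```

In a Redis-backed variant, `failure_count` maps naturally onto `INCR` with an `EXPIRE`, and the open state onto a key written with a TTL equal to the cooldown, so recovery happens automatically when the key expires.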

Powered by Cloudflare Workers + Payload CMS + Claude 3.5

Data sources: OpenAI, Google AI, DeepMind, AWS ML Blog, HuggingFace, and others