# 33.2 Deploying a LiteLLM Gateway
## 33.2.1 LiteLLM Overview
LiteLLM is an open-source LLM gateway that supports more than 100 LLM providers, including Anthropic, OpenAI, and Cohere. It exposes a single, unified API, which simplifies using and managing multiple providers.
### Core Features
- **Multi-provider support**: works with 100+ LLM providers
- **Unified API**: one consistent interface, simplifying integration
- **Smart caching**: built-in response caching to reduce cost and latency
- **Rate limiting**: configurable limits to control usage
- **Cost tracking**: detailed usage and cost analytics
- **Load balancing**: distributes requests across multiple API keys
- **Retry on failure**: automatically retries failed requests
- **Streaming**: supports streamed responses
### Architecture
```
┌─────────────────────────────────────────┐
│           Claude Code client            │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│              LiteLLM Proxy              │
│   ┌──────────────────────────────┐      │
│   │          API layer           │      │
│   │   (Anthropic, OpenAI, ...)   │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │          Cache layer         │      │
│   │      (Redis, Memcached)      │      │
│   └──────────────────────────────┘      │
│   ┌──────────────────────────────┐      │
│   │       Monitoring layer       │      │
│   │    (Prometheus, Grafana)     │      │
│   └──────────────────────────────┘      │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│             LLM providers               │
│   (Anthropic, OpenAI, Cohere, ...)      │
└─────────────────────────────────────────┘
```
## 33.2.2 Installation and Configuration
### 1. Installing LiteLLM
#### Installing with Docker (recommended)
```bash
# Pull the LiteLLM image
docker pull litellm/litellm:latest

# Create a config directory
mkdir -p ~/litellm/config
cd ~/litellm

# Create the config file
cat > config.yaml << EOF
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  set_verbose: true

general_settings:
  master_key: sk-litellm-master-key-123456
  database_url: postgresql://user:password@localhost:5432/litellm

security_settings:
  valid_api_keys:
    - sk-team-a-key-123
    - sk-team-b-key-456
EOF

# Start LiteLLM
docker run -d \
  --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  litellm/litellm:latest
```

#### Installing with Python
```bash
# Install LiteLLM with the proxy extras
pip install 'litellm[proxy]'

# Initialize a config
litellm init

# Edit the config file
nano litellm_config.yaml

# Start the proxy server
litellm --config litellm_config.yaml --port 4000
```
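A typo in `litellm_config.yaml` is a common cause of startup failures. As a pre-flight sketch (a hypothetical helper, not part of LiteLLM), the parsed config can be checked for the structure used in the config above — each `model_list` entry needs a `model_name` and a `litellm_params.model`:

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of structural problems found in a parsed LiteLLM config."""
    problems = []
    models = config.get("model_list")
    if not isinstance(models, list) or not models:
        problems.append("model_list must be a non-empty list")
        return problems
    for i, entry in enumerate(models):
        if "model_name" not in entry:
            problems.append(f"model_list[{i}]: missing model_name")
        params = entry.get("litellm_params")
        if not isinstance(params, dict) or "model" not in params:
            problems.append(f"model_list[{i}]: litellm_params.model is required")
    return problems

config = {
    "model_list": [
        {"model_name": "claude-sonnet-4",
         "litellm_params": {"model": "claude-sonnet-4-20250514",
                            "api_key": "os.environ/ANTHROPIC_API_KEY"}},
        {"model_name": "broken-entry"},  # missing litellm_params
    ]
}
print(validate_config(config))  # reports one problem for the broken entry
```

Running the same check against a YAML file only requires loading it with a YAML parser first and passing the resulting dict in.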
### 2. Configuration File Reference
```yaml
# litellm_config.yaml

# Model list
model_list:
  # Anthropic Claude models
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      api_base: https://api.anthropic.com
      max_tokens: 4096
      temperature: 0.7
  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096
  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096

  # Amazon Bedrock models
  - model_name: bedrock-claude-sonnet
    litellm_params:
      model: anthropic.claude-sonnet-4-5-20250929-v1:0
      api_base: https://bedrock-runtime.us-east-1.amazonaws.com
      api_key: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Google Vertex AI models
  - model_name: vertex-claude-sonnet
    litellm_params:
      model: claude-sonnet-4-5@20250929
      api_base: https://us-central1-aiplatform.googleapis.com
      api_key: os.environ/GOOGLE_APPLICATION_CREDENTIALS
      vertex_project: os.environ/VERTEX_PROJECT_ID
      vertex_location: us-central1

# LiteLLM settings
litellm_settings:
  drop_params: true      # drop parameters the target provider does not support
  set_verbose: true      # verbose logging
  json_logs: true        # JSON-formatted logs
  success_callback: http://localhost:5000/callback   # success callback
  failure_callback: http://localhost:5000/failure    # failure callback

# General settings
general_settings:
  master_key: sk-litellm-master-key-123456                         # master key
  database_url: postgresql://user:password@localhost:5432/litellm  # database URL
  cache: redis://localhost:6379                                    # Redis cache
  cache_seconds: 3600                                              # cache TTL (seconds)

# Security settings
security_settings:
  valid_api_keys:              # accepted API keys
    - sk-team-a-key-123
    - sk-team-b-key-456
    - sk-team-c-key-789
  max_budget: 1000.0           # maximum budget (USD)
  budget_duration: monthly     # budget period
  rpm_limit: 100               # requests per minute
  tpm_limit: 10000             # tokens per minute

# Load-balancing settings
load_balancing_settings:
  routing_strategy: usage-based  # usage-based, round-robin, or least-latency
  health_check: true             # enable health checks
  health_check_interval: 60      # health-check interval (seconds)

# Monitoring settings
monitoring_settings:
  enable_prometheus: true        # enable Prometheus
  prometheus_port: 9090          # Prometheus port
  enable_slack_alerts: true      # enable Slack alerts
  slack_webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
  alert_thresholds:
    error_rate: 0.05             # error-rate threshold
    latency_p99: 5000            # P99 latency threshold (ms)
```

## 33.2.3 Advanced Configuration
### 1. Cache Configuration
```yaml
# Cache settings
cache_settings:
  type: redis                      # cache backend: redis, memory, or none
  redis_url: redis://localhost:6379/0
  cache_ttl: 3600                  # time to live (seconds)
  cache_key_prefix: litellm        # cache-key prefix
  enable_cache_for_stream: false   # cache streamed responses?
  cache_control_headers: true      # honor cache-control headers
```
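The effect of `cache_ttl` can be illustrated with a small in-memory sketch (not LiteLLM's implementation — Redis plays this role in production). The cache key is a hash of the model plus the messages, so an identical request within the TTL is served without calling the provider:

```python
import hashlib
import json
import time

class TTLCache:
    """In-memory response cache keyed by (model, messages); a sketch of the
    behavior configured by cache_ttl above."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable clock, handy for testing
        self.store = {}      # key -> (expires_at, response)

    def _key(self, model: str, messages) -> str:
        raw = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def get(self, model, messages):
        hit = self.store.get(self._key(model, messages))
        if hit is None:
            return None
        expires_at, response = hit
        if self.clock() >= expires_at:        # entry expired: evict and miss
            del self.store[self._key(model, messages)]
            return None
        return response

    def put(self, model, messages, response):
        self.store[self._key(model, messages)] = (self.clock() + self.ttl, response)

cache = TTLCache(ttl_seconds=3600)
messages = [{"role": "user", "content": "Hello"}]
cache.put("claude-sonnet-4", messages, {"text": "Hi!"})
print(cache.get("claude-sonnet-4", messages))  # cache hit within the TTL
```

Note that `enable_cache_for_stream: false` above exists precisely because streamed responses are awkward to key and replay this way.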
### 2. Rate-Limit Configuration
```yaml
# Rate-limit settings
rate_limit_settings:
  enabled: true
  strategy: sliding_window   # sliding_window, token_bucket, or fixed_window
  limits:
    - api_key: sk-team-a-key-123
      rpm: 100      # requests per minute
      tpm: 10000    # tokens per minute
      rpd: 10000    # requests per day
    - api_key: sk-team-b-key-456
      rpm: 50
      tpm: 5000
      rpd: 5000
  default_limits:
    rpm: 10
    tpm: 1000
    rpd: 100
  burst_size: 20    # burst allowance
```

### 3. Budget Control Configuration
```yaml
# Budget settings
budget_settings:
  enabled: true
  currency: USD
  budgets:
    - name: team-a-budget
      api_keys:
        - sk-team-a-key-123
      limit: 1000.0
      period: monthly
      alert_threshold: 0.8   # alert at 80% of the budget
      hard_limit: true       # block requests once the limit is reached
    - name: team-b-budget
      api_keys:
        - sk-team-b-key-456
      limit: 500.0
      period: monthly
      alert_threshold: 0.9
      hard_limit: false
  cost_tracking:
    enabled: true
    update_interval: 60      # update interval (seconds)
    storage: database        # storage backend: database or file
```
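The interaction between `alert_threshold` and `hard_limit` above can be sketched as follows (a hypothetical class, not LiteLLM's internals): spend accumulates per budget, an alert fires once usage crosses the threshold, and requests are rejected only when `hard_limit` is true:

```python
class Budget:
    """Sketch of the budget semantics configured above."""

    def __init__(self, limit: float, alert_threshold: float, hard_limit: bool):
        self.limit = limit
        self.alert_threshold = alert_threshold
        self.hard_limit = hard_limit
        self.spent = 0.0
        self.alerted = False

    def charge(self, cost: float) -> bool:
        """Record a request's cost; return False if the request should be blocked."""
        if self.hard_limit and self.spent + cost > self.limit:
            return False                     # hard limit reached: block
        self.spent += cost
        if not self.alerted and self.spent >= self.limit * self.alert_threshold:
            self.alerted = True              # would trigger a Slack/email alert
        return True

team_a = Budget(limit=1000.0, alert_threshold=0.8, hard_limit=True)
team_a.charge(850.0)
print(team_a.alerted)   # → True (850 >= 80% of 1000)
print(team_a.charge(200.0))  # → False (would exceed the hard limit)
```

With `hard_limit: false` (team-b above), `charge` keeps returning true past the limit; the budget is then purely an alerting mechanism.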
### 4. Monitoring and Alerting Configuration
```yaml
# Monitoring settings
monitoring_settings:
  prometheus:
    enabled: true
    port: 9090
    metrics:
      - request_count
      - request_duration
      - error_count
      - cache_hit_rate
      - token_usage
      - cost
  grafana:
    enabled: true
    dashboard_url: http://localhost:3000/d/litellm
  alerts:
    slack:
      enabled: true
      webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
      channels:
        - litellm-alerts
        - devops-notifications
      alert_rules:
        - name: high_error_rate
          condition: error_rate > 0.05
          duration: 5m
          severity: warning
        - name: high_latency
          condition: p99_latency > 5000
          duration: 2m
          severity: critical
        - name: budget_exceeded
          condition: budget_usage > 1.0
          severity: critical
    email:
      enabled: true
      smtp_server: smtp.gmail.com
      smtp_port: 587
      smtp_username: alerts@company.com
      smtp_password: ${SMTP_PASSWORD}
      from_address: litellm-alerts@company.com
      to_addresses:
        - devops@company.com
        - finance@company.com
```

## 33.2.4 Integrating Claude Code
### 1. Configuring Claude Code to Use LiteLLM
```bash
# Method 1: unified endpoint (recommended)
export ANTHROPIC_BASE_URL=https://litellm-server:4000
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Method 2: Anthropic-format endpoint
export ANTHROPIC_BASE_URL=https://litellm-server:4000/anthropic
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Method 3: API key helper
# Create the helper script
cat > ~/bin/get-litellm-key.sh << 'EOF'
#!/bin/bash
# Fetch the key from Vault
vault kv get -field=api_key secret/litellm/claude-code
EOF
chmod +x ~/bin/get-litellm-key.sh

# Configure Claude Code to use the helper
cat > ~/.claude-code/settings.json << EOF
{
  "apiKeyHelper": "~/bin/get-litellm-key.sh",
  "env": {
    "ANTHROPIC_BASE_URL": "https://litellm-server:4000"
  }
}
EOF
```
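The `apiKeyHelper` mechanism in method 3 is simple to reason about: the helper is just an executable whose stdout becomes the auth token, so rotating the key in Vault never requires touching client config. A sketch of that resolution step (assumes a POSIX shell; the Vault-backed script is replaced by a stub):

```python
import os
import stat
import subprocess
import tempfile

def resolve_api_key(helper_path: str) -> str:
    """Run an apiKeyHelper-style executable; its stdout becomes the token."""
    path = os.path.expanduser(helper_path)
    proc = subprocess.run([path], capture_output=True, text=True, check=True)
    return proc.stdout.strip()

# Stub helper standing in for the Vault-backed script above
fd, helper = tempfile.mkstemp(suffix=".sh")
with os.fdopen(fd, "w") as f:
    f.write("#!/bin/sh\necho sk-litellm-demo-key\n")
os.chmod(helper, os.stat(helper).st_mode | stat.S_IXUSR)

print(resolve_api_key(helper))  # → sk-litellm-demo-key
```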
### 2. Validating the Configuration
```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import requests


@dataclass
class ValidationResult:
    """Outcome of a single validation check."""
    success: bool = False
    message: str = ""
    error: str = ""


@dataclass
class ValidationReport:
    """Aggregated results for all checks."""
    connection: Optional[ValidationResult] = None
    models: Dict[str, ValidationResult] = field(default_factory=dict)
    summary: str = ""


class LiteLLMValidator:
    """Validates a LiteLLM deployment."""

    def __init__(self, gateway_url: str, auth_token: str):
        self.gateway_url = gateway_url
        self.auth_token = auth_token

    def validate_connection(self) -> ValidationResult:
        """Check the health endpoint."""
        result = ValidationResult()
        try:
            response = requests.get(
                f"{self.gateway_url}/health",
                headers={'Authorization': f'Bearer {self.auth_token}'},
                timeout=10
            )
            if response.status_code == 200:
                result.success = True
                result.message = "Connection successful"
            else:
                result.success = False
                result.message = f"Health check failed: {response.status_code}"
        except requests.exceptions.Timeout:
            result.success = False
            result.message = "Connection timeout"
        except requests.exceptions.ConnectionError:
            result.success = False
            result.message = "Connection error"
        except Exception as e:
            result.success = False
            result.message = f"Unexpected error: {str(e)}"
        return result

    def validate_model_access(self, model: str) -> ValidationResult:
        """Check that a model can be called through the gateway."""
        result = ValidationResult()
        try:
            response = requests.post(
                f"{self.gateway_url}/v1/completions",
                headers={
                    'Authorization': f'Bearer {self.auth_token}',
                    'Content-Type': 'application/json'
                },
                json={
                    'model': model,
                    'prompt': 'Hello',
                    'max_tokens': 10
                },
                timeout=30
            )
            if response.status_code == 200:
                result.success = True
                result.message = f"Model {model} accessible"
            else:
                result.success = False
                result.message = f"Model access failed: {response.status_code}"
                result.error = response.text
        except Exception as e:
            result.success = False
            result.message = f"Model access error: {str(e)}"
        return result

    def validate_all(self) -> ValidationReport:
        """Run every check and build a report."""
        report = ValidationReport()
        report.connection = self.validate_connection()
        for model in ['claude-sonnet-4', 'claude-opus-4', 'claude-haiku-4']:
            report.models[model] = self.validate_model_access(model)
        report.summary = self._generate_summary(report)
        return report

    def _generate_summary(self, report: ValidationReport) -> str:
        """Render the report as readable text."""
        summary = "LiteLLM Validation Summary:\n\n"
        summary += f"Connection: {'✓' if report.connection.success else '✗'} "
        summary += f"{report.connection.message}\n\n"
        summary += "Model Access:\n"
        for model, result in report.models.items():
            status = '✓' if result.success else '✗'
            summary += f"  {status} {model}: {result.message}\n"
        return summary
```

## 33.2.5 Monitoring and Maintenance
### 1. Prometheus Monitoring
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-server:9090']
    metrics_path: '/metrics'
```
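The scraped counters can also be consumed directly without Prometheus. A sketch that parses the Prometheus text exposition format and derives an error rate (the metric names follow the examples in this section and are assumptions, not LiteLLM's guaranteed names):

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text format into {metric_name: summed value}, ignoring labels."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blanks and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]     # strip the {label="..."} block
        metrics[name] = metrics.get(name, 0.0) + float(value)
    return metrics

sample = """\
# HELP litellm_request_count Total requests
# TYPE litellm_request_count counter
litellm_request_count{model="claude-sonnet-4"} 180
litellm_request_count{model="claude-haiku-4"} 20
litellm_error_count 4
"""
m = parse_metrics(sample)
error_rate = m["litellm_error_count"] / m["litellm_request_count"]
print(error_rate)  # → 0.02
```

This is the same ratio the `high_error_rate` alert rule above evaluates; a 0.02 error rate sits below its 0.05 threshold.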
### 2. Grafana Dashboard
```json
{
"dashboard": {
"title": "LiteLLM Dashboard",
"panels": [
{
"title": "Request Rate",
"targets": [
{
"expr": "rate(litellm_request_count[1m])"
}
]
},
{
"title": "Error Rate",
"targets": [
{
"expr": "rate(litellm_error_count[1m]) / rate(litellm_request_count[1m])"
}
]
},
{
"title": "P99 Latency",
"targets": [
{
"expr": "histogram_quantile(0.99, rate(litellm_request_duration_bucket[1m]))"
}
]
},
{
"title": "Cache Hit Rate",
"targets": [
{
"expr": "rate(litellm_cache_hits[1m]) / rate(litellm_cache_requests[1m])"
}
]
},
{
"title": "Token Usage",
"targets": [
{
"expr": "rate(litellm_token_usage[1m])"
}
]
},
{
"title": "Cost",
"targets": [
{
"expr": "litellm_cost_total"
}
]
}
]
}
}
```

### 3. Log Management
```python
import logging
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

import numpy as np

logger = logging.getLogger(__name__)


@dataclass
class LogAnalysis:
    """Aggregate statistics extracted from the logs."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    error_rate: float = 0.0
    avg_latency: float = 0.0
    p50_latency: float = 0.0
    p95_latency: float = 0.0
    p99_latency: float = 0.0
    total_tokens: int = 0
    total_cost: float = 0.0


class LiteLLMLogManager:
    """Analyzes LiteLLM proxy logs."""

    def __init__(self, log_file: str):
        self.log_file = log_file
        self.log_parser = LiteLLMLogParser()

    def analyze_logs(self,
                     start_time: Optional[datetime] = None,
                     end_time: Optional[datetime] = None) -> LogAnalysis:
        """Analyze the log file, optionally restricted to a time window."""
        analysis = LogAnalysis()

        # Read the log file
        with open(self.log_file, 'r') as f:
            logs = f.readlines()

        # Parse each line, skipping malformed entries
        parsed_logs = []
        for log in logs:
            try:
                parsed_logs.append(self.log_parser.parse(log))
            except Exception as e:
                logger.warning(f"Failed to parse log: {e}")

        # Filter by time window
        if start_time or end_time:
            parsed_logs = [
                log for log in parsed_logs
                if (not start_time or log.timestamp >= start_time) and
                   (not end_time or log.timestamp <= end_time)
            ]

        # Request counts and error rate
        analysis.total_requests = len(parsed_logs)
        analysis.successful_requests = sum(
            1 for log in parsed_logs if log.status == 'success'
        )
        analysis.failed_requests = sum(
            1 for log in parsed_logs if log.status == 'error'
        )
        analysis.error_rate = (
            analysis.failed_requests / analysis.total_requests
            if analysis.total_requests > 0 else 0
        )

        # Latency percentiles
        latencies = [log.duration for log in parsed_logs if log.duration]
        if latencies:
            analysis.avg_latency = sum(latencies) / len(latencies)
            analysis.p50_latency = np.percentile(latencies, 50)
            analysis.p95_latency = np.percentile(latencies, 95)
            analysis.p99_latency = np.percentile(latencies, 99)

        # Token usage
        analysis.total_tokens = sum(
            log.input_tokens + log.output_tokens
            for log in parsed_logs
        )

        # Cost
        analysis.total_cost = sum(log.cost for log in parsed_logs)
        return analysis

    def generate_report(self, analysis: LogAnalysis) -> str:
        """Render the analysis as a text report."""
        report = "LiteLLM Log Analysis Report\n"
        report += "=" * 50 + "\n\n"
        report += "Request Summary:\n"
        report += f"  Total: {analysis.total_requests}\n"
        report += f"  Successful: {analysis.successful_requests}\n"
        report += f"  Failed: {analysis.failed_requests}\n"
        report += f"  Error Rate: {analysis.error_rate:.2%}\n\n"
        report += "Latency (ms):\n"
        report += f"  Average: {analysis.avg_latency:.0f}\n"
        report += f"  P50: {analysis.p50_latency:.0f}\n"
        report += f"  P95: {analysis.p95_latency:.0f}\n"
        report += f"  P99: {analysis.p99_latency:.0f}\n\n"
        report += "Token Usage:\n"
        report += f"  Total: {analysis.total_tokens:,}\n\n"
        report += "Cost:\n"
        report += f"  Total: ${analysis.total_cost:.2f}\n"
        return report
```