Skip to content

33.2 LiteLLM 閘道器部署

33.2.1 LiteLLM 簡介

LiteLLM 是一個開源的 LLM 閘道器,支援 100+ 個 LLM 提供商,包括 Anthropic、OpenAI、Cohere 等。它提供了統一的 API 介面,簡化了多提供商的使用和管理。

LiteLLM 的核心特性

  1. 多提供商支援 :支援 100+ LLM 提供商
  2. 統一 API :一致的 API 介面,簡化整合
  3. 智慧快取 :內建快取機制,減少成本和延遲
  4. 速率限制 :可配置的速率限制,控制使用
  5. 成本跟蹤 :詳細的使用情況和成本分析
  6. 負載均衡 :在多個 API 金鑰之間分配請求
  7. 失敗重試 :自動重試失敗的請求
  8. 流式響應 :支援流式輸出

LiteLLM 架構

┌─────────────────────────────────────────┐ │ Claude Code 客戶端 │ └─────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────┐ │ LiteLLM Proxy │ │ ┌──────────────────────────────┐ │ │ │ API 層 │ │ │ │ (Anthropic、OpenAI 等) │ │ │ └──────────────────────────────┘ │ │ ┌──────────────────────────────┐ │ │ │ 快取層 │ │ │ │ (Redis、Memcached) │ │ │ └──────────────────────────────┘ │ │ ┌──────────────────────────────┐ │ │ │ 監控層 │ │ │ │ (Prometheus、Grafana) │ │ │ └──────────────────────────────┘ │ └─────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────┐ │ LLM 提供商 │ │ (Anthropic、OpenAI、Cohere 等) │ └─────────────────────────────────────────┘

bash
## 33.2.2 安装和配置

### 1\. 安装 LiteLLM

#### 使用 Docker 安装(推荐)

    bash


    bash

    # 拉取 LiteLLM 镜像
    docker pull litellm/litellm:latest

    # 创建配置目录
    mkdir -p ~/litellm/config
    cd ~/litellm

    # 创建配置文件
    cat > config.yaml << EOF
    model_list:
      - model_name: claude-sonnet-4
        litellm_params:
          model: claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY

      - model_name: claude-opus-4
        litellm_params:
          model: claude-opus-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY

      - model_name: claude-haiku-4
        litellm_params:
          model: claude-haiku-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY

    litellm_settings:
      drop_params: true
      set_verbose: true

    general_settings:
      master_key: sk-litellm-master-key-123456
      database_url: postgresql://user:password@localhost:5432/litellm

    security_settings:
      valid_api_keys:
        - sk-team-a-key-123
        - sk-team-b-key-456
    EOF

    # 启动 LiteLLM
    docker run -d \
      --name litellm \
      -p 4000:4000 \
      -v $(pwd)/config.yaml:/app/config.yaml \
      -e ANTHROPIC_API_KEY=sk-ant-xxx \
      litellm/litellm:latest

    ```#### 使用 Python 安裝

    # 安装 LiteLLM
    pip install litellm[proxy]
    # 初始化配置
    litellm init
    # 编辑配置文件
    nano litellm_config.yaml
    # 启动代理服务器
    litellm proxy --config litellm_config.yaml --port 4000

### 2\. 配置文件详解

    yaml


```yaml

    ```yaml

    # litellm_config.yaml

    # 模型列表

    model_list:

      # Anthropic Claude 模型


      - model_name: claude-sonnet-4

        litellm_params:
          model: claude-sonnet-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
          api_base: https://api.anthropic.com
          max_tokens: 4096
          temperature: 0.7

      - model_name: claude-opus-4

        litellm_params:
          model: claude-opus-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
          max_tokens: 4096

      - model_name: claude-haiku-4

        litellm_params:
          model: claude-haiku-4-20250514
          api_key: os.environ/ANTHROPIC_API_KEY
          max_tokens: 4096

      # Amazon Bedrock 模型


      - model_name: bedrock-claude-sonnet

        litellm_params:
          model: anthropic.claude-sonnet-4-5-20250929-v1:0
          api_base: https://bedrock-runtime.us-east-1.amazonaws.com
          api_key: os.environ/AWS_ACCESS_KEY_ID
          aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
          aws_region_name: us-east-1

      # Google Vertex AI 模型


      - model_name: vertex-claude-sonnet

        litellm_params:
          model: claude-sonnet-4-5@20250929
          api_base: https://us-central1-aiplatform.googleapis.com
          api_key: os.environ/GOOGLE_APPLICATION_CREDENTIALS
          vertex_project: os.environ/VERTEX_PROJECT_ID
          vertex_location: us-central1

    # LiteLLM 設定

    litellm_settings:
      drop_params: true              # 刪除未使用的引數
      set_verbose: true              # 啟用詳細日誌
      json_logs: true               # JSON 格式日誌
      success_callback: http://localhost:5000/callback  # 成功回撥
      failure_callback: http://localhost:5000/failure  # 失敗回撥

    # 通用設定

    general_settings:
      master_key: sk-litellm-master-key-123456  # 主金鑰
      database_url: postgresql://user:password@localhost:5432/litellm  # 資料庫 URL
      cache: redis://localhost:6379  # Redis 快取
      cache_seconds: 3600  # 快取時間(秒)

    # 安全設定

    security_settings:
      valid_api_keys:  # 有效的 API 金鑰

        - sk-team-a-key-123
        - sk-team-b-key-456
        - sk-team-c-key-789

      max_budget: 1000.0  # 最大預算(美元)
      budget_duration: monthly  # 預算週期
      rpm_limit: 100  # 每分鐘請求數限制
      tpm_limit: 10000  # 每分鐘令牌數限制

    # 負載均衡設定

    load_balancing_settings:
      routing_strategy: usage-based  # 路由策略:usage-based, round-robin, least-latency
      health_check: true  # 啟用健康檢查
      health_check_interval: 60  # 健康檢查間隔(秒)

    # 監控設定

    monitoring_settings:
      enable_prometheus: true  # 啟用 Prometheus
      prometheus_port: 9090  # Prometheus 埠
      enable_slack_alerts: true  # 啟用 Slack 告警
      slack_webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
      alert_thresholds:
        error_rate: 0.05  # 錯誤率閾值
        latency_p99: 5000  # P99 延遲閾值(毫秒)

    ```## 33.2.3 高级配置

    ### 1. 缓存配置

    # 缓存设置
    cache_settings:
    type: redis  # 缓存类型:redis, memory, none
    redis_url: redis://localhost:6379/0
    cache_ttl: 3600  # 缓存生存时间(秒)
    cache_key_prefix: litellm  # 缓存键前缀
    enable_cache_for_stream: false  # 是否为流式响应启用缓存
    cache_control_headers: true  # 是否使用缓存控制头

### 2\. 速率限制配置

    yaml
yaml

    ```yaml

    # 速率限制設定

    rate_limit_settings:
      enabled: true
      strategy: sliding_window  # 策略:sliding_window, token_bucket, fixed_window
      limits:

        - api_key: sk-team-a-key-123

          rpm: 100  # 每分鐘請求數
          tpm: 10000  # 每分鐘令牌數
          rpd: 10000  # 每天請求數

        - api_key: sk-team-b-key-456

          rpm: 50
          tpm: 5000
          rpd: 5000
      default_limits:
        rpm: 10
        tpm: 1000
        rpd: 100
      burst_size: 20  # 突發大小

    ```### 3. 预算控制配置

    # 预算设置
    budget_settings:
    enabled: true
    currency: USD
    budgets:
    - name: team-a-budget
    api_keys:
    - sk-team-a-key-123
    limit: 1000.0
    period: monthly
    alert_threshold: 0.8  # 在 80% 时告警
    hard_limit: true  # 达到限制时阻止请求
    - name: team-b-budget
    api_keys:
    - sk-team-b-key-456
    limit: 500.0
    period: monthly
    alert_threshold: 0.9
    hard_limit: false
    cost_tracking:
    enabled: true
    update_interval: 60  # 更新间隔(秒)
    storage: database  # 存储方式:database, file

### 4\. 监控和告警配置

    yaml
bash

    ```yaml

    # 監控設定

    monitoring_settings:
      prometheus:
        enabled: true
        port: 9090
        metrics:

          - request_count
          - request_duration
          - error_count
          - cache_hit_rate
          - token_usage
          - cost

      grafana:
        enabled: true
        dashboard_url: http://localhost:3000/d/litellm

      alerts:
        slack:
          enabled: true
          webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
          channels:

            - litellm-alerts
            - devops-notifications

          alert_rules:

            - name: high_error_rate

              condition: error_rate > 0.05
              duration: 5m
              severity: warning

            - name: high_latency

              condition: p99_latency > 5000
              duration: 2m
              severity: critical

            - name: budget_exceeded

              condition: budget_usage > 1.0
              severity: critical

        email:
          enabled: true
          smtp_server: smtp.gmail.com
          smtp_port: 587
          smtp_username: alerts@company.com
          smtp_password: ${SMTP_PASSWORD}
          from_address: litellm-alerts@company.com
          to_addresses:

            - devops@company.com
            - finance@company.com

    ```## 33.2.4 集成 Claude Code

    ### 1. 配置 Claude Code 使用 LiteLLM

    # 方法 1:使用统一端点(推荐)
    export ANTHROPIC_BASE_URL=https://litellm-server:4000
    export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key
    # 方法 2:使用 Anthropic 格式端点
    export ANTHROPIC_BASE_URL=https://litellm-server:4000/anthropic
    export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# 方法 3:使用 API 密钥辅助程序

# 创建辅助程序脚本
cat > ~/bin/get-litellm-key.sh << 'EOF' #!/bin/bash

# 从 Vault 获取密钥

vault kv get -field=api_key secret/litellm/claude-code EOF chmod +x ~/bin/get-litellm-key.sh

# 配置 Claude Code 使用辅助程序

cat > ~~/.claude-code/settings.json << EOF { "apiKeyHelper": "~~/bin/get-litellm-key.sh", "env": { "ANTHROPIC_BASE_URL": "<https://litellm-server:4000>" } } EOF

    bash


    ### 2. 验证配置

    ```python

    ```python

    class LiteLLMValidator:
        """LiteLLM 验证器"""

        def __init__(self, gateway_url: str, auth_token: str):
            self.gateway_url = gateway_url
            self.auth_token = auth_token

        def validate_connection(self) -> ValidationResult:
            """验证连接"""
            result = ValidationResult()

            try:
                # 测试健康检查端点
                response = requests.get(
                    f"{self.gateway_url}/health",
                    headers={'Authorization': f'Bearer {self.auth_token}'},
                    timeout=10
                )

                if response.status_code == 200:
                    result.success = True
                    result.message = "Connection successful"
                else:
                    result.success = False
                    result.message = f"Health check failed: {response.status_code}"

            except requests.exceptions.Timeout:
                result.success = False
                result.message = "Connection timeout"
            except requests.exceptions.ConnectionError:
                result.success = False
                result.message = "Connection error"
            except Exception as e:
                result.success = False
                result.message = f"Unexpected error: {str(e)}"

            return result

        def validate_model_access(self, model: str) -> ValidationResult:
            """验证模型访问"""
            result = ValidationResult()

            try:
                # 测试模型访问
                response = requests.post(
                    f"{self.gateway_url}/v1/completions",
                    headers={
                        'Authorization': f'Bearer {self.auth_token}',
                        'Content-Type': 'application/json'
                    },
                    json={
                        'model': model,
                        'prompt': 'Hello',
                        'max_tokens': 10
                    },
                    timeout=30
                )

                if response.status_code == 200:
                    result.success = True
                    result.message = f"Model {model} accessible"
                else:
                    result.success = False
                    result.message = f"Model access failed: {response.status_code}"
                    result.error = response.text

            except Exception as e:
                result.success = False
                result.message = f"Model access error: {str(e)}"

            return result

        def validate_all(self) -> ValidationReport:
            """验证所有配置"""
            report = ValidationReport()

            # 验证连接
            report.connection = self.validate_connection()

            # 验证模型访问
            models = ['claude-sonnet-4', 'claude-opus-4', 'claude-haiku-4']
            report.models = {}

            for model in models:
                report.models[model] = self.validate_model_access(model)

            # 生成摘要
            report.summary = self._generate_summary(report)

            return report

        def _generate_summary(self, report: ValidationReport) -> str:
            """生成验证摘要"""
            summary = "LiteLLM Validation Summary:\n\n"

            summary += f"Connection: {'✓' if report.connection.success else '✗'} "
            summary += f"{report.connection.message}\n\n"

            summary += "Model Access:\n"
            for model, result in report.models.items():
                status = '✓' if result.success else '✗'
                summary += f"  {status} {model}: {result.message}\n"

            return summary

    ```## 33.2.5 監控和維護

```python
    ### 1. Prometheus 监控

    # prometheus.yml
    global:
    scrape_interval: 15s
    evaluation_interval: 15s
    scrape_configs:
    - job_name: 'litellm'
    static_configs:
    - targets: ['litellm-server:9090']
    metrics_path: '/metrics'

### 2\. Grafana 仪表板

    json


    ```json

```python
    {
      "dashboard": {
        "title": "LiteLLM Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "rate(litellm_request_count[1m])"
              }
            ]
          },
          {
            "title": "Error Rate",
            "targets": [
              {
                "expr": "rate(litellm_error_count[1m]) / rate(litellm_request_count[1m])"
              }
            ]
          },
          {
            "title": "P99 Latency",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, rate(litellm_request_duration_bucket[1m]))"
              }
            ]
          },
          {
            "title": "Cache Hit Rate",
            "targets": [
              {
                "expr": "rate(litellm_cache_hits[1m]) / rate(litellm_cache_requests[1m])"
              }
            ]
          },
          {
            "title": "Token Usage",
            "targets": [
              {
                "expr": "rate(litellm_token_usage[1m])"
              }
            ]
          },
          {
            "title": "Cost",
            "targets": [
              {
                "expr": "litellm_cost_total"
              }
            ]
          }
        ]
      }
    }

    ```### 3. 日誌管理

    class LiteLLMLogManager:
    """LiteLLM 日誌管理器"""
    def __init__(self, log_file: str):
    self.log_file = log_file
    self.log_parser = LiteLLMLogParser()
    def analyze_logs(self,
    start_time: datetime = None,
    end_time: datetime = None) -> LogAnalysis:
    """分析日誌"""
    analysis = LogAnalysis()

    # 讀取日誌檔案

    with open(self.log_file, 'r') as f:
    logs = f.readlines()

    # 解析日誌

    parsed_logs = []
    for log in logs:
    try:
    parsed = self.log_parser.parse(log)
    parsed_logs.append(parsed)
    except Exception as e:
    logger.warning(f"Failed to parse log: {e}")

    # 過濾時間範圍

    if start_time or end_time:
    parsed_logs = [
    log for log in parsed_logs
    if (not start_time or log.timestamp >= start_time) and
    (not end_time or log.timestamp <= end_time)
    ]

    # 分析日誌

    analysis.total_requests = len(parsed_logs)
    analysis.successful_requests = sum(

    1 for log in parsed_logs if log.status == 'success'

    )
    analysis.failed_requests = sum(

    1 for log in parsed_logs if log.status == 'error'

    )
    analysis.error_rate = (
    analysis.failed_requests / analysis.total_requests
    if analysis.total_requests > 0 else 0
    )

    # 分析延遲

    latencies = [log.duration for log in parsed_logs if log.duration]
    if latencies:
    analysis.avg_latency = sum(latencies) / len(latencies)
    analysis.p50_latency = np.percentile(latencies, 50)
    analysis.p95_latency = np.percentile(latencies, 95)
    analysis.p99_latency = np.percentile(latencies, 99)

    # 分析令牌使用

    analysis.total_tokens = sum(
    log.input_tokens + log.output_tokens
    for log in parsed_logs
    )

    # 分析成本

    analysis.total_cost = sum(log.cost for log in parsed_logs)
    return analysis
    def generate_report(self, analysis: LogAnalysis) -> str:
    """生成報告"""
    report = "LiteLLM Log Analysis Report\n"
    report += "=" * 50 + "\n\n"
    report += "Request Summary:\n"
    report += f"  Total: {analysis.total_requests}\n"
    report += f"  Successful: {analysis.successful_requests}\n"
    report += f"  Failed: {analysis.failed_requests}\n"
    report += f"  Error Rate: {analysis.error_rate:.2%}\n\n"
    report += "Latency (ms):\n"
    report += f"  Average: {analysis.avg_latency:.0f}\n"
    report += f"  P50: {analysis.p50_latency:.0f}\n"
    report += f"  P95: {analysis.p95_latency:.0f}\n"
    report += f"  P99: {analysis.p99_latency:.0f}\n\n"
    report += "Token Usage:\n"
    report += f"  Total: {analysis.total_tokens:,}\n\n"
    report += "Cost:\n"
    report += f"  Total: ${analysis.total_cost:.2f}\n"
    return report

基于 MIT 许可发布 | 永久导航