34.3 企业级监控与维护

学习如何建立企业级监控和维护体系，确保 Claude Code 在生产环境中的稳定运行和持续优化。

34.3.1 监控体系概述

监控的重要性

企业级监控对于 Claude Code 部署至关重要，它可以帮助：

确保可用性：及时发现和解决服务中断
优化性能：识别性能瓶颈并优化资源使用
安全防护：检测异常行为和安全威胁
成本控制：监控使用情况和资源消耗
合规审计：满足企业合规要求

监控维度

python

# 企业级监控维度
MONITORING_DIMENSIONS = {
    "可用性监控": {
        "指标": ["服务状态", "响应时间", "错误率"],
        "目标": "99.9% 可用性"
    },
    "性能监控": {
        "指标": ["API 延迟", "令牌使用", "并发连接"],
        "目标": "P95 延迟 < 2s"
    },
    "资源监控": {
        "指标": ["CPU 使用率", "内存使用", "磁盘 I/O", "网络带宽"],
        "目标": "资源利用率 < 80%"
    },
    "安全监控": {
        "指标": ["异常访问", "权限违规", "数据泄露"],
        "目标": "零安全事件"
    },
    "成本监控": {
        "指标": ["API 调用成本", "令牌成本", "基础设施成本"],
        "目标": "成本控制在预算内"
    }
}

34.3.2 指标收集

Prometheus 配置

yaml

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Claude Code API 监控
  - job_name: 'claude-code-api'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # LLM 网关监控
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:4000']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # 开发容器监控
  - job_name: 'dev-containers'
    static_configs:
      - targets: ['localhost:9323']
    metrics_path: '/metrics'
    scrape_interval: 30s

  # 沙箱监控
  - job_name: 'sandbox'
    static_configs:
      - targets: ['localhost:9100']
    metrics_path: '/metrics'
    scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

自定义指标导出器

python

# claude_code_exporter.py
from prometheus_client import start_http_server, Gauge, Counter, Histogram
import time
import json
import requests
from datetime import datetime

# 定义指标
api_requests_total = Counter(
    'claude_code_api_requests_total',
    'Total API requests',
    ['endpoint', 'status']
)

api_latency = Histogram(
    'claude_code_api_latency_seconds',
    'API request latency',
    ['endpoint']
)

active_sessions = Gauge(
    'claude_code_active_sessions',
    'Number of active sessions'
)

tokens_used = Counter(
    'claude_code_tokens_used_total',
    'Total tokens used',
    ['model', 'type']
)

cost_incurred = Gauge(
    'claude_code_cost_usd',
    'Total cost incurred in USD'
)

class ClaudeCodeMetricsCollector:
    def __init__(self, api_base_url='http://localhost:8080'):
        self.api_base_url = api_base_url
        self.start_time = datetime.now()

    def collect_api_metrics(self):
        """收集 API 指标"""
        try:
            # 获取 API 状态
            response = requests.get(f'{self.api_base_url}/health')
            if response.status_code == 200:
                data = response.json()

                # 更新活跃会话数
                active_sessions.set(data.get('active_sessions', 0))

                # 更新令牌使用
                tokens = data.get('tokens_used', {})
                for model, count in tokens.items():
                    tokens_used.labels(model=model, type='input').inc(count.get('input', 0))
                    tokens_used.labels(model=model, type='output').inc(count.get('output', 0))

                # 更新成本
                cost_incurred.set(data.get('total_cost', 0.0))
        except Exception as e:
            print(f"Error collecting API metrics: {e}")

    def collect_performance_metrics(self):
        """收集性能指标"""
        try:
            # 测试 API 延迟
            start_time = time.time()
            response = requests.get(f'{self.api_base_url}/health')
            latency = time.time() - start_time

            # 记录延迟
            api_latency.labels(endpoint='/health').observe(latency)

            # 记录请求
            api_requests_total.labels(
                endpoint='/health',
                status=response.status_code
            ).inc()
        except Exception as e:
            print(f"Error collecting performance metrics: {e}")

    def collect_sandbox_metrics(self):
        """收集沙箱指标"""
        try:
            response = requests.get(f'{self.api_base_url}/sandbox/status')
            if response.status_code == 200:
                data = response.json()
                # 沙箱违规计数
                violations = data.get('violations', 0)
                # 可以添加更多沙箱相关指标
        except Exception as e:
            print(f"Error collecting sandbox metrics: {e}")

    def run(self, interval=10):
        """运行指标收集器"""
        start_http_server(9100)
        print("Metrics server started on port 9100")

        while True:
            self.collect_api_metrics()
            self.collect_performance_metrics()
            self.collect_sandbox_metrics()
            time.sleep(interval)

if __name__ == '__main__':
    collector = ClaudeCodeMetricsCollector()
    collector.run()

日志收集配置

yaml

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/claude-code/*.log
    fields:
      service: claude-code
      environment: production
    fields_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/llm-gateway/*.log
    fields:
      service: llm-gateway
      environment: production
    fields_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/claude-sandbox/*.log
    fields:
      service: claude-sandbox
      environment: production
    fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "claude-code-%{+yyyy.MM.dd}"

setup.kibana:
  host: "kibana:5601"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

34.3.3 告警配置

Prometheus 告警规则

yaml

# alert_rules.yml
groups:
  - name: claude_code_alerts
    interval: 30s
    rules:
      # 服务可用性告警
      - alert: ClaudeCodeServiceDown
        expr: up{job="claude-code-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Claude Code 服务不可用"
          description: "Claude Code API 服务已宕机超过 1 分钟"

      # API 错误率告警
      - alert: HighAPIErrorRate
        expr: |
          rate(claude_code_api_requests_total{status=~"5.."}[5m]) /
          rate(claude_code_api_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API 错误率过高"
          description: "API 错误率超过 5% (当前: {{ $value }})"

      # API 延迟告警
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            rate(claude_code_api_latency_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API 延迟过高"
          description: "API P95 延迟超过 2 秒 (当前: {{ $value }}s)"

      # 令牌使用告警
      - alert: HighTokenUsage
        expr: |
          rate(claude_code_tokens_used_total[1h]) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "令牌使用率过高"
          description: "令牌使用率超过 100,000/小时 (当前: {{ $value }})"

      # 成本告警
      - alert: HighCostIncurred
        expr: claude_code_cost_usd > 1000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "成本超过阈值"
          description: "累计成本超过 $1000 (当前: ${{ $value }})"

      # 沙箱违规告警
      - alert: SandboxViolations
        expr: |
          rate(claude_sandbox_violations_total[5m]) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "沙箱违规频繁"
          description: "沙箱违规率超过 10/分钟 (当前: {{ $value }})"

      # 资源使用告警
      - alert: HighCPUUsage
        expr: |
          rate(process_cpu_seconds_total{job="claude-code-api"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU 使用率过高"
          description: "CPU 使用率超过 80% (当前: {{ $value }})"

      - alert: HighMemoryUsage
        expr: |
          process_resident_memory_bytes{job="claude-code-api"} /
          node_memory_MemTotal_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率过高"
          description: "内存使用率超过 80% (当前: {{ $value }})"

Alertmanager 配置

yaml

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: false

    - match:
        severity: warning
      receiver: 'warning-alerts'
      continue: false

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#critical-alerts'
        title: 'Claude Code Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#warnings'
        title: 'Claude Code Warning'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']

34.3.4 可视化仪表板

Grafana 仪表板配置

json

{
  "dashboard": {
    "title": "Claude Code Enterprise Dashboard",
    "panels": [
      {
        "title": "API 请求速率",
        "targets": [
          {
            "expr": "rate(claude_code_api_requests_total[5m])",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "API 延迟 (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(claude_code_api_latency_seconds_bucket[5m]))",
            "legendFormat": "P95"
          }
        ],
        "type": "graph"
      },
      {
        "title": "活跃会话数",
        "targets": [
          {
            "expr": "claude_code_active_sessions",
            "legendFormat": "Sessions"
          }
        ],
        "type": "stat"
      },
      {
        "title": "令牌使用率",
        "targets": [
          {
            "expr": "rate(claude_code_tokens_used_total[1h])",
            "legendFormat": "{{ model }} - {{ type }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "累计成本",
        "targets": [
          {
            "expr": "claude_code_cost_usd",
            "legendFormat": "Cost (USD)"
          }
        ],
        "type": "stat"
      },
      {
        "title": "API 错误率",
        "targets": [
          {
            "expr": "rate(claude_code_api_requests_total{status=~\"5..\"}[5m]) / rate(claude_code_api_requests_total[5m])",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph"
      },
      {
        "title": "沙箱违规",
        "targets": [
          {
            "expr": "rate(claude_sandbox_violations_total[5m])",
            "legendFormat": "Violations/min"
          }
        ],
        "type": "graph"
      },
      {
        "title": "资源使用",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total{job=\"claude-code-api\"}[5m])",
            "legendFormat": "CPU"
          },
          {
            "expr": "process_resident_memory_bytes{job=\"claude-code-api\"} / 1024 / 1024 / 1024",
            "legendFormat": "Memory (GB)"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

34.3.5 日志分析

ELK Stack 配置

python

# log_analyzer.py
import elasticsearch
from elasticsearch import Elasticsearch
from datetime import datetime, timedelta
import json

class ClaudeCodeLogAnalyzer:
    def __init__(self, es_host='http://localhost:9200'):
        self.es = Elasticsearch([es_host])
        self.index_pattern = 'claude-code-*'

    def search_errors(self, hours=24):
        """搜索错误日志"""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"level": "ERROR"}},
                        {"range": {
                            "@timestamp": {
                                "gte": (datetime.now() - timedelta(hours=hours)).isoformat()
                            }
                        }}
                    ]
                }
            }
        }

        response = self.es.search(index=self.index_pattern, body=query)
        return response['hits']['hits']

    def search_slow_requests(self, threshold_seconds=2, hours=24):
        """搜索慢请求"""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"range": {
                            "latency": {
                                "gte": threshold_seconds
                            }
                        }},
                        {"range": {
                            "@timestamp": {
                                "gte": (datetime.now() - timedelta(hours=hours)).isoformat()
                            }
                        }}
                    ]
                }
            }
        }

        response = self.es.search(index=self.index_pattern, body=query)
        return response['hits']['hits']

    def analyze_user_activity(self, user_id, days=7):
        """分析用户活动"""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"user_id": user_id}},
                        {"range": {
                            "@timestamp": {
                                "gte": (datetime.now() - timedelta(days=days)).isoformat()
                            }
                        }}
                    ]
                }
            },
            "aggs": {
                "daily_requests": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "calendar_interval": "day"
                    },
                    "aggs": {
                        "total_tokens": {
                            "sum": {
                                "field": "tokens_used"
                            }
                        }
                    }
                }
            }
        }

        response = self.es.search(index=self.index_pattern, body=query)
        return response

    def detect_anomalies(self, hours=1):
        """检测异常"""
        # 计算平均请求速率
        avg_query = {
            "query": {
                "range": {
                    "@timestamp": {
                        "gte": (datetime.now() - timedelta(hours=hours*2)).isoformat(),
                        "lt": (datetime.now() - timedelta(hours=hours)).isoformat()
                    }
                }
            },
            "aggs": {
                "avg_rate": {
                    "avg": {
                        "script": {
                            "source": "doc['request_count'].value"
                        }
                    }
                }
            }
        }

        avg_response = self.es.search(index=self.index_pattern, body=avg_query)
        avg_rate = avg_response['aggregations']['avg_rate']['value']

        # 检查当前速率是否异常
        current_query = {
            "query": {
                "range": {
                    "@timestamp": {
                        "gte": (datetime.now() - timedelta(hours=hours)).isoformat()
                    }
                }
            },
            "aggs": {
                "current_rate": {
                    "avg": {
                        "script": {
                            "source": "doc['request_count'].value"
                        }
                    }
                }
            }
        }

        current_response = self.es.search(index=self.index_pattern, body=current_query)
        current_rate = current_response['aggregations']['current_rate']['value']

        # 如果当前速率超过平均值的 2 倍，视为异常
        if current_rate > avg_rate * 2:
            return {
                "anomaly": True,
                "avg_rate": avg_rate,
                "current_rate": current_rate,
                "threshold": avg_rate * 2
            }

        return {"anomaly": False}

# 使用示例
analyzer = ClaudeCodeLogAnalyzer()

# 搜索错误
errors = analyzer.search_errors(hours=24)
print(f"发现 {len(errors)} 个错误")

# 搜索慢请求
slow_requests = analyzer.search_slow_requests(threshold_seconds=2, hours=24)
print(f"发现 {len(slow_requests)} 个慢请求")

# 分析用户活动
user_activity = analyzer.analyze_user_activity(user_id="user123", days=7)

# 检测异常
anomalies = analyzer.detect_anomalies(hours=1)
if anomalies['anomaly']:
    print(f"检测到异常！当前速率: {anomalies['current_rate']}, 阈值: {anomalies['threshold']}")

34.3.6 维护策略

定期维护任务

bash

#!/bin/bash
# maintenance.sh
set -e

LOG_DIR="/var/log/claude-code"
BACKUP_DIR="/backup/claude-code"
DATE=$(date +%Y-%m-%d)

echo "=== Claude Code 维护脚本 - $DATE ==="

# 1. 日志轮转
echo "执行日志轮转..."
logrotate -f /etc/logrotate.d/claude-code

# 2. 清理旧日志
echo "清理 30 天前的日志..."
find $LOG_DIR -name "*.log" -mtime +30 -delete

# 3. 备份配置
echo "备份配置文件..."
mkdir -p $BACKUP_DIR/$DATE
cp -r /etc/claude-code $BACKUP_DIR/$DATE/

# 4. 清理缓存
echo "清理缓存..."
rm -rf /tmp/claude-code-cache/*

# 5. 数据库维护（如果使用）
echo "执行数据库维护..."
# psql -U claude -d claude_code -c "VACUUM ANALYZE;"

# 6. 生成维护报告
echo "生成维护报告..."
cat > $BACKUP_DIR/$DATE/maintenance-report.txt << EOF
Claude Code 维护报告
日期: $DATE

日志轮转: 完成
旧日志清理: 完成
配置备份: 完成
缓存清理: 完成
数据库维护: 完成

磁盘使用情况:
$(df -h /var/log/claude-code)

服务状态:
$(systemctl status claude-code --no-pager)
EOF

echo "维护完成！报告已保存到 $BACKUP_DIR/$DATE/maintenance-report.txt"

健康检查脚本

python

# health_check.py
import requests
import json
import sys
from datetime import datetime

class ClaudeCodeHealthChecker:
    def __init__(self, api_base_url='http://localhost:8080'):
        self.api_base_url = api_base_url
        self.checks = []

    def check_api_health(self):
        """检查 API 健康状态"""
        try:
            response = requests.get(f'{self.api_base_url}/health', timeout=5)
            if response.status_code == 200:
                data = response.json()
                self.checks.append({
                    "name": "API Health",
                    "status": "healthy",
                    "details": data
                })
                return True
            else:
                self.checks.append({
                    "name": "API Health",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "API Health",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_llm_gateway(self):
        """检查 LLM 网关"""
        try:
            response = requests.get('http://localhost:4000/health', timeout=5)
            if response.status_code == 200:
                self.checks.append({
                    "name": "LLM Gateway",
                    "status": "healthy",
                    "details": response.json()
                })
                return True
            else:
                self.checks.append({
                    "name": "LLM Gateway",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "LLM Gateway",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_sandbox(self):
        """检查沙箱状态"""
        try:
            response = requests.get(f'{self.api_base_url}/sandbox/status', timeout=5)
            if response.status_code == 200:
                data = response.json()
                self.checks.append({
                    "name": "Sandbox",
                    "status": "healthy",
                    "details": data
                })
                return True
            else:
                self.checks.append({
                    "name": "Sandbox",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "Sandbox",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_disk_space(self, threshold=90):
        """检查磁盘空间"""
        import shutil
        usage = shutil.disk_usage('/')
        percent = (usage.used / usage.total) * 100

        if percent < threshold:
            self.checks.append({
                "name": "Disk Space",
                "status": "healthy",
                "details": f"Usage: {percent:.1f}%"
            })
            return True
        else:
            self.checks.append({
                "name": "Disk Space",
                "status": "unhealthy",
                "details": f"Usage: {percent:.1f}% (Threshold: {threshold}%)"
            })
            return False

    def check_memory(self, threshold=90):
        """检查内存使用"""
        import psutil
        percent = psutil.virtual_memory().percent

        if percent < threshold:
            self.checks.append({
                "name": "Memory",
                "status": "healthy",
                "details": f"Usage: {percent:.1f}%"
            })
            return True
        else:
            self.checks.append({
                "name": "Memory",
                "status": "unhealthy",
                "details": f"Usage: {percent:.1f}% (Threshold: {threshold}%)"
            })
            return False

    def run_all_checks(self):
        """运行所有检查"""
        self.check_api_health()
        self.check_llm_gateway()
        self.check_sandbox()
        self.check_disk_space()
        self.check_memory()

        return self.checks

    def generate_report(self):
        """生成健康检查报告"""
        report = {
            "timestamp": datetime.now().isoformat(),
            "overall_status": "healthy",
            "checks": self.checks
        }

        # 确定整体状态
        for check in self.checks:
            if check['status'] == 'unhealthy':
                report['overall_status'] = 'unhealthy'
                break

        return report

    def print_report(self):
        """打印报告"""
        report = self.generate_report()

        print("=" * 50)
        print(f"Claude Code 健康检查报告")
        print(f"时间: {report['timestamp']}")
        print(f"整体状态: {report['overall_status'].upper()}")
        print("=" * 50)

        for check in report['checks']:
            status_icon = "✓" if check['status'] == 'healthy' else "✗"
            print(f"{status_icon} {check['name']}: {check['status']}")
            print(f"  详情: {check['details']}")
            print()

        return report['overall_status'] == 'healthy'

if __name__ == '__main__':
    checker = ClaudeCodeHealthChecker()
    checker.run_all_checks()
    is_healthy = checker.print_report()

    sys.exit(0 if is_healthy else 1)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182

34.3.7 灾难恢复

备份策略

bash

#!/bin/bash
# backup.sh
set -e

BACKUP_DIR="/backup/claude-code"
DATE=$(date +%Y-%m-%d_%H-%M-%S)
BACKUP_PATH="$BACKUP_DIR/$DATE"

echo "=== Claude Code 备份脚本 - $DATE ==="

# 创建备份目录
mkdir -p $BACKUP_PATH

# 1. 备份配置文件
echo "备份配置文件..."
tar -czf $BACKUP_PATH/config.tar.gz /etc/claude-code

# 2. 备份数据库
echo "备份数据库..."
# pg_dump -U claude claude_code > $BACKUP_PATH/database.sql

# 3. 备份日志
echo "备份日志..."
tar -czf $BACKUP_PATH/logs.tar.gz /var/log/claude-code

# 4. 备份用户数据
echo "备份用户数据..."
tar -czf $BACKUP_PATH/user-data.tar.gz /var/lib/claude-code

# 5. 生成备份清单
echo "生成备份清单..."
cat > $BACKUP_PATH/manifest.txt << EOF
备份清单
日期: $DATE
配置文件: config.tar.gz
数据库: database.sql
日志: logs.tar.gz
用户数据: user-data.tar.gz

文件大小:
$(du -sh $BACKUP_PATH/*)
EOF

# 6. 上传到远程存储（可选）
echo "上传到远程存储..."
# aws s3 cp $BACKUP_PATH s3://company-backups/claude-code/$DATE --recursive

# 7. 清理旧备份（保留最近 30 天）
echo "清理旧备份..."
find $BACKUP_DIR -type d -mtime +30 -exec rm -rf {} \;

echo "备份完成！备份位置: $BACKUP_PATH"

恢复脚本

bash

#!/bin/bash
# restore.sh

set -e

if [ -z "$1" ]; then
    echo "用法: $0 <备份目录>"
    exit 1
fi

BACKUP_PATH="$1"

echo "=== Claude Code 恢复脚本 ==="
echo "备份目录: $BACKUP_PATH"

# 1. 停止服务
echo "停止服务..."
systemctl stop claude-code

# 2. 恢复配置文件
echo "恢复配置文件..."
tar -xzf $BACKUP_PATH/config.tar.gz -C /

# 3. 恢复数据库
echo "恢复数据库..."
# psql -U claude -d claude_code < $BACKUP_PATH/database.sql

# 4. 恢复用户数据
echo "恢复用户数据..."
tar -xzf $BACKUP_PATH/user-data.tar.gz -C /

# 5. 启动服务
echo "启动服务..."
systemctl start claude-code

# 6. 验证恢复
echo "验证恢复..."
sleep 5
if systemctl is-active --quiet claude-code; then
    echo "服务启动成功！"
else
    echo "服务启动失败！"
    exit 1
fi

echo "恢复完成！"

34.3.8 性能优化

缓存策略

python

# cache_manager.py
import redis
import json
from datetime import datetime, timedelta

class CacheManager:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

    def cache_api_response(self, key, response, ttl=3600):
        """缓存 API 响应"""
        self.redis.setex(key, ttl, json.dumps(response))

    def get_cached_response(self, key):
        """获取缓存的响应"""
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def cache_token_count(self, user_id, count, ttl=86400):
        """缓存令牌计数"""
        key = f"tokens:{user_id}:{datetime.now().strftime('%Y-%m-%d')}"
        self.redis.incrby(key, count)
        self.redis.expire(key, ttl)

    def get_token_count(self, user_id):
        """获取令牌计数"""
        key = f"tokens:{user_id}:{datetime.now().strftime('%Y-%m-%d')}"
        count = self.redis.get(key)
        return int(count) if count else 0

    def cache_model_response(self, model, prompt_hash, response, ttl=7200):
        """缓存模型响应"""
        key = f"model:{model}:{prompt_hash}"
        self.redis.setex(key, ttl, json.dumps(response))

    def get_cached_model_response(self, model, prompt_hash):
        """获取缓存的模型响应"""
        key = f"model:{model}:{prompt_hash}"
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

# 使用示例
cache = CacheManager()

# 缓存 API 响应
cache.cache_api_response("api:user:123:profile", {"name": "John"}, ttl=3600)

# 获取缓存的响应
cached = cache.get_cached_response("api:user:123:profile")

负载均衡配置

nginx

# nginx.conf
upstream claude_code_backend {
    least_conn;
    server claude-code-1:8080 weight=3;
    server claude-code-2:8080 weight=2;
    server claude-code-3:8080 weight=1;

    keepalive 32;
}

server {
    listen 80;
    server_name claude-code.company.com;

    # 重定向到 HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name claude-code.company.com;

    ssl_certificate /etc/nginx/ssl/claude-code.crt;
    ssl_certificate_key /etc/nginx/ssl/claude-code.key;

    # SSL 配置
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # 日志
    access_log /var/log/nginx/claude-code-access.log;
    error_log /var/log/nginx/claude-code-error.log;

    # 代理配置
    location / {
        proxy_pass http://claude_code_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # 超时配置
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # 缓冲配置
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;

        # 健康检查
        health_check interval=10s fails=3 passes=2;
    }

    # 健康检查端点
    location /health {
        proxy_pass http://claude_code_backend/health;
        access_log off;
    }
}

34.3.9 小结

本节介绍了企业级监控和维护的各个方面，包括：

监控体系概述和监控维度
指标收集（Prometheus、自定义导出器）
告警配置（Prometheus、Alertmanager）
可视化仪表板（Grafana）
日志分析（ELK Stack）
维护策略（定期维护、健康检查）
灾难恢复（备份和恢复）
性能优化（缓存、负载均衡）

通过建立完善的监控和维护体系，企业可以确保 Claude Code 在生产环境中的稳定运行，及时发现和解决问题，优化性能和成本控制。

34.3 企业级监控与维护 ​

34.3.1 监控体系概述 ​

监控的重要性 ​

监控维度 ​

34.3.2 指标收集 ​

Prometheus 配置 ​

自定义指标导出器 ​

日志收集配置 ​

34.3.3 告警配置 ​

Prometheus 告警规则 ​

Alertmanager 配置 ​

34.3.4 可视化仪表板 ​

Grafana 仪表板配置 ​

34.3.5 日志分析 ​

ELK Stack 配置 ​

34.3.6 维护策略 ​

定期维护任务 ​

健康检查脚本 ​

34.3.7 灾难恢复 ​

备份策略 ​

恢复脚本 ​

34.3.8 性能优化 ​

缓存策略 ​

负载均衡配置 ​

34.3.9 小结 ​

34.3 企业级监控与维护

34.3.1 监控体系概述

监控的重要性

监控维度

34.3.2 指标收集

Prometheus 配置

自定义指标导出器

日志收集配置

34.3.3 告警配置

Prometheus 告警规则

Alertmanager 配置

34.3.4 可视化仪表板

Grafana 仪表板配置

34.3.5 日志分析

ELK Stack 配置

34.3.6 维护策略

定期维护任务

健康检查脚本

34.3.7 灾难恢复

备份策略

恢复脚本

34.3.8 性能优化

缓存策略

负载均衡配置

34.3.9 小结