🚀 快速安装

复制以下命令并运行,立即安装此 Skill:

npx @anthropic-ai/skills install supercent-io/skills-template/monitoring-observability

💡 提示:需要 Node.js 和 NPM

监控与可观测性

何时使用此技能

  • 生产部署前:设置必要的监控系统
  • 性能问题:识别瓶颈
  • 事件响应:快速定位根本原因
  • SLA 合规:跟踪可用性/响应时间

指示

步骤 1:指标收集(Prometheus)

应用仪表化(Node.js):

import express from 'express';
import promClient from 'prom-client';

const app = express();

// 默认指标(CPU、内存等)
promClient.collectDefaultMetrics();

// 自定义指标
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP 请求持续时间(秒)',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'HTTP 请求总数',
  labelNames: ['method', 'route', 'status_code']
});

// 用于跟踪请求的中间件
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });

  next();
});

// 指标端点
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(3000);

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

步骤 2:告警规则

alert_rules.yml

groups:
  - name: application_alerts
    interval: 30s
    rules:
      # 高错误率
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "检测到高错误率"
          description: "错误率为 {{ $value }}%(阈值:5%)"

      # 响应时间慢
      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "响应时间慢"
          description: "第 95 百分位数为 {{ $value }}秒"

      # Pod 宕机
      - alert: PodDown
        expr: up{job="my-app"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod 已宕机"
          description: "{{ $labels.instance }} 已宕机超过 2 分钟"

      # 内存使用率高
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率高"
          description: "内存使用率为 {{ $value }}%"

步骤 3:日志聚合(结构化日志)

Winston(Node.js)

import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-app',
    environment: process.env.NODE_ENV
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

// 使用示例
logger.info('用户已登录', { userId: '123', ip: '1.2.3.4' });
logger.error('数据库连接失败', { error: err.message, stack: err.stack });

// Express 中间件
app.use((req, res, next) => {
  logger.info('HTTP 请求', {
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.get('user-agent')
  });
  next();
});

步骤 4:Grafana 仪表板

dashboard.json(示例):

{
  "dashboard": {
    "title": "应用指标",
    "panels": [
      {
        "title": "请求速率",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "错误率",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
            "legendFormat": "错误"
          }
        ]
      },
      {
        "title": "响应时间(第 95 百分位)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "CPU 使用率",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[5m]) * 100"
          }
        ]
      }
    ]
  }
}

步骤 5:健康检查

高级健康检查

interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latency?: number; error?: string };
    redis: { status: string; latency?: number };
    externalApi: { status: string; latency?: number };
  };
}

app.get('/health', async (req, res) => {
  const startTime = Date.now();
  const health: HealthStatus = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: 'unknown' },
      redis: { status: 'unknown' },
      externalApi: { status: 'unknown' }
    }
  };

  // 数据库检查
  try {
    const dbStart = Date.now();
    await db.raw('SELECT 1');
    health.checks.database = {
      status: 'healthy',
      latency: Date.now() - dbStart
    };
  } catch (error) {
    health.status = 'unhealthy';
    health.checks.database = {
      status: 'unhealthy',
      error: error.message
    };
  }

  // Redis 检查
  try {
    const redisStart = Date.now();
    await redis.ping();
    health.checks.redis = {
      status: 'healthy',
      latency: Date.now() - redisStart
    };
  } catch (error) {
    health.status = 'degraded';
    health.checks.redis = { status: 'unhealthy' };
  }

  const statusCode = health.status === 'healthy' ? 200 : health.status === 'degraded' ? 200 : 503;
  res.status(statusCode).json(health);
});

输出格式

监控仪表板配置

黄金信号:
1. 延迟(响应时间)
   - 第 50、第 95、第 99 百分位数
   - 按 API 端点区分

2. 流量(请求量)
   - 每秒请求数
   - 按端点、按状态码区分

3. 错误(错误率)
   - 5xx 错误率
   - 4xx 错误率
   - 按错误类型区分

4. 饱和度(资源利用率)
   - CPU 使用率
   - 内存使用率
   - 磁盘 I/O
   - 网络带宽

约束条件

必需规则(必须遵守)

  1. 结构化日志:JSON 格式的日志
  2. 指标标签:保持唯一性(注意高基数问题)
  3. 防止告警疲劳:仅发送关键告警

禁止事项(不得违反)

  1. 不记录敏感数据:切勿记录密码、API 密钥
  2. 过度指标:不必要的指标会浪费资源

最佳实践

  1. 定义 SLO:明确定义服务级别目标
  2. 编写运行手册:为每个告警记录响应流程
  3. 仪表板:根据团队需要定制仪表板

参考资料

元数据

版本

  • 当前版本:1.0.0
  • 最后更新:2025-01-01
  • 兼容平台:Claude, ChatGPT, Gemini

相关技能

标签

#监控 #可观测性 #Prometheus #Grafana #日志 #指标 #基础设施

示例

示例 1:基本用法

示例 2:高级用法

📄 原始文档

完整文档(英文):

https://skills.sh/supercent-io/skills-template/monitoring-observability

💡 提示:点击上方链接查看 skills.sh 原始英文文档,方便对照翻译。

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。