🚀 Quick install

Copy and run the following command to install this Skill:

npx @anthropic-ai/skills install supercent-io/skills-template/monitoring-observability

💡 Tip: requires Node.js and npm

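Monitoring & Observability

When to use this skill

Before production deployment: set up the essential monitoring stack
Performance problems: identify bottlenecks
Incident response: locate the root cause quickly
SLA compliance: track availability and response time

Instructions

Step 1: Metrics collection (Prometheus)

Application instrumentation (Node.js):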

import express from 'express';
import promClient from 'prom-client';

const app = express();

// Default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware that tracks each request
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
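    const labels = {
      method: req.method,
      // Use the matched route pattern; raw paths such as /users/123 would
      // create one time series per URL (high cardinality).
      route: req.route?.path ?? 'unmatched',
      status_code: res.statusCode
    };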

    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });

  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(3000);
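
The histogram above is exposed as a set of cumulative `_bucket` series, which is what the alert rules in Step 2 query. The following is a minimal, dependency-free sketch of that bookkeeping. `SketchHistogram` and the bucket bounds are hypothetical names and values chosen for illustration; this is not prom-client's implementation.

```typescript
// Hypothetical sketch of a cumulative histogram (not prom-client's code).
class SketchHistogram {
  private readonly buckets: { le: number; count: number }[];
  private total = 0;

  constructor(upperBounds: number[]) {
    // Keep bounds sorted so the snapshot reads from smallest to largest.
    this.buckets = [...upperBounds]
      .sort((a, b) => a - b)
      .map((le) => ({ le, count: 0 }));
  }

  observe(value: number): void {
    this.total += 1;
    // Buckets are cumulative: an observation increments every bucket
    // whose upper bound (le) is greater than or equal to the value.
    for (const bucket of this.buckets) {
      if (value <= bucket.le) bucket.count += 1;
    }
  }

  snapshot(): { le: string; count: number }[] {
    const finite = this.buckets.map((b) => ({ le: String(b.le), count: b.count }));
    // The implicit +Inf bucket always equals the total observation count.
    return [...finite, { le: '+Inf', count: this.total }];
  }
}

// Assumed bounds in seconds; four sample request durations.
const sketch = new SketchHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2.4].forEach((seconds) => sketch.observe(seconds));
console.log(sketch.snapshot());
```

Because each bucket counts everything at or below its bound, a slow request (2.4 s here) only shows up in the `+Inf` bucket, which is why the bounds should bracket the latencies you care about.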

prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

Step 2: Alerting rules

alert_rules.yml:

groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      # Slow response time
      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time"
          description: "95th percentile latency is {{ $value }}s"

      # Pod down
      - alert: PodDown
        expr: up{job="my-app"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"
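
The `SlowResponseTime` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below approximates that idea with linear interpolation inside the bucket that contains the target rank. `estimateQuantile` and the sample counts are illustrative assumptions, and Prometheus handles more edge cases than this does.

```typescript
// Cumulative bucket: `count` observations were <= `le` (le may be Infinity).
interface Bucket {
  le: number;
  count: number;
}

// Hypothetical approximation of histogram_quantile for 0 < q <= 1.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  if (sorted.length < 2) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = sorted.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (idx === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lowerBound = idx === 0 ? 0 : sorted[idx - 1].le;
  const lowerCount = idx === 0 ? 0 : sorted[idx - 1].count;
  const inBucket = sorted[idx].count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly within the bucket.
  const fraction = (rank - lowerCount) / inBucket;
  return lowerBound + (sorted[idx].le - lowerBound) * fraction;
}

// Assumed sample data: 100 requests, bounds in seconds.
const sampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, sampleBuckets)); // roughly 0.81 s, below the 1 s threshold
```

The estimate can never be more precise than the bucket layout, so a threshold such as `> 1` works best when one of the bucket bounds sits at or near 1 second.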

Step 3: Log aggregation (structured logging)

Winston (Node.js):

import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-app',
    environment: process.env.NODE_ENV
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

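// Usage examples (`err` is assumed to be an Error caught elsewhere)
logger.info('User logged in', { userId: '123', ip: '1.2.3.4' });
logger.error('Database connection failed', { error: err.message, stack: err.stack });

// Express middleware
app.use((req, res, next) => {
  logger.info('HTTP request', {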
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.get('user-agent')
  });
  next();
});

Step 4: Grafana dashboard

dashboard.json (example):

{
  "dashboard": {
    "title": "Application metrics",
    "panels": [
      {
        "title": "Request rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Error rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
            "legendFormat": "Errors"
          }
        ]
      },
      {
        "title": "Response time (95th percentile)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "CPU usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[5m]) * 100"
          }
        ]
      }
    ]
  }
}

Step 5: Health checks

Advanced health check:

interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latency?: number; error?: string };
    redis: { status: string; latency?: number };
    externalApi: { status: string; latency?: number };
  };
}

app.get('/health', async (req, res) => {
  const health: HealthStatus = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: 'unknown' },
      redis: { status: 'unknown' },
      externalApi: { status: 'unknown' }
    }
  };

  // Database check (critical dependency)
  try {
    const dbStart = Date.now();
    await db.raw('SELECT 1');
    health.checks.database = {
      status: 'healthy',
      latency: Date.now() - dbStart
    };
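  } catch (error) {
    health.status = 'unhealthy';
    health.checks.database = {
      status: 'unhealthy',
      // `error` is `unknown` under strict TypeScript, so narrow it first
      error: error instanceof Error ? error.message : String(error)
    };
  }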

  // Redis check (non-critical dependency)
  try {
    const redisStart = Date.now();
    await redis.ping();
    health.checks.redis = {
      status: 'healthy',
      latency: Date.now() - redisStart
    };
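  } catch (error) {
    // A Redis failure only degrades the service; never overwrite 'unhealthy'
    if (health.status === 'healthy') {
      health.status = 'degraded';
    }
    health.checks.redis = { status: 'unhealthy' };
  }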

  // 'healthy' and 'degraded' both return 200; only 'unhealthy' returns 503
  const statusCode = health.status === 'unhealthy' ? 503 : 200;
  res.status(statusCode).json(health);
});
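
A dependency that hangs can stall the whole `/health` endpoint, so each probe usually needs a time limit. The following sketch wraps a probe in a timeout and combines results the same way the handler above does (critical failures mean unhealthy, optional failures mean degraded). `runCheck`, `overallStatus`, and the timeout values are hypothetical helpers, not part of Express or any client library.

```typescript
type CheckResult = {
  status: 'healthy' | 'unhealthy';
  latency?: number;
  error?: string;
};

// Run one probe, failing it if it does not settle within timeoutMs.
async function runCheck(
  probe: () => Promise<unknown>,
  timeoutMs: number
): Promise<CheckResult> {
  const started = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });

  try {
    await Promise.race([probe(), timeout]);
    return { status: 'healthy', latency: Date.now() - started };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { status: 'unhealthy', error: message };
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// Critical failures win over optional ones.
function overallStatus(
  critical: CheckResult[],
  optional: CheckResult[]
): 'healthy' | 'degraded' | 'unhealthy' {
  if (critical.some((c) => c.status === 'unhealthy')) return 'unhealthy';
  if (optional.some((c) => c.status === 'unhealthy')) return 'degraded';
  return 'healthy';
}
```

In the handler above, a call such as `runCheck(() => db.raw('SELECT 1'), 2000)` could replace the hand-written try/catch for the database, with the 2-second limit being an assumed value to tune per dependency.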

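Output format

Monitoring dashboard configuration

Golden signals:
1. Latency (response time)
   - 50th, 95th, and 99th percentiles
   - Broken down by API endpoint

2. Traffic (request volume)
   - Requests per second
   - Broken down by endpoint and status code

3. Errors (error rate)
   - 5xx error rate
   - 4xx error rate
   - Broken down by error type

4. Saturation (resource utilization)
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network bandwidth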
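
To make the first three signals concrete, here is a small sketch that derives latency percentiles, traffic, and error rates from a batch of request records. The `RequestRecord` shape and the `summarize` helper are assumptions for illustration; in practice these numbers come from Prometheus queries, and saturation needs host-level metrics that this sketch leaves out.

```typescript
interface RequestRecord {
  durationMs: number;
  statusCode: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues: number[], p: number): number {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sortedValues.length);
  return sortedValues[Math.max(rank, 1) - 1];
}

function summarize(records: RequestRecord[], windowSeconds: number) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const total = records.length;
  const count5xx = records.filter((r) => r.statusCode >= 500).length;
  const count4xx = records.filter(
    (r) => r.statusCode >= 400 && r.statusCode < 500
  ).length;

  return {
    // Latency
    latencyMs: {
      p50: percentile(durations, 50),
      p95: percentile(durations, 95),
      p99: percentile(durations, 99),
    },
    // Traffic
    requestsPerSecond: total / windowSeconds,
    // Errors
    errorRate5xx: total === 0 ? 0 : count5xx / total,
    errorRate4xx: total === 0 ? 0 : count4xx / total,
  };
}
```

Grouping the records by endpoint before calling `summarize` would give the per-endpoint breakdown listed above.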

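Constraints

Required rules (MUST)

Structured logging: emit logs in JSON format
Metric labels: keep them unique and bounded (watch out for high cardinality)
Prevent alert fatigue: send only critical alerts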
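
One common source of high cardinality is putting raw URLs into a `route` label. As a sketch of one way to bound the label values, the function below replaces numeric and UUID path segments with placeholders. `normalizeRoute` and its placeholder names are hypothetical, and real applications may need additional patterns.

```typescript
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Collapse identifier-like path segments so each endpoint maps to one label value.
function normalizeRoute(path: string): string {
  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ':id';
      if (UUID_PATTERN.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizeRoute('/users/123/orders/456?page=2')); // /users/:id/orders/:id
```

With this in place, a million distinct user IDs still produce a single `route` value instead of a million time series.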

Prohibited (MUST NOT)

Logging sensitive data: never log passwords or API keys
Excessive metrics: unnecessary metrics waste resources
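
One way to enforce the first rule is to redact known sensitive keys before metadata reaches the logger. The sketch below walks plain JSON-like data and masks matching keys. The key list and the `redact` helper are assumptions for illustration; it does not handle cyclic objects or class instances such as `Date`.

```typescript
// Assumed list of sensitive key names, compared case-insensitively.
const SENSITIVE_KEYS = new Set([
  'password',
  'apikey',
  'api_key',
  'token',
  'authorization',
  'secret',
]);

// Return a deep copy with sensitive values replaced by a marker.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? '[REDACTED]'
        : redact(inner);
    }
    return out;
  }
  return value;
}

console.log(redact({ userId: '123', password: 'hunter2' }));
```

A call such as `logger.info('User logged in', redact(meta))` would then keep credentials out of the log files, assuming `meta` is a plain object.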

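Best practices

Define SLOs: state service level objectives explicitly
Write runbooks: document the response procedure for every alert
Dashboards: tailor dashboards to the team's needs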
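
An SLO becomes actionable once it is turned into an error budget. The following sketch shows the arithmetic for a time-based budget and a request-based one. The function names and the 99.9% / 30-day figures are illustrative assumptions, not values prescribed by this skill.

```typescript
// Time-based budget: how much downtime the SLO tolerates in the window.
function errorBudget(sloTarget: number, windowDays: number) {
  const windowMinutes = windowDays * 24 * 60;
  const budgetFraction = 1 - sloTarget;
  return {
    budgetFraction,
    allowedDowntimeMinutes: windowMinutes * budgetFraction,
  };
}

// Request-based budget: fraction of the budget still unspent.
// A negative result means the SLO has already been violated.
function budgetRemaining(
  sloTarget: number,
  totalRequests: number,
  failedRequests: number
): number {
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures === 0) return 0;
  return 1 - failedRequests / allowedFailures;
}

// A 99.9% target over 30 days allows about 43.2 minutes of downtime.
console.log(errorBudget(0.999, 30).allowedDowntimeMinutes.toFixed(1));
```

Alerting on how fast the remaining budget shrinks, rather than on every individual error, is one way to support the rule about avoiding alert fatigue.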

Metadata

Version

Current version: 1.0.0
Last updated: 2025-01-01
Compatible platforms: Claude, ChatGPT, Gemini

Examples

Example 1: Basic usage

Example 2: Advanced usage

📄 Original documentation

Full documentation (English):

https://skills.sh/supercent-io/skills-template/monitoring-observability
