🚀 Quick install
Copy and run the following command to install this Skill:
npx @anthropic-ai/skills install supercent-io/skills-template/monitoring-observability
💡 Note: requires Node.js and npm
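Monitoring and Observability
When to use this skill
- Before a production deployment: set up the monitoring you need
- Performance problems: identify bottlenecks
- Incident response: locate the root cause quickly
- SLA compliance: track availability and response time
Instructions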
Step 1: Metrics collection (Prometheus)
Application instrumentation (Node.js):
import express from 'express';
import promClient from 'prom-client';

const app = express();

// Default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics();

// Custom metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code']
});
const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Middleware that tracks every request
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const labels = {
      method: req.method,
      // Prefer the matched route pattern; falling back to the raw path
      // can produce many distinct label values for unmatched URLs.
      route: req.route?.path || req.path,
      status_code: res.statusCode
    };
    httpRequestDuration.observe(labels, duration);
    httpRequestTotal.inc(labels);
  });
  next();
});

// Metrics endpoint scraped by Prometheus
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

app.listen(3000);
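The middleware labels each request with `req.route?.path || req.path`, so any request that matches no route is labeled with its raw URL. The following is a minimal sketch of one way to keep the `route` label bounded. The helper name `normalizeRoute` and the `:id` placeholder are assumptions made for illustration; they are not part of the original skill or of prom-client.

```typescript
// Hypothetical helper: derive a low-cardinality `route` label.
// A matched Express route pattern (e.g. "/users/:id") is used as-is; for raw
// paths, numeric and UUID segments are collapsed into a ":id" placeholder.
const NUMERIC_ID = /^\d+$/;
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizeRoute(routePattern: string | undefined, rawPath: string): string {
  if (routePattern) return routePattern;

  // Drop the query string, if any, before splitting into segments.
  const queryStart = rawPath.indexOf('?');
  const pathOnly = queryStart === -1 ? rawPath : rawPath.slice(0, queryStart);

  const normalized = pathOnly
    .split('/')
    .map((segment) => (NUMERIC_ID.test(segment) || UUID.test(segment) ? ':id' : segment))
    .join('/');

  return normalized || '/';
}

// Matched route: the pattern wins.
const matched = normalizeRoute('/users/:id', '/users/42');
// Unmatched route: identifiers in the raw path are collapsed.
const unmatched = normalizeRoute(undefined, '/users/42/orders/7?page=2');
```

In the middleware this would replace the label expression with something like `normalizeRoute(req.route?.path, req.path)`. It does not bound arbitrary probe URLs, so mapping every unmatched request to a single fixed label value may be the safer choice for public-facing services.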
prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'my-app'
    static_configs:
      - targets: ['localhost:3000']
    metrics_path: '/metrics'
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - 'alert_rules.yml'
Step 2: Alerting rules
alert_rules.yml:
groups:
  - name: application_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          # $value is a ratio (0 to 1), so format it as a percentage
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"
      # Slow response time
      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time"
          description: "95th percentile is {{ $value }} seconds"
      # Pod down
      - alert: PodDown
        expr: up{job="my-app"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod is down"
          description: "{{ $labels.instance }} has been down for more than 2 minutes"
      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
          ) / node_memory_MemTotal_bytes > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          # $value is a ratio (0 to 1), so format it as a percentage
          description: "Memory usage is {{ $value | humanizePercentage }}"
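The `SlowResponseTime` rule relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below illustrates the idea under simplifying assumptions: it interpolates linearly inside the bucket that contains the target rank, treats the lowest bucket as starting at 0, and returns the highest finite bound when the rank lands in the `+Inf` bucket. The `Bucket` type, the function name, and the sample counts are made up for illustration; this is not Prometheus source code.

```typescript
// One cumulative bucket, mirroring a `*_bucket` series: `count` is the number
// of observations with a value <= `le`. Buckets must be sorted by `le`,
// and the last one must be the +Inf bucket.
interface Bucket {
  le: number;
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  // Rank of the observation we are looking for.
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls into the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Linear interpolation inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Illustrative data: 100 requests, bucket bounds in seconds.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];

// Rank 95 lies in the (0.5, 1] bucket, 5/8 of the way through it.
const p95 = histogramQuantile(0.95, exampleBuckets);
```

With these counts the estimate is roughly 0.81 seconds, so the rule's `> 1` threshold would not fire. The result can never be more precise than the bucket layout allows, which is why bucket bounds are usually chosen close to the alert threshold.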
Step 3: Log aggregation (structured logging)
Winston (Node.js):
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'my-app',
    environment: process.env.NODE_ENV
  },
  transports: [
    // Human-readable, colorized output on the console
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    // Errors only
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    // Everything at the configured level and above
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

// Usage examples (`err` stands for an Error caught elsewhere)
logger.info('User logged in', { userId: '123', ip: '1.2.3.4' });
logger.error('Database connection failed', { error: err.message, stack: err.stack });

// Express middleware (`app` is the Express app from step 1)
app.use((req, res, next) => {
  logger.info('HTTP request', {
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.get('user-agent')
  });
  next();
});
Step 4: Grafana dashboard
dashboard.json (example):
{
  "dashboard": {
    "title": "Application metrics",
    "panels": [
      {
        "title": "Request rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{method}} {{route}}"
          }
        ]
      },
      {
        "title": "Error rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status_code=~\"5..\"}[5m])",
            "legendFormat": "Errors"
          }
        ]
      },
      {
        "title": "Response time (95th percentile)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
          }
        ]
      },
      {
        "title": "CPU usage",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total[5m]) * 100"
          }
        ]
      }
    ]
  }
}
Step 5: Health checks
Advanced health check:
interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  uptime: number;
  checks: {
    database: { status: string; latency?: number; error?: string };
    redis: { status: string; latency?: number };
    externalApi: { status: string; latency?: number };
  };
}

// `app`, `db` (for example a Knex instance), and `redis` are assumed to be
// initialized elsewhere.
app.get('/health', async (req, res) => {
  const health: HealthStatus = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    checks: {
      database: { status: 'unknown' },
      redis: { status: 'unknown' },
      externalApi: { status: 'unknown' }
    }
  };

  // Database check: a failure makes the service unhealthy
  try {
    const dbStart = Date.now();
    await db.raw('SELECT 1');
    health.checks.database = {
      status: 'healthy',
      latency: Date.now() - dbStart
    };
  } catch (error) {
    health.status = 'unhealthy';
    health.checks.database = {
      status: 'unhealthy',
      error: error instanceof Error ? error.message : String(error)
    };
  }

  // Redis check: a failure only degrades the service
  try {
    const redisStart = Date.now();
    await redis.ping();
    health.checks.redis = {
      status: 'healthy',
      latency: Date.now() - redisStart
    };
  } catch (error) {
    // Do not downgrade an 'unhealthy' result to 'degraded'
    if (health.status === 'healthy') health.status = 'degraded';
    health.checks.redis = { status: 'unhealthy' };
  }

  // The external API check is not implemented in this example, so it stays 'unknown'.

  // 'degraded' still returns 200; only 'unhealthy' returns 503
  const statusCode = health.status === 'unhealthy' ? 503 : 200;
  res.status(statusCode).json(health);
});
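The handler distinguishes between a dependency whose failure takes the service down (the database) and one whose failure only degrades it (Redis). As a rough sketch, that rule can be pulled out into a pure function that is easy to unit test. The `critical` flag and the names `CheckResult`, `overallStatus`, and `httpStatusFor` are assumptions introduced for this example; they do not appear in the original.

```typescript
type CheckStatus = 'healthy' | 'unhealthy' | 'unknown';
type OverallStatus = 'healthy' | 'degraded' | 'unhealthy';

interface CheckResult {
  name: string;
  status: CheckStatus;
  // Hypothetical flag: true when the service cannot work without this dependency.
  critical: boolean;
}

// A failing critical check wins over everything else; a failing non-critical
// check only degrades. 'unknown' checks are ignored, as in the handler.
function overallStatus(checks: CheckResult[]): OverallStatus {
  if (checks.some((c) => c.critical && c.status === 'unhealthy')) return 'unhealthy';
  if (checks.some((c) => c.status === 'unhealthy')) return 'degraded';
  return 'healthy';
}

// Only 'unhealthy' is reported as 503; 'degraded' still answers 200.
function httpStatusFor(status: OverallStatus): number {
  return status === 'unhealthy' ? 503 : 200;
}

// Both dependencies fail: the critical one decides the outcome.
const bothDown = overallStatus([
  { name: 'database', status: 'unhealthy', critical: true },
  { name: 'redis', status: 'unhealthy', critical: false },
]);
```

Because the result does not depend on the order in which checks ran, a later non-critical failure cannot mask an earlier critical one.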
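Output format
Monitoring dashboard configuration
Golden signals:
1. Latency (response time)
   - 50th, 95th, and 99th percentiles
   - Broken down by API endpoint
2. Traffic (request volume)
   - Requests per second
   - Broken down by endpoint and by status code
3. Errors (error rate)
   - 5xx error rate
   - 4xx error rate
   - Broken down by error type
4. Saturation (resource utilization)
   - CPU usage
   - Memory usage
   - Disk I/O
   - Network bandwidth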
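To make the first three signals concrete, here is a small sketch that derives latency percentiles, traffic, and error ratios from a window of request records. The record shape, the function names, and the nearest-rank percentile method are assumptions chosen for illustration. Saturation is left out because it comes from resource metrics rather than from request data.

```typescript
interface RequestRecord {
  durationMs: number;
  statusCode: number;
}

interface SignalSummary {
  requestsPerSecond: number;
  serverErrorRatio: number; // share of 5xx responses
  clientErrorRatio: number; // share of 4xx responses
  p50Ms: number;
  p95Ms: number;
  p99Ms: number;
}

// Nearest-rank percentile over an ascending, non-empty array (p in 1..100).
function nearestRank(sortedAsc: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1];
}

function summarize(records: RequestRecord[], windowSeconds: number): SignalSummary {
  if (records.length === 0 || windowSeconds <= 0) {
    throw new Error('need at least one record and a positive window');
  }
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);
  const count5xx = records.filter((r) => r.statusCode >= 500).length;
  const count4xx = records.filter((r) => r.statusCode >= 400 && r.statusCode < 500).length;

  return {
    requestsPerSecond: records.length / windowSeconds,
    serverErrorRatio: count5xx / records.length,
    clientErrorRatio: count4xx / records.length,
    p50Ms: nearestRank(durations, 50),
    p95Ms: nearestRank(durations, 95),
    p99Ms: nearestRank(durations, 99),
  };
}

// Illustrative window: 100 requests in 50 seconds with durations of 1..100 ms;
// the first 90 succeed, the next 6 return 404, and the last 4 return 500.
const sampleRecords: RequestRecord[] = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  statusCode: i < 90 ? 200 : i < 96 ? 404 : 500,
}));
const summary = summarize(sampleRecords, 50);
```

In a Prometheus setup these numbers would normally come from PromQL queries over the counter and histogram defined in step 1; the sketch only shows what each signal measures.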
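Constraints
Required rules (MUST)
- Structured logging: emit logs in JSON format
- Metric labels: keep label values to a small, bounded set (watch out for high cardinality)
- Prevent alert fatigue: send only critical alerts
Prohibited (MUST NOT)
- Do not log sensitive data: never log passwords or API keys
- Excessive metrics: unnecessary metrics waste resources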
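One way to enforce the rule against logging sensitive data is to scrub metadata before it reaches the logger. The sketch below is a minimal example under stated assumptions: the key list and the `[REDACTED]` placeholder are illustrative choices, the function is not part of Winston, and it only handles plain JSON-like data without circular references.

```typescript
// Illustrative deny-list; extend it to match the fields your service handles.
const SENSITIVE_KEYS = new Set([
  'password',
  'apikey',
  'api_key',
  'token',
  'authorization',
  'secret',
]);

// Returns a deep copy of `value` with sensitive fields masked.
// Keys are compared case-insensitively.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === 'object') {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value;
}

const safeMeta = redact({
  userId: '123',
  password: 'hunter2',
  nested: { apiKey: 'abc', list: [{ token: 't' }] },
});
```

A call such as `logger.info('User logged in', redact(meta))` would then keep identifiers like `userId` while masking credentials. A deny-list only catches the keys it knows about, so logging an explicit allow-list of fields is the stricter alternative.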
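Best practices
- Define SLOs: state your service level objectives explicitly
- Write runbooks: document the response procedure for every alert
- Dashboards: tailor dashboards to what the team needs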
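An SLO becomes actionable once it is turned into an error budget, that is, the amount of failure the objective still permits. The following sketch shows the arithmetic; the function names and the example figures (a 99.9% target over 30 days, a 99% target over 10,000 requests) are assumptions picked for illustration.

```typescript
// Downtime allowed by an availability SLO over a rolling window, in minutes.
// `sloTarget` is a fraction, e.g. 0.999 for "99.9%".
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be between 0 and 1 (exclusive)');
  }
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of a request-based error budget that is still unspent.
// A negative result means the budget is exhausted.
function budgetRemaining(sloTarget: number, totalRequests: number, failedRequests: number): number {
  const allowedFailures = (1 - sloTarget) * totalRequests;
  if (allowedFailures === 0) return 0;
  return 1 - failedRequests / allowedFailures;
}

// 99.9% over 30 days leaves about 43.2 minutes of downtime.
const monthlyBudget = errorBudgetMinutes(0.999, 30);
// 99% over 10,000 requests allows about 100 failures; 25 failures leave about 75%.
const remaining = budgetRemaining(0.99, 10000, 25);
```

Alert thresholds can then be framed in terms of how quickly the budget is being spent instead of fixed error-rate cutoffs.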
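References
Metadata
Version
- Current version: 1.0.0
- Last updated: 2025-01-01
- Compatible platforms: Claude, ChatGPT, Gemini
Related skills
- deployment: monitoring related to deployments
- security: monitoring of security events
Tags
#monitoring #observability #Prometheus #Grafana #logging #metrics #infrastructure
Examples
Example 1: Basic usage
Example 2: Advanced usage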
📄 Original documentation
Full documentation (English):
https://skills.sh/supercent-io/skills-template/monitoring-observability