Advanced Monitoring & Alerting Guide
Date: 2025-01-27
Purpose: Guide for advanced monitoring and alerting setup
Status: Complete
Overview
This guide covers metrics collection, dashboards, alerting rules, log aggregation, and distributed tracing for the integrated workspace.
Monitoring Stack
Components
- Prometheus - Metrics collection
- Grafana - Visualization and dashboards
- Loki - Log aggregation
- Alertmanager - Alert routing
- Jaeger - Distributed tracing
Metrics Collection
Application Metrics
Custom Metrics
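import { Counter, Histogram } from 'prom-client';

// Counts every HTTP request, labeled by method, route, and response status.
const requestCounter = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// Records request latency in seconds. No buckets are configured here,
// so prom-client's default buckets apply.
const requestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route'],
});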
Business Metrics
- Transaction volume
- User activity
- Revenue metrics
- Conversion rates
Infrastructure Metrics
System Metrics
- CPU usage
- Memory usage
- Disk I/O
- Network traffic
Kubernetes Metrics
- Pod status
- Resource usage
- Node health
- Cluster capacity
Dashboards
Application Dashboard
Key Panels:
- Request rate
- Response times (p50, p95, p99)
- Error rates
- Active users
Infrastructure Dashboard
Key Panels:
- Resource utilization
- Pod status
- Node health
- Network traffic
Business Dashboard
Key Panels:
- Transaction volume
- Revenue metrics
- User activity
- Conversion rates
Alerting Rules
Critical Alerts
groups:
  - name: critical
    rules:
      # Compares the per-second rate of 5xx responses of each series
      # against 0.1 requests/second (an absolute rate, not a ratio).
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
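The `for:` field in the rules above delays firing until the expression has stayed true for the whole duration. The following is a minimal sketch of that pending-then-firing behavior, not Prometheus's actual implementation; `AlertTracker` and `evaluateRule` are hypothetical names introduced for illustration.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

// Hypothetical bookkeeping for one alert between rule evaluations.
interface AlertTracker {
  state: AlertState;
  activeSinceMs: number | null; // when the expression first became true
}

// One evaluation step: the alert fires only once the condition has held
// continuously for at least `forMs`; any false result resets it.
function evaluateRule(
  tracker: AlertTracker,
  conditionTrue: boolean,
  nowMs: number,
  forMs: number,
): AlertTracker {
  if (!conditionTrue) {
    return { state: 'inactive', activeSinceMs: null };
  }
  const activeSinceMs = tracker.activeSinceMs ?? nowMs;
  const state: AlertState = nowMs - activeSinceMs >= forMs ? 'firing' : 'pending';
  return { state, activeSinceMs };
}
```

With `forMs` set to 60000 (matching `for: 1m` on ServiceDown), a condition that first turns true at time 0 is pending until an evaluation at or after 60000 ms, and a single false evaluation in between starts the wait over.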
Warning Alerts
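groups:
  - name: warning
    rules:
      # histogram_quantile needs the per-bucket rates (the _bucket series
      # aggregated by le), not the bare metric name.
      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"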
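The latency alert depends on a quantile estimated from cumulative histogram buckets. The sketch below shows roughly how such an estimate works: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's code; `Bucket` and `estimateQuantile` are hypothetical names, and the sample counts are made up.

```typescript
// One cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number;    // upper bound; Infinity for the catch-all bucket
  count: number; // cumulative count up to and including this bucket
}

// Estimate quantile q (0..1). Buckets must be sorted by `le`
// and end with the Infinity bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const last = buckets[buckets.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;
  let lowerBound = 0;
  let countBelow = 0;
  for (const bucket of buckets) {
    if (bucket.count >= rank) {
      // The catch-all bucket has no finite width, so fall back to the
      // highest finite bound.
      if (!isFinite(bucket.le)) return lowerBound;
      const inBucket = bucket.count - countBelow;
      if (inBucket === 0) return bucket.le;
      const fraction = (rank - countBelow) / inBucket;
      return lowerBound + (bucket.le - lowerBound) * fraction;
    }
    lowerBound = bucket.le;
    countBelow = bucket.count;
  }
  return lowerBound;
}

// Illustrative data: 100 requests, 98 of them at or under 1 second.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
const p95Seconds = estimateQuantile(0.95, exampleBuckets); // approximately 0.8125
```

Because the result is interpolated, its accuracy depends on where the bucket boundaries sit; a threshold such as 1 second is easiest to reason about when it matches a bucket bound.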
Log Aggregation
Structured Logging
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
  ],
});

logger.info('Request processed', {
  method: 'GET',
  path: '/api/users',
  status: 200,
  duration: 45,
  userId: '123',
});
Log Levels
- ERROR: Failures that require attention
- WARN: Unexpected conditions that have not yet caused a failure
- INFO: Informational messages about normal operation
- DEBUG: Detailed information for troubleshooting
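A configured level acts as a threshold: messages at that severity or higher are emitted, and the rest are dropped. The sketch below illustrates that ordering for the four levels listed above. `LEVEL_PRIORITY` and `shouldLog` are hypothetical names, and the numeric priorities are illustrative rather than any logging library's own values.

```typescript
// Lower number means higher severity (illustrative values).
const LEVEL_PRIORITY = { error: 0, warn: 1, info: 2, debug: 3 } as const;
type LogLevel = keyof typeof LEVEL_PRIORITY;

// A message is emitted when it is at least as severe as the configured level.
function shouldLog(messageLevel: LogLevel, configuredLevel: LogLevel): boolean {
  return LEVEL_PRIORITY[messageLevel] <= LEVEL_PRIORITY[configuredLevel];
}
```

For example, with the threshold set to `info`, `shouldLog('debug', 'info')` is false while `shouldLog('error', 'info')` is true.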
Distributed Tracing
OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');
const span = tracer.startSpan('process-request');
try {
  // Process request
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  // Mark the span as failed and attach the exception details.
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error as Error);
} finally {
  // Always end the span so it is exported.
  span.end();
}
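The try/catch/finally pattern above can be factored into a reusable wrapper so every traced operation sets a status and ends its span. This is a sketch under assumptions: `SpanLike`, `withSpan`, and the two status constants are hypothetical stand-ins defined here so the block runs without the OpenTelemetry package, and the wrapper handles synchronous work only.

```typescript
// Stand-ins mirroring the numeric values of SpanStatusCode.OK and
// SpanStatusCode.ERROR in '@opentelemetry/api'.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// The subset of span methods this helper relies on (hypothetical interface).
interface SpanLike {
  setStatus(status: { code: number }): unknown;
  recordException(exception: Error): unknown;
  end(): void;
}

// Runs `work`, records success or failure on the span, and always ends it.
function withSpan<T>(span: SpanLike, work: () => T): T {
  try {
    const result = work();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (error) {
    span.setStatus({ code: STATUS_ERROR });
    span.recordException(error instanceof Error ? error : new Error(String(error)));
    throw error; // let the caller decide how to handle the failure
  } finally {
    span.end();
  }
}
```

Unlike the inline example, this sketch rethrows after recording the exception, so tracing does not silently swallow errors. Asynchronous work would need an `async` variant that awaits the result before setting the status.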
Best Practices
Metrics
- Use consistent naming
- Include relevant labels
- Avoid high cardinality
- Document metrics
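One common source of high cardinality is a `route` label that carries raw URLs, where every user ID creates a new time series. The sketch below collapses variable path segments before the value is used as a label. `normalizeRoute` is a hypothetical helper, and the two patterns are illustrative assumptions about what identifiers look like.

```typescript
// Illustrative patterns for path segments that vary per request.
const NUMERIC_SEGMENT = /^\d+$/;
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace variable segments with placeholders so the label set stays small.
function normalizeRoute(path: string): string {
  // Drop the query string; it should not end up in a label value.
  const queryStart = path.indexOf('?');
  const pathname = queryStart === -1 ? path : path.slice(0, queryStart);
  return pathname
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}
```

With this helper, `/api/users/123` and `/api/users/456` both map to `/api/users/:id`. Where the web framework exposes the matched route template, using that directly is usually the simpler choice.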
Alerts
- Set appropriate thresholds
- Avoid alert fatigue
- Use alert grouping
- Test alert delivery
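Grouping reduces alert fatigue by bundling alerts that share label values into one notification. Alertmanager handles this through its `group_by` setting; the sketch below only illustrates the idea and is not how Alertmanager is implemented. `FiringAlert` and `groupAlerts` are hypothetical names.

```typescript
// Hypothetical shape of an alert: just its label set.
interface FiringAlert {
  labels: { [labelName: string]: string };
}

// Bucket alerts by the values of the chosen labels, so each bucket
// could be delivered as a single notification.
function groupAlerts(
  alerts: FiringAlert[],
  groupBy: string[],
): { [groupKey: string]: FiringAlert[] } {
  const groups: { [groupKey: string]: FiringAlert[] } = {};
  for (const firing of alerts) {
    const groupKey = groupBy
      .map((labelName) => `${labelName}=${firing.labels[labelName] ?? ''}`)
      .join(',');
    const members = groups[groupKey] ?? [];
    members.push(firing);
    groups[groupKey] = members;
  }
  return groups;
}
```

Grouping the three alerts from this guide by `severity` would yield two groups: one holding HighErrorRate and ServiceDown, and one holding HighLatency.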
Logs
- Use structured logging
- Include correlation IDs
- Don't log sensitive data
- Set appropriate levels
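Two of the practices above, correlation IDs and keeping sensitive data out of logs, can be enforced in one place before fields reach the logger. The following is a minimal sketch: `buildLogFields` is a hypothetical helper, and the list of sensitive keys is an assumption to adapt per application.

```typescript
// Assumed list of field names that must never be logged in clear text.
const SENSITIVE_KEYS: readonly string[] = ['password', 'token', 'authorization', 'cookie'];

type LogFields = { [key: string]: unknown };

// Attach the correlation ID and mask sensitive values (top-level keys only).
function buildLogFields(correlationId: string, fields: LogFields): LogFields {
  const safe: LogFields = { correlationId };
  for (const key of Object.keys(fields)) {
    const isSensitive = SENSITIVE_KEYS.indexOf(key.toLowerCase()) !== -1;
    safe[key] = isSensitive ? '[REDACTED]' : fields[key];
  }
  return safe;
}
```

The result can be passed as the metadata argument of a structured logger call such as `logger.info('Request processed', fields)`. Nested objects are not inspected in this sketch, so deeper structures would need a recursive variant.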
Last Updated: 2025-01-27