# Operational Runbook

## Table of Contents

1. [System Overview](#system-overview)
2. [Monitoring & Alerts](#monitoring--alerts)
3. [Common Operations](#common-operations)
4. [Troubleshooting](#troubleshooting)
5. [Disaster Recovery](#disaster-recovery)
6. [Emergency Contacts](#emergency-contacts)
7. [Change Management](#change-management)

## System Overview

### Architecture

- **Application**: Node.js/TypeScript Express server
- **Database**: PostgreSQL 14+
- **Cache/Sessions**: Redis (optional)
- **Metrics**: Prometheus format on `/metrics`
- **Health Check**: `/health` endpoint

### Key Endpoints

- API Base: `/api/v1`
- Terminal UI: `/`
- Health: `/health`
- Metrics: `/metrics`
- API Docs: `/api-docs`

## Monitoring & Alerts

### Key Metrics to Monitor

#### Payment Metrics

- `payments_initiated_total` - Total payments initiated
- `payments_approved_total` - Total payments approved
- `payments_completed_total` - Total payments completed
- `payments_failed_total` - Total payments failed
- `payment_processing_duration_seconds` - Processing latency

#### TLS Metrics

- `tls_connections_active` - Active TLS connections
- `tls_connection_errors_total` - TLS connection errors
- `tls_acks_received_total` - ACKs received
- `tls_nacks_received_total` - NACKs received

#### System Metrics

- `http_request_duration_seconds` - HTTP request latency
- `process_cpu_user_seconds_total` - CPU usage
- `process_resident_memory_bytes` - Memory usage

### Alert Thresholds

**Critical Alerts:**

- Payment failure rate > 5% in 5 minutes
- TLS connection errors > 10 in 1 minute
- Database connection pool exhaustion
- Health check failing

**Warning Alerts:**

- Payment processing latency p95 > 30s
- Unmatched reconciliation items > 10
- TLS circuit breaker in OPEN state

## Common Operations

### Start System

```bash
# Using npm
npm start

# Using Docker Compose
docker-compose up -d

# Verify health
curl http://localhost:3000/health
```

### Stop System

```bash
# Graceful shutdown
docker-compose down

# Or send SIGTERM to the application process
kill -TERM <PID>
```

### Check System Status

```bash
# Health check
curl http://localhost:3000/health

# Metrics
curl http://localhost:3000/metrics

# Database connection
psql "$DATABASE_URL" -c "SELECT 1"
```

### View Logs

```bash
# Application logs
tail -f logs/application-*.log

# Docker logs
docker-compose logs -f app

# Audit logs (database)
psql "$DATABASE_URL" -c "SELECT * FROM audit_logs ORDER BY timestamp DESC LIMIT 100"
```

### Run Reconciliation

```bash
# Via API
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/daily?date=2024-01-01" \
  -H "Authorization: Bearer <TOKEN>"

# Check aging items
curl -X GET "http://localhost:3000/api/v1/payments/reconciliation/aging?days=1" \
  -H "Authorization: Bearer <TOKEN>"
```

### Database Operations

```bash
# Run migrations
npm run migrate

# Rollback last migration
npm run migrate:rollback

# Seed operators
npm run seed

# Backup database
pg_dump "$DATABASE_URL" > backup_$(date +%Y%m%d).sql

# Restore database
psql "$DATABASE_URL" < backup_20240101.sql
```

## Troubleshooting

### Payment Stuck in Processing

**Symptoms:**

- Payment status is `APPROVED` but not progressing
- No ledger posting or message generation

**Diagnosis:**

```sql
-- Payments that have sat in APPROVED for more than 5 minutes
SELECT id, status, created_at, updated_at
FROM payments
WHERE status = 'APPROVED'
  AND updated_at < NOW() - INTERVAL '5 minutes';
```

**Resolution:**

1. Check application logs for errors
2. Verify compliance screening status
3. Check ledger adapter connectivity
4. Manually trigger processing if needed

### TLS Connection Issues

**Symptoms:**

- `tls_connection_errors_total` increasing
- Circuit breaker in OPEN state
- Messages not transmitting

**Diagnosis:**

```bash
# Check TLS pool stats
curl http://localhost:3000/metrics | grep tls

# Check receiver connectivity
openssl s_client -connect 172.67.157.88:443 -servername devmindgroup.com
```

**Resolution:**

1. Verify receiver IP/port configuration
2. Check certificate validity
3. Verify network connectivity
4. Review TLS pool logs
5. Reset circuit breaker if needed

### Database Connection Issues

**Symptoms:**

- Health check shows database error
- High connection pool usage
- Query timeouts

**Diagnosis:**

```sql
-- Check active connections
SELECT count(*) FROM pg_stat_activity;

-- Check database-level statistics (backends, commits, rollbacks, deadlocks)
SELECT * FROM pg_stat_database WHERE datname = 'dbis_core';
```

**Resolution:**

1. Increase connection pool size in config
2. Check for long-running queries
3. Restart database if needed
4. Review connection pool settings

### Reconciliation Exceptions

**Symptoms:**

- High number of unmatched payments
- Aging items accumulating

**Resolution:**

1. Review reconciliation report
2. Check exception queue
3. Manually reconcile exceptions
4. Investigate root cause (missing ACK, ledger mismatch, etc.)

## Disaster Recovery

### Backup Procedures

**Daily Backups:**

```bash
# Database backup
pg_dump "$DATABASE_URL" | gzip > backups/dbis_core_$(date +%Y%m%d).sql.gz

# Audit logs export (for compliance); psql meta-commands are case-sensitive, so use \copy
psql "$DATABASE_URL" -c "\copy audit_logs TO 'audit_logs_$(date +%Y%m%d).csv' CSV HEADER"
```

### Recovery Procedures

**Database Recovery:**

```bash
# Stop application
docker-compose stop app

# Restore database
gunzip < backups/dbis_core_20240101.sql.gz | psql "$DATABASE_URL"

# Run migrations
npm run migrate

# Restart application
docker-compose start app
```

### Data Retention

- **Audit Logs**: 7-10 years (configurable)
- **Payment Records**: Indefinite (archived after 7 years)
- **Application Logs**: 30 days

### Failover Procedures

1. **Application Failover:**
   - Deploy to secondary server
   - Update load balancer
   - Verify health checks
2. **Database Failover:**
   - Promote replica to primary
   - Update `DATABASE_URL`
   - Restart application

## Emergency Contacts

- **System Administrator**: [Contact]
- **Database Administrator**: [Contact]
- **Security Team**: [Contact]
- **On-Call Engineer**: [Contact]

## Change Management

All changes to production must:

1. Be tested in staging environment
2. Have rollback plan documented
3. Be approved by technical lead
4. Be performed during maintenance window
5. Be monitored post-deployment
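
For the post-deployment monitoring step, one possible spot check compares two snapshots of `/metrics` against the critical threshold of a payment failure rate above 5% in 5 minutes. The following is a minimal sketch, not part of the system's tooling: the helper names `counter_sum` and `failure_rate_pct` are hypothetical, and it assumes the payment counters are exposed in Prometheus text format without trailing timestamps. A production alert would normally be defined in Prometheus itself rather than in a shell script.

```bash
# Post-deployment spot check (sketch; helper names are hypothetical).

# Sum every sample of one counter, with or without labels, read from stdin.
# Comment lines (# HELP, # TYPE) are skipped because their first field is "#".
counter_sum() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { sum += $NF }
    END { printf "%d\n", sum }
  '
}

# Print the failure percentage between two /metrics snapshots.
# $1 = earlier snapshot text, $2 = later snapshot text.
# The body runs in a subshell so its variables stay local.
failure_rate_pct() (
  failed_before=$(printf '%s\n' "$1" | counter_sum payments_failed_total)
  failed_after=$(printf '%s\n' "$2" | counter_sum payments_failed_total)
  init_before=$(printf '%s\n' "$1" | counter_sum payments_initiated_total)
  init_after=$(printf '%s\n' "$2" | counter_sum payments_initiated_total)

  failed=$(( failed_after - failed_before ))
  initiated=$(( init_after - init_before ))

  if [ "$initiated" -le 0 ]; then
    # No new payments between the snapshots, so report zero.
    echo "0.00"
  else
    awk -v f="$failed" -v i="$initiated" 'BEGIN { printf "%.2f\n", 100 * f / i }'
  fi
)

# Usage over a 5-minute window:
#   before=$(curl -s http://localhost:3000/metrics); sleep 300
#   after=$(curl -s http://localhost:3000/metrics)
#   failure_rate_pct "$before" "$after"
```

A printed value above `5.00` corresponds to the critical alert condition. Because the counters are cumulative, the sketch uses the difference between snapshots; a process restart between the two snapshots resets the counters and makes the result unreliable.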