# Monitoring & Observability HemoStat includes a comprehensive monitoring stack built on Prometheus and Grafana, providing deep insights into system performance, container health, and agent operations. ## Overview The monitoring stack consists of three key components: 1. **Metrics Exporter Agent** - Subscribes to HemoStat events and exposes Prometheus metrics 2. **Prometheus** - Time-series database for metrics collection and alerting 3. **Grafana** - Visualization dashboards for historical analysis ```{mermaid} graph LR A[Monitor Agent] -->|events| R[Redis] B[Analyzer Agent] -->|events| R C[Responder Agent] -->|events| R D[Alert Agent] -->|events| R R -->|subscribe| M[Metrics Exporter] M -->|:9090/metrics| P[Prometheus] P -->|data source| G[Grafana Dashboards] ``` ## Architecture ### Metrics Exporter Agent The Metrics Exporter is a specialized HemoStat agent that: - Subscribes to all HemoStat Redis pub/sub channels - Converts events into Prometheus-compatible metrics - Exposes an HTTP endpoint at `http://localhost:9090/metrics` - Tracks container health, agent performance, and system operations **Implementation**: `agents/hemostat_metrics/metrics.py` ### Data Flow 1. HemoStat agents publish events to Redis 2. Metrics Exporter subscribes and converts events to metrics 3. Prometheus scrapes metrics endpoint every 10 seconds 4. Grafana queries Prometheus for visualization 5. Alert rules trigger notifications on anomalies ## Metrics Catalog ### Container Health Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_container_cpu_percent` | Gauge | CPU usage percentage per container | | `hemostat_container_memory_percent` | Gauge | Memory usage percentage per container | | `hemostat_container_memory_bytes` | Gauge | Memory usage in bytes | | `hemostat_container_restart_count` | Gauge | Container restart count | | `hemostat_container_network_rx_bytes_total` | Counter | Network bytes received | | `hemostat_container_network_tx_bytes_total` | Counter | Network bytes transmitted | | `hemostat_container_blkio_read_bytes_total` | Counter | Block I/O read bytes | | `hemostat_container_blkio_write_bytes_total` | Counter | Block I/O write bytes | ### Health Alert Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_health_alerts_total` | Counter | Total health alerts by severity | | `hemostat_anomalies_detected_total` | Counter | Total anomalies by type | ### Analysis Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_analysis_requests_total` | Counter | Total analysis requests by result type | | `hemostat_analysis_duration_seconds` | Histogram | Analysis response time distribution | | `hemostat_analysis_confidence` | Histogram | AI confidence score distribution | ### Remediation Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_remediation_attempts_total` | Counter | Total attempts by action and status | | `hemostat_remediation_duration_seconds` | Histogram | Remediation execution time | | `hemostat_remediation_cooldown_active` | Gauge | Cooldown status per container | ### Alert Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_alerts_sent_total` | Counter | Total alerts sent by channel and status | | `hemostat_alerts_deduped_total` | Counter | Total deduplicated alerts | ### System Metrics | Metric | Type | Description | |--------|------|-------------| | `hemostat_agent_uptime_seconds` | Gauge | Agent uptime tracking | | `hemostat_redis_operations_total` | Counter | Redis operations by type | | `hemostat_time_to_detection_seconds` | Histogram | Time from issue to detection | | `hemostat_time_to_remediation_seconds` | Histogram | Time from detection to fix | ## Quick Start ### 1. Start Monitoring Stack ```bash # Start all services including monitoring docker compose up -d # Or start just monitoring components docker compose up -d redis metrics prometheus grafana ``` ### 2. Access Dashboards **Grafana Dashboard** - URL: http://localhost:3000 - Username: `admin` - Password: `admin` (change on first login) **Prometheus Query UI** - URL: http://localhost:9091 - Direct metric queries and exploration **Metrics Endpoint** - URL: http://localhost:9090/metrics - Raw Prometheus metrics ### 3. View HemoStat Overview Dashboard 1. Login to Grafana 2. Navigate to **Dashboards** → **HemoStat** folder 3. Select **HemoStat Overview** The dashboard displays: - Container CPU and memory usage graphs - Health alerts by severity - Analysis performance metrics (response time, confidence) - Remediation attempts and success rates - Agent uptime and system health ## Prometheus Configuration ### Scrape Configuration Prometheus is configured to scrape the Metrics Exporter every 10 seconds: ```yaml scrape_configs: - job_name: 'hemostat-metrics' static_configs: - targets: ['metrics:9090'] scrape_interval: 10s scrape_timeout: 5s ``` **Configuration file**: `monitoring/prometheus/prometheus.yml` ### Alert Rules Pre-configured alert rules monitor system health: **Container Health Alerts** - `HighContainerCPU` - CPU > 90% for 2+ minutes - `HighContainerMemory` - Memory > 90% for 2+ minutes - `ExcessiveContainerRestarts` - Frequent restart rate **System Performance Alerts** - `SlowAnalysisResponse` - p95 latency > 10 seconds - `HighRemediationFailureRate` - Failure rate > 30% - `HighAlertFailureRate` - Notification failures > 20% **System Health Alerts** - `MetricsExporterDown` - Exporter unavailable for 1+ minute - `NoHealthAlertsDetected` - No alerts for 30+ minutes (possible monitor issue) **Configuration file**: `monitoring/prometheus/rules/hemostat_alerts.yml` ## Grafana Dashboards ### HemoStat Overview Dashboard The main dashboard provides 11 panels across four sections: **Summary Metrics** - Monitored containers count - Health alerts per minute - Median analysis confidence - Remediations per minute **Container Health Graphs** - CPU usage time-series per container - Memory usage time-series per container **System Activity** - Health alerts by severity - Analysis requests by result type - Remediation attempts by action and status **Performance Metrics** - Analysis duration percentiles (p50, p95, p99) - Remediation duration percentiles (p50, p95, p99) ### Auto-Provisioning Dashboards and data sources are automatically configured on startup: - **Data Source**: `monitoring/grafana/provisioning/datasources/prometheus.yml` - **Dashboard**: `monitoring/grafana/provisioning/dashboards/hemostat_overview.json` ## PromQL Query Examples ### Container Metrics ```promql # Average CPU across all containers avg(hemostat_container_cpu_percent) # Containers with high memory usage hemostat_container_memory_percent > 80 # Container restart rate rate(hemostat_container_restart_count[5m]) ``` ### Analysis Performance ```promql # Analysis p95 latency histogram_quantile(0.95, rate(hemostat_analysis_duration_seconds_bucket[5m])) # Median confidence score histogram_quantile(0.5, rate(hemostat_analysis_confidence_bucket[5m])) # Analysis requests per second rate(hemostat_analysis_requests_total[1m]) ``` ### Remediation Tracking ```promql # Remediation success rate sum(rate(hemostat_remediation_attempts_total{status="success"}[5m])) / sum(rate(hemostat_remediation_attempts_total[5m])) # Failed remediations per minute rate(hemostat_remediation_attempts_total{status="failed"}[5m]) * 60 # Remediation duration p99 histogram_quantile(0.99, rate(hemostat_remediation_duration_seconds_bucket[5m])) ``` ### System Health ```promql # Agent uptime hemostat_agent_uptime_seconds{agent_name="metrics"} # Total health alerts in last hour sum(increase(hemostat_health_alerts_total[1h])) # Alert deduplication rate rate(hemostat_alerts_deduped_total[5m]) ``` ## Configuration ### Environment Variables **Metrics Exporter** ```bash METRICS_PORT=9090 # HTTP server port REDIS_HOST=redis # Redis hostname REDIS_PORT=6379 # Redis port LOG_LEVEL=INFO # Logging level ``` **Prometheus** - Data retention: 15 days (configurable via `--storage.tsdb.retention.time`) - Scrape interval: 10 seconds (configurable in `prometheus.yml`) **Grafana** ```bash GF_SECURITY_ADMIN_USER=admin # Admin username GF_SECURITY_ADMIN_PASSWORD=admin # Admin password (change this!) GF_USERS_ALLOW_SIGN_UP=false # Disable user signup ``` ### Customizing Scrape Intervals Edit `monitoring/prometheus/prometheus.yml`: ```yaml scrape_configs: - job_name: 'hemostat-metrics' scrape_interval: 5s # More frequent scraping ``` ### Adjusting Alert Thresholds Edit `monitoring/prometheus/rules/hemostat_alerts.yml`: ```yaml - alert: HighContainerCPU expr: hemostat_container_cpu_percent > 95 # Increase threshold for: 5m # Wait longer before alerting ``` ### Changing Data Retention Edit `docker-compose.yml`: ```yaml prometheus: command: - '--storage.tsdb.retention.time=30d' # Keep data for 30 days - '--storage.tsdb.retention.size=10GB' # Limit storage size ``` ## Troubleshooting ### Metrics Not Appearing **Check Metrics Exporter** ```bash # View logs docker compose logs metrics # Verify endpoint is accessible curl http://localhost:9090/metrics | grep hemostat_ # Check Redis connection docker compose exec metrics python -c "import redis; redis.Redis(host='redis').ping()" ``` ### Prometheus Not Scraping **Verify Targets** ```bash # Check target status curl http://localhost:9091/api/v1/targets # View in browser open http://localhost:9091/targets ``` **Check Connectivity** ```bash # Test from Prometheus container docker compose exec prometheus wget -O- http://metrics:9090/metrics ``` ### Grafana Shows "No Data" 1. **Verify time range** - Use "Last 15 minutes" or "Last 1 hour" 2. **Test data source** - Go to Configuration → Data Sources → Test 3. **Check Prometheus** - Query metrics directly in Prometheus UI 4. **Generate activity** - Start agents to create events **Test Prometheus Connection** ```bash # From Grafana container docker compose exec grafana wget -O- http://prometheus:9090/api/v1/query?query=up ``` ### High Resource Usage **Reduce Scrape Frequency** ```yaml scrape_interval: 30s # From 10s to 30s ``` **Lower Retention Period** ```yaml --storage.tsdb.retention.time=7d # From 15d to 7d ``` **Add Recording Rules** for expensive queries: ```yaml # Create recording rules for frequently used queries - record: job:hemostat_cpu_avg:5m expr: avg(hemostat_container_cpu_percent) ``` ## Integration with Existing Dashboard HemoStat provides two complementary dashboards: ### Streamlit Dashboard (Port 8501) - **Real-time event streaming** - **Live container status grid** - **Event timeline and details** - **Active issues feed** ### Grafana Dashboard (Port 3000) - **Historical metrics analysis** - **Performance trends over time** - **Alert visualization** - **Custom query exploration** **Use Cases:** - **Streamlit**: Real-time incident response and live monitoring - **Grafana**: Performance analysis, capacity planning, trend identification ## Best Practices ### Dashboard Design 1. **Use appropriate time ranges** - Last 1 hour for real-time, 24 hours for trends 2. **Set meaningful thresholds** - Based on SLAs and baseline performance 3. **Add annotations** - Mark deployments and incidents on graphs 4. **Create variables** - For dynamic container/service filtering ### Alerting Strategy 1. **Set alert thresholds conservatively** - Avoid alert fatigue 2. **Use multi-condition alerts** - Combine metrics for context 3. **Configure notification channels** - Slack, email, PagerDuty 4. **Test alert rules** - Verify alerts trigger appropriately ### Performance Optimization 1. **Use recording rules** - Pre-compute expensive queries 2. **Limit cardinality** - Avoid high-cardinality labels 3. **Set appropriate retention** - Balance storage and query needs 4. **Monitor Prometheus itself** - Track ingestion rate and query performance ### Security 1. **Change default passwords** - Especially Grafana admin password 2. **Enable authentication** - For Prometheus and Grafana 3. **Use HTTPS** - In production deployments 4. **Restrict network access** - Limit to authorized networks ## Advanced Topics ### Adding Custom Metrics Extend the Metrics Exporter in `agents/hemostat_metrics/metrics.py`: ```python from prometheus_client import Counter # Define new metric self.custom_metric = Counter( "hemostat_custom_total", "Description of custom metric", ["label1", "label2"] ) # Update metric in event handler self.custom_metric.labels(label1="value1", label2="value2").inc() ``` ### Creating Custom Dashboards 1. Design dashboard in Grafana UI 2. Export as JSON: Dashboard Settings → JSON Model 3. Save to `monitoring/grafana/provisioning/dashboards/` 4. Restart Grafana to load ### Setting Up Alertmanager For advanced alert routing and aggregation: ```yaml # In prometheus.yml alerting: alertmanagers: - static_configs: - targets: ['alertmanager:9093'] ``` ### Remote Write Configuration For long-term storage: ```yaml # In prometheus.yml remote_write: - url: "https://prometheus-remote-storage.example.com/api/v1/write" ``` ## Service URLs | Service | URL | Purpose | |---------|-----|---------| | **Grafana** | http://localhost:3000 | Visualization dashboards | | **Prometheus** | http://localhost:9091 | Query UI and alerts | | **Metrics Exporter** | http://localhost:9090/metrics | Raw metrics endpoint | | **Streamlit Dashboard** | http://localhost:8501 | Real-time monitoring | ## References - **Prometheus Documentation**: https://prometheus.io/docs/ - **Grafana Documentation**: https://grafana.com/docs/ - **PromQL Guide**: https://prometheus.io/docs/prometheus/latest/querying/basics/ - **Metrics Exporter Source**: `agents/hemostat_metrics/` - **Configuration Files**: `monitoring/prometheus/` and `monitoring/grafana/` ## See Also - [Architecture](architecture.md) - System design and agent communication - [Deployment](deployment.md) - Production deployment strategies - [Troubleshooting](troubleshooting.md) - Common issues and solutions - [API Reference](api/agents.rst) - Metrics Exporter API documentation