Monitoring & Observability
HemoStat includes a monitoring stack built on Prometheus and Grafana that tracks system performance, container health, and agent operations.
Overview
The monitoring stack consists of three key components:
Metrics Exporter Agent - Subscribes to HemoStat events and exposes Prometheus metrics
Prometheus - Time-series database for metrics collection and alerting
Grafana - Visualization dashboards for historical analysis
graph LR
A[Monitor Agent] -->|events| R[Redis]
B[Analyzer Agent] -->|events| R
C[Responder Agent] -->|events| R
D[Alert Agent] -->|events| R
R -->|subscribe| M[Metrics Exporter]
M -->|:9090/metrics| P[Prometheus]
P -->|data source| G[Grafana Dashboards]
Architecture
Metrics Exporter Agent
The Metrics Exporter is a specialized HemoStat agent that:
Subscribes to all HemoStat Redis pub/sub channels
Converts events into Prometheus-compatible metrics
Exposes an HTTP endpoint at http://localhost:9090/metrics
Tracks container health, agent performance, and system operations
Implementation: agents/hemostat_metrics/metrics.py
Data Flow
HemoStat agents publish events to Redis
Metrics Exporter subscribes and converts events to metrics
Prometheus scrapes metrics endpoint every 10 seconds
Grafana queries Prometheus for visualization
Alert rules trigger notifications on anomalies
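The data flow above can be sketched in standard-library Python. This is a minimal illustration, not the code in agents/hemostat_metrics/metrics.py: the event fields (`container`, `cpu_percent`, `severity`), the label names, and the `MiniRegistry` class are assumptions, and the Redis subscription is replaced by passing raw JSON messages directly. The metric names are the ones used in the PromQL examples on this page.

```python
import json
from collections import defaultdict


class MiniRegistry:
    """Tiny in-memory stand-in for a Prometheus client registry (illustrative only)."""

    def __init__(self):
        self.gauges = {}                    # (name, labels) -> latest value
        self.counters = defaultdict(float)  # (name, labels) -> running total

    def set_gauge(self, name, labels, value):
        self.gauges[(name, tuple(sorted(labels.items())))] = float(value)

    def inc_counter(self, name, labels, amount=1.0):
        self.counters[(name, tuple(sorted(labels.items())))] += amount

    def render(self):
        """Render all samples in the Prometheus text exposition format."""
        lines = []
        merged = {**self.gauges, **self.counters}
        for (name, labels), value in sorted(merged.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        return "\n".join(lines)


def handle_event(registry, raw_message):
    """Convert one JSON event (hypothetical shape) into metric updates."""
    event = json.loads(raw_message)
    container = event["container"]
    # Gauges keep only the most recent value per container.
    registry.set_gauge(
        "hemostat_container_cpu_percent", {"container": container}, event["cpu_percent"]
    )
    # Counters only ever go up, one increment per alert event.
    if event.get("severity"):
        registry.inc_counter(
            "hemostat_health_alerts_total", {"severity": event["severity"]}
        )


registry = MiniRegistry()
handle_event(registry, '{"container": "web", "cpu_percent": 93.5, "severity": "high"}')
handle_event(registry, '{"container": "web", "cpu_percent": 41.0}')
print(registry.render())
```

The gauge ends at the latest reading while the counter keeps its total, which is the distinction the Type column in the Metrics Catalog relies on.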
Metrics Catalog
Container Health Metrics
| Metric | Type | Description |
|---|---|---|
| hemostat_container_cpu_percent | Gauge | CPU usage percentage per container |
| hemostat_container_memory_percent | Gauge | Memory usage percentage per container |
|  | Gauge | Memory usage in bytes |
| hemostat_container_restart_count | Gauge | Container restart count |
|  | Counter | Network bytes received |
|  | Counter | Network bytes transmitted |
|  | Counter | Block I/O read bytes |
|  | Counter | Block I/O write bytes |
Health Alert Metrics
| Metric | Type | Description |
|---|---|---|
| hemostat_health_alerts_total | Counter | Total health alerts by severity |
|  | Counter | Total anomalies by type |
Analysis Metrics
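| Metric | Type | Description |
|---|---|---|
| hemostat_analysis_requests_total | Counter | Total analysis requests by result type |
| hemostat_analysis_duration_seconds | Histogram | Analysis response time distribution |
| hemostat_analysis_confidence | Histogram | AI confidence score distribution |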
Remediation Metrics
| Metric | Type | Description |
|---|---|---|
| hemostat_remediation_attempts_total | Counter | Total attempts by action and status |
| hemostat_remediation_duration_seconds | Histogram | Remediation execution time |
|  | Gauge | Cooldown status per container |
Alert Metrics
| Metric | Type | Description |
|---|---|---|
|  | Counter | Total alerts sent by channel and status |
| hemostat_alerts_deduped_total | Counter | Total deduplicated alerts |
System Metrics
| Metric | Type | Description |
|---|---|---|
| hemostat_agent_uptime_seconds | Gauge | Agent uptime tracking |
|  | Counter | Redis operations by type |
|  | Histogram | Time from issue to detection |
|  | Histogram | Time from detection to fix |
Quick Start
1. Start Monitoring Stack
# Start all services including monitoring
docker compose up -d
# Or start just monitoring components
docker compose up -d redis metrics prometheus grafana
2. Access Dashboards
Grafana Dashboard
URL: http://localhost:3000
Username: admin
Password: admin (change on first login)
Prometheus Query UI
URL: http://localhost:9091
Direct metric queries and exploration
Metrics Endpoint
URL: http://localhost:9090/metrics
Raw Prometheus metrics
3. View HemoStat Overview Dashboard
Log in to Grafana
Navigate to Dashboards → HemoStat folder
Select HemoStat Overview
The dashboard displays:
Container CPU and memory usage graphs
Health alerts by severity
Analysis performance metrics (response time, confidence)
Remediation attempts and success rates
Agent uptime and system health
Prometheus Configuration
Scrape Configuration
Prometheus is configured to scrape the Metrics Exporter every 10 seconds:
scrape_configs:
  - job_name: 'hemostat-metrics'
    static_configs:
      - targets: ['metrics:9090']
    scrape_interval: 10s
    scrape_timeout: 5s
Configuration file: monitoring/prometheus/prometheus.yml
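Prometheus rejects a job whose scrape_timeout is longer than its scrape_interval, so it can help to check that relationship before editing the file. The sketch below works on a plain dict that mirrors the YAML above. `parse_duration` and `check_scrape_job` are hypothetical helper names, only single-unit durations such as `10s` are handled, and the fallback values assume Prometheus's documented global defaults (1m interval, 10s timeout).

```python
import re

_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Parse a single-unit duration such as '10s' or '5m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def check_scrape_job(job):
    """Return a list of problems found in one scrape job (a plain dict)."""
    problems = []
    interval = parse_duration(job.get("scrape_interval", "1m"))
    timeout = parse_duration(job.get("scrape_timeout", "10s"))
    if timeout > interval:
        problems.append("scrape_timeout must not exceed scrape_interval")
    if not job.get("static_configs"):
        problems.append("no static_configs targets defined")
    return problems


job = {
    "job_name": "hemostat-metrics",
    "static_configs": [{"targets": ["metrics:9090"]}],
    "scrape_interval": "10s",
    "scrape_timeout": "5s",
}
print(check_scrape_job(job))  # the configuration shown above passes

# Lowering the interval below the 5s timeout would be flagged.
too_fast = dict(job, scrape_interval="2s")
print(check_scrape_job(too_fast))
```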
Alert Rules
Pre-configured alert rules monitor system health:
Container Health Alerts
HighContainerCPU - CPU > 90% for 2+ minutes
HighContainerMemory - Memory > 90% for 2+ minutes
ExcessiveContainerRestarts - Frequent restart rate
System Performance Alerts
SlowAnalysisResponse - p95 latency > 10 seconds
HighRemediationFailureRate - Failure rate > 30%
HighAlertFailureRate - Notification failures > 20%
System Health Alerts
MetricsExporterDown - Exporter unavailable for 1+ minute
NoHealthAlertsDetected - No alerts for 30+ minutes (possible monitor issue)
Configuration file: monitoring/prometheus/rules/hemostat_alerts.yml
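Rules phrased as "for 2+ minutes" rely on the `for:` clause: the condition must hold at every evaluation for that long before the alert moves from pending to firing. The sketch below imitates that behaviour on a list of samples. It is a simplified model with a hypothetical function name, not Prometheus's rule evaluator, and the CPU readings are made up.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) pairs for a threshold rule with a hold time.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    state is 'inactive', 'pending', or 'firing'.
    """
    active_since = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append((timestamp, "firing" if held >= for_seconds else "pending"))
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append((timestamp, "inactive"))
    return states


# CPU readings every 30 seconds (illustrative values).
cpu = [(0, 85), (30, 92), (60, 95), (90, 96), (120, 97), (150, 93), (180, 80)]
for timestamp, state in alert_states(cpu, threshold=90, for_seconds=120):
    print(timestamp, state)
```

In this run the threshold is first crossed at t=30, so the alert stays pending until t=150 and then fires. A short spike that ends before the hold time never fires, which is why raising `for:` reduces noise.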
Grafana Dashboards
HemoStat Overview Dashboard
The main dashboard provides 11 panels across four sections:
Summary Metrics
Monitored containers count
Health alerts per minute
Median analysis confidence
Remediations per minute
Container Health Graphs
CPU usage time-series per container
Memory usage time-series per container
System Activity
Health alerts by severity
Analysis requests by result type
Remediation attempts by action and status
Performance Metrics
Analysis duration percentiles (p50, p95, p99)
Remediation duration percentiles (p50, p95, p99)
Auto-Provisioning
Dashboards and data sources are automatically configured on startup:
Data Source: monitoring/grafana/provisioning/datasources/prometheus.yml
Dashboard: monitoring/grafana/provisioning/dashboards/hemostat_overview.json
PromQL Query Examples
Container Metrics
# Average CPU across all containers
avg(hemostat_container_cpu_percent)
# Containers with high memory usage
hemostat_container_memory_percent > 80
# Container restart rate
rate(hemostat_container_restart_count[5m])
Analysis Performance
# Analysis p95 latency
histogram_quantile(0.95, rate(hemostat_analysis_duration_seconds_bucket[5m]))
# Median confidence score
histogram_quantile(0.5, rate(hemostat_analysis_confidence_bucket[5m]))
# Analysis requests per second
rate(hemostat_analysis_requests_total[1m])
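The percentile queries above work on cumulative histogram buckets, and `histogram_quantile` estimates a value by interpolating linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea for classic buckets. It is a simplified model (no native histograms, no handling of malformed buckets), and the bucket bounds and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    where the last upper bound is math.inf (the +Inf bucket).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation between the bucket's bounds.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Analysis durations in seconds: 10 requests <= 1s, 60 <= 5s, 95 <= 10s, 100 total.
buckets = [(1.0, 10), (5.0, 60), (10.0, 95), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
print(histogram_quantile(0.99, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries. A p99 that lands in the +Inf bucket is capped at the largest finite bound, so buckets should extend past the latencies you care about.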
Remediation Tracking
# Remediation success rate
sum(rate(hemostat_remediation_attempts_total{status="success"}[5m])) /
sum(rate(hemostat_remediation_attempts_total[5m]))
# Failed remediations per minute
rate(hemostat_remediation_attempts_total{status="failed"}[5m]) * 60
# Remediation duration p99
histogram_quantile(0.99, rate(hemostat_remediation_duration_seconds_bucket[5m]))
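The success-rate query divides one counter's growth by another's over the same window. As a rough worked example, the sketch below computes the increase of two counters from raw samples, treating any drop as a restart from zero. This leaves out the window extrapolation that Prometheus's `rate()` and `increase()` perform. The sample values and the `increase` helper are illustrative assumptions.

```python
def increase(samples):
    """Total increase across counter samples, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            total += current - previous
        else:
            total += current  # counter restarted; count from zero again
    return total


# Counter readings within one 5-minute window (made-up values).
success = [10, 14, 18, 2, 6]   # the exporter restarted after the third sample
failed = [3, 4, 4, 5, 6]

succeeded = increase(success)
attempted = succeeded + increase(failed)
success_ratio = succeeded / attempted
print(f"success ratio: {success_ratio:.3f}")
```

Dividing two rates over the same window cancels the per-second scaling, so the ratio of increases matches the ratio of rates.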
System Health
# Agent uptime
hemostat_agent_uptime_seconds{agent_name="metrics"}
# Total health alerts in last hour
sum(increase(hemostat_health_alerts_total[1h]))
# Alert deduplication rate
rate(hemostat_alerts_deduped_total[5m])
Configuration
Environment Variables
Metrics Exporter
METRICS_PORT=9090 # HTTP server port
REDIS_HOST=redis # Redis hostname
REDIS_PORT=6379 # Redis port
LOG_LEVEL=INFO # Logging level
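A service typically reads these variables once at startup and falls back to defaults. The sketch below shows one way to do that. `ExporterSettings` and `load_settings` are hypothetical names, and using the values listed above as defaults is an assumption about the real exporter.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ExporterSettings:
    metrics_port: int
    redis_host: str
    redis_port: int
    log_level: str


def load_settings(env=None):
    """Build settings from environment variables, with the documented values as defaults."""
    env = os.environ if env is None else env
    return ExporterSettings(
        metrics_port=int(env.get("METRICS_PORT", "9090")),
        redis_host=env.get("REDIS_HOST", "redis"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_settings({"METRICS_PORT": "9100", "LOG_LEVEL": "debug"})
print(settings)
```

Converting the ports with `int()` at load time means a malformed value fails at startup rather than when the HTTP server or Redis client is created.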
Prometheus
Data retention: 15 days (configurable via --storage.tsdb.retention.time)
Scrape interval: 10 seconds (configurable in prometheus.yml)
Grafana
GF_SECURITY_ADMIN_USER=admin # Admin username
GF_SECURITY_ADMIN_PASSWORD=admin # Admin password (change this!)
GF_USERS_ALLOW_SIGN_UP=false # Disable user signup
Customizing Scrape Intervals
Edit monitoring/prometheus/prometheus.yml:
scrape_configs:
  - job_name: 'hemostat-metrics'
    scrape_interval: 5s # More frequent scraping
Adjusting Alert Thresholds
Edit monitoring/prometheus/rules/hemostat_alerts.yml:
- alert: HighContainerCPU
  expr: hemostat_container_cpu_percent > 95 # Increase threshold
  for: 5m # Wait longer before alerting
Changing Data Retention
Edit docker-compose.yml:
prometheus:
  command:
    # Keep any flags already listed under command
    - '--storage.tsdb.retention.time=30d' # Keep data for 30 days
    - '--storage.tsdb.retention.size=10GB' # Limit storage size
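To choose retention values, the Prometheus storage documentation suggests a rough estimate: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies that formula. The count of 2,000 active series is a placeholder, not a measured HemoStat figure.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 2,000 active series scraped every 10 seconds.
for days in (7, 15, 30):
    size = estimate_disk_bytes(active_series=2000, scrape_interval_s=10, retention_days=days)
    print(f"{days:>2} days: {size / 1e9:.2f} GB")
```

The estimate scales linearly with both retention and series count and inversely with scrape interval, so moving from a 10s to a 30s interval cuts storage to roughly a third.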
Troubleshooting
Metrics Not Appearing
Check Metrics Exporter
# View logs
docker compose logs metrics
# Verify endpoint is accessible
curl http://localhost:9090/metrics | grep hemostat_
# Check Redis connection
docker compose exec metrics python -c "import redis; print(redis.Redis(host='redis').ping())"
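When the raw endpoint output is long, a few lines of Python can do the same filtering as the `grep` above and return the values as numbers. This is a minimal sketch: `hemostat_samples` is a hypothetical helper, it assumes samples carry no trailing timestamp, and the sample text is hand-written rather than captured from a real exporter.

```python
def hemostat_samples(exposition_text, prefix="hemostat_"):
    """Return {series: value} for samples whose metric name starts with prefix."""
    samples = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.startswith(prefix):
            samples[series] = float(value)
    return samples


# Hand-written sample of what the endpoint might return (values are made up).
sample = """\
# HELP hemostat_container_cpu_percent CPU usage percentage per container
# TYPE hemostat_container_cpu_percent gauge
hemostat_container_cpu_percent{container="web"} 12.5
python_gc_objects_collected_total{generation="0"} 321.0
hemostat_health_alerts_total{severity="high"} 3.0
"""

samples = hemostat_samples(sample)
for series, value in samples.items():
    print(series, value)

# Against a running stack, the text could be fetched with the standard library:
# from urllib.request import urlopen
# text = urlopen("http://localhost:9090/metrics", timeout=5).read().decode()
```

An empty result from a reachable endpoint suggests the exporter is running but has not received any events yet.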
Prometheus Not Scraping
Verify Targets
# Check target status
curl http://localhost:9091/api/v1/targets
# View in browser
open http://localhost:9091/targets
Check Connectivity
# Test from Prometheus container
docker compose exec prometheus wget -O- http://metrics:9090/metrics
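The JSON returned by /api/v1/targets can be long. The sketch below pulls out only the active targets that are not healthy, along with the last scrape error. `unhealthy_targets` is a hypothetical helper and the response is a trimmed, hand-written sample in the shape the Prometheus HTTP API documents, not output from a real HemoStat deployment.

```python
import json


def unhealthy_targets(targets_response):
    """List (job, scrape_url, last_error) for active targets whose health is not 'up'."""
    payload = json.loads(targets_response)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus API error: {payload.get('error', 'unknown')}")
    return [
        (target["labels"].get("job", ""), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Trimmed sample response (illustrative values).
sample_response = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "hemostat-metrics", "instance": "metrics:9090"},
                "scrapeUrl": "http://metrics:9090/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ],
        "droppedTargets": [],
    },
})

problems = unhealthy_targets(sample_response)
for job, url, error in problems:
    print(f"{job}: {url} -> {error}")
```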
Grafana Shows “No Data”
Verify time range - Use “Last 15 minutes” or “Last 1 hour”
Test data source - Go to Configuration → Data Sources → Test
Check Prometheus - Query metrics directly in Prometheus UI
Generate activity - Start agents to create events
Test Prometheus Connection
# From Grafana container
docker compose exec grafana wget -O- "http://prometheus:9090/api/v1/query?query=up"
High Resource Usage
Reduce Scrape Frequency
scrape_interval: 30s # From 10s to 30s
Lower Retention Period
--storage.tsdb.retention.time=7d # From 15d to 7d
Add Recording Rules for expensive queries:
# Create recording rules for frequently used queries
- record: job:hemostat_cpu_avg:5m
  expr: avg(hemostat_container_cpu_percent)
Integration with Existing Dashboard
HemoStat provides two complementary dashboards:
Streamlit Dashboard (Port 8501)
Real-time event streaming
Live container status grid
Event timeline and details
Active issues feed
Grafana Dashboard (Port 3000)
Historical metrics analysis
Performance trends over time
Alert visualization
Custom query exploration
Use Cases:
Streamlit: Real-time incident response and live monitoring
Grafana: Performance analysis, capacity planning, trend identification
Best Practices
Dashboard Design
Use appropriate time ranges - Last 1 hour for real-time, 24 hours for trends
Set meaningful thresholds - Based on SLAs and baseline performance
Add annotations - Mark deployments and incidents on graphs
Create variables - For dynamic container/service filtering
Alerting Strategy
Set alert thresholds conservatively - Avoid alert fatigue
Use multi-condition alerts - Combine metrics for context
Configure notification channels - Slack, email, PagerDuty
Test alert rules - Verify alerts trigger appropriately
Performance Optimization
Use recording rules - Pre-compute expensive queries
Limit cardinality - Avoid high-cardinality labels
Set appropriate retention - Balance storage and query needs
Monitor Prometheus itself - Track ingestion rate and query performance
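Cardinality is easy to underestimate because it multiplies: the number of possible time series for a metric is bounded by the product of the distinct values of each label. The sketch below shows the arithmetic. The label names and values are hypothetical and are not taken from the exporter.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels for a remediation counter.
labels = {
    "action": ["restart", "scale", "cleanup"],
    "status": ["success", "failed"],
}
print(series_count(labels))

# Adding a high-cardinality label such as a per-container ID multiplies the total.
labels_with_id = dict(labels, container_id=[f"c{i}" for i in range(500)])
print(series_count(labels_with_id))
```

Histograms amplify this further, since each label combination also produces one series per bucket plus the `_sum` and `_count` series.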
Security
Change default passwords - Especially Grafana admin password
Enable authentication - For Prometheus and Grafana
Use HTTPS - In production deployments
Restrict network access - Limit to authorized networks
Advanced Topics
Adding Custom Metrics
Extend the Metrics Exporter in agents/hemostat_metrics/metrics.py:
from prometheus_client import Counter

# Define new metric
self.custom_metric = Counter(
    "hemostat_custom_total",
    "Description of custom metric",
    ["label1", "label2"]
)

# Update metric in event handler
self.custom_metric.labels(label1="value1", label2="value2").inc()
Creating Custom Dashboards
Design dashboard in Grafana UI
Export as JSON: Dashboard Settings → JSON Model
Save to monitoring/grafana/provisioning/dashboards/
Restart Grafana to load
Setting Up Alertmanager
For advanced alert routing and aggregation:
# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
Remote Write Configuration
For long-term storage:
# In prometheus.yml
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
Service URLs
| Service | URL | Purpose |
|---|---|---|
| Grafana | http://localhost:3000 | Visualization dashboards |
| Prometheus | http://localhost:9091 | Query UI and alerts |
| Metrics Exporter | http://localhost:9090/metrics | Raw metrics endpoint |
| Streamlit Dashboard | http://localhost:8501 | Real-time monitoring |
References
Prometheus Documentation: https://prometheus.io/docs/
Grafana Documentation: https://grafana.com/docs/
PromQL Guide: https://prometheus.io/docs/prometheus/latest/querying/basics/
Metrics Exporter Source: agents/hemostat_metrics/
Configuration Files: monitoring/prometheus/ and monitoring/grafana/
See Also
Architecture - System design and agent communication
Deployment - Production deployment strategies
Troubleshooting - Common issues and solutions
API Reference - Metrics Exporter API documentation