Monitoring & Observability

HemoStat ships with a monitoring stack built on Prometheus and Grafana that tracks system performance, container health, and agent operations.

Overview

The monitoring stack consists of three key components:

Metrics Exporter Agent - Subscribes to HemoStat events and exposes Prometheus metrics
Prometheus - Time-series database for metrics collection and alerting
Grafana - Visualization dashboards for historical analysis

graph LR
    A[Monitor Agent] -->|events| R[Redis]
    B[Analyzer Agent] -->|events| R
    C[Responder Agent] -->|events| R
    D[Alert Agent] -->|events| R
    R -->|subscribe| M[Metrics Exporter]
    M -->|:9090/metrics| P[Prometheus]
    P -->|data source| G[Grafana Dashboards]

Architecture

Metrics Exporter Agent

The Metrics Exporter is a specialized HemoStat agent that:

Subscribes to all HemoStat Redis pub/sub channels
Converts events into Prometheus-compatible metrics
Exposes an HTTP endpoint at http://localhost:9090/metrics
Tracks container health, agent performance, and system operations

Implementation: agents/hemostat_metrics/metrics.py

Data Flow

HemoStat agents publish events to Redis
Metrics Exporter subscribes and converts events to metrics
Prometheus scrapes metrics endpoint every 10 seconds
Grafana queries Prometheus for visualization
Alert rules trigger notifications on anomalies
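
The conversion step in this flow can be sketched in plain Python. The event shape used here (a JSON object with `container`, `cpu_percent`, and `memory_percent` fields), the `container` label, and the function names are assumptions for illustration, not the actual schema or code in agents/hemostat_metrics/metrics.py. Only the two metric names come from this document.

```python
import json

# Latest gauge values keyed by (metric name, container).
# Hypothetical in-memory store, used only for this illustration.
gauges = {}


def handle_event(raw_message):
    """Update gauges from one JSON event (assumed, illustrative schema)."""
    event = json.loads(raw_message)
    container = event["container"]
    gauges[("hemostat_container_cpu_percent", container)] = float(event["cpu_percent"])
    gauges[("hemostat_container_memory_percent", container)] = float(event["memory_percent"])


def render_metrics():
    """Render the stored gauges as Prometheus text exposition sample lines."""
    lines = []
    for (name, container), value in sorted(gauges.items()):
        lines.append(f'{name}{{container="{container}"}} {value}')
    return "\n".join(lines) + "\n"


handle_event('{"container": "web", "cpu_percent": 42.5, "memory_percent": 63}')
print(render_metrics())
```

Each rendered line has the `name{labels} value` shape that Prometheus reads when it scrapes the endpoint.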

Metrics Catalog

Container Health Metrics

Metric	Type	Description
`hemostat_container_cpu_percent`	Gauge	CPU usage percentage per container
`hemostat_container_memory_percent`	Gauge	Memory usage percentage per container
`hemostat_container_memory_bytes`	Gauge	Memory usage in bytes
`hemostat_container_restart_count`	Gauge	Container restart count
`hemostat_container_network_rx_bytes_total`	Counter	Network bytes received
`hemostat_container_network_tx_bytes_total`	Counter	Network bytes transmitted
`hemostat_container_blkio_read_bytes_total`	Counter	Block I/O read bytes
`hemostat_container_blkio_write_bytes_total`	Counter	Block I/O write bytes

Health Alert Metrics

Metric	Type	Description
`hemostat_health_alerts_total`	Counter	Total health alerts by severity
`hemostat_anomalies_detected_total`	Counter	Total anomalies by type

Analysis Metrics

Metric	Type	Description
`hemostat_analysis_requests_total`	Counter	Total analysis requests by result type
`hemostat_analysis_duration_seconds`	Histogram	Analysis response time distribution
`hemostat_analysis_confidence`	Histogram	AI confidence score distribution
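
A Prometheus Histogram stores cumulative bucket counters plus a running sum and count, not the raw observations. The sketch below models that bookkeeping in plain Python. The bucket bounds are made-up values, and the exporter itself would normally rely on the `prometheus_client` library's Histogram rather than code like this.

```python
import math


class CumulativeHistogram:
    """Toy model of a Prometheus histogram: cumulative buckets plus sum and count."""

    def __init__(self, upper_bounds):
        # Prometheus always adds a final +Inf bucket that catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # Each observation increments every bucket whose bound is >= the value,
        # which is what makes the *_bucket series cumulative.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.total += value
        self.count += 1


hist = CumulativeHistogram([0.5, 1.0, 5.0])  # illustrative bounds, in seconds
for seconds in (0.2, 0.7, 0.9, 3.0, 12.0):
    hist.observe(seconds)
print(hist.bucket_counts)  # counts for le=0.5, le=1.0, le=5.0, le=+Inf
```

The three attributes correspond to the `_bucket`, `_sum`, and `_count` series that a histogram metric exposes.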

Remediation Metrics

Metric	Type	Description
`hemostat_remediation_attempts_total`	Counter	Total attempts by action and status
`hemostat_remediation_duration_seconds`	Histogram	Remediation execution time
`hemostat_remediation_cooldown_active`	Gauge	Cooldown status per container

Alert Metrics

Metric	Type	Description
`hemostat_alerts_sent_total`	Counter	Total alerts sent by channel and status
`hemostat_alerts_deduped_total`	Counter	Total deduplicated alerts

System Metrics

Metric	Type	Description
`hemostat_agent_uptime_seconds`	Gauge	Agent uptime tracking
`hemostat_redis_operations_total`	Counter	Redis operations by type
`hemostat_time_to_detection_seconds`	Histogram	Time from issue to detection
`hemostat_time_to_remediation_seconds`	Histogram	Time from detection to fix

Quick Start

1. Start Monitoring Stack

# Start all services including monitoring
docker compose up -d

# Or start just monitoring components
docker compose up -d redis metrics prometheus grafana

2. Access Dashboards

Grafana Dashboard

URL: http://localhost:3000
Username: admin
Password: admin (change on first login)

Prometheus Query UI

URL: http://localhost:9091
Direct metric queries and exploration

Metrics Endpoint

URL: http://localhost:9090/metrics
Raw Prometheus metrics

3. View HemoStat Overview Dashboard

Log in to Grafana
Navigate to Dashboards → HemoStat folder
Select HemoStat Overview

The dashboard displays:

Container CPU and memory usage graphs
Health alerts by severity
Analysis performance metrics (response time, confidence)
Remediation attempts and success rates
Agent uptime and system health

Prometheus Configuration

Scrape Configuration

Prometheus is configured to scrape the Metrics Exporter every 10 seconds:

scrape_configs:
  - job_name: 'hemostat-metrics'
    static_configs:
      - targets: ['metrics:9090']
    scrape_interval: 10s
    scrape_timeout: 5s

Configuration file: monitoring/prometheus/prometheus.yml

Alert Rules

Pre-configured alert rules monitor system health:

Container Health Alerts

HighContainerCPU - CPU > 90% for 2+ minutes
HighContainerMemory - Memory > 90% for 2+ minutes
ExcessiveContainerRestarts - Frequent restart rate

System Performance Alerts

SlowAnalysisResponse - p95 latency > 10 seconds
HighRemediationFailureRate - Failure rate > 30%
HighAlertFailureRate - Notification failures > 20%

System Health Alerts

MetricsExporterDown - Exporter unavailable for 1+ minute
NoHealthAlertsDetected - No alerts for 30+ minutes (possible monitor issue)

Configuration file: monitoring/prometheus/rules/hemostat_alerts.yml
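
The "for 2+ minutes" style conditions above mean the expression must stay true continuously before the alert fires. The following is a simplified plain-Python model of that pending/firing behavior, not Prometheus code. The 60-second spacing between samples stands in for the rule evaluation interval and is an assumption.

```python
def alert_states(samples, threshold, for_seconds):
    """Return one state per sample: 'inactive', 'pending', or 'firing'.

    samples is a list of (timestamp_seconds, value) pairs in time order.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states


# Illustrative CPU readings, one per minute.
cpu = [(0, 95), (60, 96), (120, 97), (180, 50), (240, 99)]
print(alert_states(cpu, threshold=90, for_seconds=120))
```

A single dip below the threshold returns the alert to inactive, which is why a longer `for` value suppresses short spikes.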

Grafana Dashboards

HemoStat Overview Dashboard

The main dashboard provides 11 panels across four sections:

Summary Metrics

Monitored containers count
Health alerts per minute
Median analysis confidence
Remediations per minute

Container Health Graphs

CPU usage time-series per container
Memory usage time-series per container

System Activity

Health alerts by severity
Analysis requests by result type
Remediation attempts by action and status

Performance Metrics

Analysis duration percentiles (p50, p95, p99)
Remediation duration percentiles (p50, p95, p99)

Auto-Provisioning

Dashboards and data sources are automatically configured on startup:

Data Source: monitoring/grafana/provisioning/datasources/prometheus.yml
Dashboard: monitoring/grafana/provisioning/dashboards/hemostat_overview.json

PromQL Query Examples

Container Metrics

# Average CPU across all containers
avg(hemostat_container_cpu_percent)

# Containers with high memory usage
hemostat_container_memory_percent > 80

# Container restarts over the last 5 minutes (the metric is a gauge, so use delta() rather than rate())
delta(hemostat_container_restart_count[5m])

Analysis Performance

# Analysis p95 latency
histogram_quantile(0.95, rate(hemostat_analysis_duration_seconds_bucket[5m]))

# Median confidence score
histogram_quantile(0.5, rate(hemostat_analysis_confidence_bucket[5m]))

# Analysis requests per second
rate(hemostat_analysis_requests_total[1m])
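
To make the `histogram_quantile` queries above less opaque, here is a plain-Python sketch of the linear interpolation that function performs over cumulative buckets. It is a simplified illustration that skips several edge cases the real function handles, and the bucket values are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented cumulative counts for buckets le=0.5, le=1.0, le=5.0, le=+Inf.
buckets = [(0.5, 10), (1.0, 30), (5.0, 40), (math.inf, 40)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
```

Because the result is interpolated inside a bucket, its accuracy depends on how closely the bucket bounds bracket the latencies of interest.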

Remediation Tracking

# Remediation success rate
sum(rate(hemostat_remediation_attempts_total{status="success"}[5m])) /
sum(rate(hemostat_remediation_attempts_total[5m]))

# Failed remediations per minute
rate(hemostat_remediation_attempts_total{status="failed"}[5m]) * 60

# Remediation duration p99
histogram_quantile(0.99, rate(hemostat_remediation_duration_seconds_bucket[5m]))
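
The `rate(...)` calls above turn an ever-increasing counter into a per-second rate and tolerate counter resets, such as an exporter restart. A minimal plain-Python sketch of that idea follows. It deliberately omits the extrapolation to window boundaries that the real `rate()` applies, and the sample values are invented.

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, value) counter samples in time order."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # The counter dropped, so assume a restart and count up from zero.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Invented samples taken every 10 seconds, with a reset between t=10 and t=20.
failed = [(0, 10), (10, 20), (20, 5), (30, 15)]
per_second = counter_rate(failed)
print(per_second * 60)  # per-minute figure, mirroring the "* 60" in the query above
```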

System Health

# Agent uptime
hemostat_agent_uptime_seconds{agent_name="metrics"}

# Total health alerts in last hour
sum(increase(hemostat_health_alerts_total[1h]))

# Alert deduplication rate
rate(hemostat_alerts_deduped_total[5m])

Configuration

Environment Variables

Metrics Exporter

METRICS_PORT=9090    # HTTP server port
REDIS_HOST=redis     # Redis hostname
REDIS_PORT=6379      # Redis port
LOG_LEVEL=INFO       # Logging level
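
As a sketch of how an agent might consume these variables, the snippet below reads them with fallbacks. Treating the values shown above as defaults is an assumption, and `load_exporter_config` is a hypothetical helper, not a function from the HemoStat codebase.

```python
import os


def load_exporter_config(env=os.environ):
    """Read exporter settings, falling back to the values listed in this section."""
    return {
        "metrics_port": int(env.get("METRICS_PORT", "9090")),
        "redis_host": env.get("REDIS_HOST", "redis"),
        "redis_port": int(env.get("REDIS_PORT", "6379")),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
    }


# Passing a plain dict makes the override behavior easy to see.
print(load_exporter_config({"METRICS_PORT": "9100"}))
```

Parsing the ports with `int` makes a malformed value fail at startup instead of when the server binds.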

Prometheus

Data retention: 15 days (configurable via --storage.tsdb.retention.time)
Scrape interval: 10 seconds (configurable in prometheus.yml)

Grafana

GF_SECURITY_ADMIN_USER=admin        # Admin username
GF_SECURITY_ADMIN_PASSWORD=admin    # Admin password (change this!)
GF_USERS_ALLOW_SIGN_UP=false        # Disable user signup

Customizing Scrape Intervals

Edit monitoring/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'hemostat-metrics'
    scrape_interval: 5s  # More frequent scraping

Adjusting Alert Thresholds

Edit monitoring/prometheus/rules/hemostat_alerts.yml:

- alert: HighContainerCPU
  expr: hemostat_container_cpu_percent > 95  # Increase threshold
  for: 5m  # Wait longer before alerting

Changing Data Retention

Edit docker-compose.yml:

prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # Keep data for 30 days
    - '--storage.tsdb.retention.size=10GB' # Limit storage size
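
When choosing a retention period, a rough disk estimate helps. The Prometheus storage documentation gives the rule of thumb: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample on average. The sketch below applies it; the series count of 1,000 is an invented figure, not a measurement of HemoStat.

```python
def estimate_tsdb_bytes(active_series, scrape_interval_seconds, retention_days,
                        bytes_per_sample=2.0):
    """Rough Prometheus disk estimate from the documented rule of thumb."""
    # Every active series yields one sample per scrape.
    samples_per_second = active_series / scrape_interval_seconds
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 1,000 series scraped every 10 seconds, kept for 30 days.
estimate = estimate_tsdb_bytes(1000, 10, 30)
print(f"{estimate / 1e9:.2f} GB")
```

Treat the result as an order-of-magnitude guide; actual usage depends on label churn and compression.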

Troubleshooting

Metrics Not Appearing

Check Metrics Exporter

# View logs
docker compose logs metrics

# Verify endpoint is accessible (-s hides curl's progress output)
curl -s http://localhost:9090/metrics | grep hemostat_

# Check Redis connection (prints True on success)
docker compose exec metrics python -c "import redis; print(redis.Redis(host='redis').ping())"
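
The endpoint check above can also be scripted. This sketch lists the HemoStat metric names present in a scrape body using only the standard library. `hemostat_metric_names` and `fetch_metrics` are hypothetical helper names, and the sample text is invented.

```python
import urllib.request


def hemostat_metric_names(exposition_text):
    """Return the sorted, de-duplicated HemoStat metric names in a scrape body."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("hemostat_"):
            names.add(name)
    return sorted(names)


def fetch_metrics(url="http://localhost:9090/metrics", timeout=5):
    """Download the raw metrics page (requires the exporter to be running)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Invented sample body, so the parser can be tried without a running stack.
sample = (
    "# TYPE hemostat_container_cpu_percent gauge\n"
    'hemostat_container_cpu_percent{container="web"} 12.5\n'
    "hemostat_agent_uptime_seconds 42\n"
    'python_gc_objects_collected_total{generation="0"} 10\n'
)
print(hemostat_metric_names(sample))
```

Against a live stack, `hemostat_metric_names(fetch_metrics())` gives the same view. An empty list suggests the exporter has not yet populated any HemoStat metrics.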

Prometheus Not Scraping

Verify Targets

# Check target status
curl http://localhost:9091/api/v1/targets

# View in browser (macOS; use xdg-open on Linux)
open http://localhost:9091/targets
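
The JSON returned by the targets API is verbose, so a small filter helps when scripting this check. The sketch below pulls out targets whose health is not `up`. The field names (`activeTargets`, `health`, `labels`, `scrapeUrl`, `lastError`) follow the Prometheus HTTP API, while the sample response and the helper name are invented for illustration.

```python
import json


def unhealthy_targets(targets_response):
    """List (job, scrape URL, last error) for active targets that are not 'up'."""
    payload = json.loads(targets_response)
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            job = target["labels"].get("job", "")
            problems.append((job, target["scrapeUrl"], target["lastError"]))
    return problems


# Invented response, trimmed to the fields the function reads.
sample_response = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "hemostat-metrics"},
         "scrapeUrl": "http://metrics:9090/metrics",
         "health": "down",
         "lastError": "connection refused"},
        {"labels": {"job": "prometheus"},
         "scrapeUrl": "http://localhost:9090/metrics",
         "health": "up",
         "lastError": ""},
    ]},
})
print(unhealthy_targets(sample_response))
```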

Check Connectivity

# Test from Prometheus container
docker compose exec prometheus wget -O- http://metrics:9090/metrics

Grafana Shows “No Data”

Verify time range - Use “Last 15 minutes” or “Last 1 hour”
Test data source - Go to Configuration → Data Sources → Test
Check Prometheus - Query metrics directly in Prometheus UI
Generate activity - Start agents to create events

Test Prometheus Connection

# From Grafana container (quote the URL so the shell does not interpret "?")
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'

High Resource Usage

Reduce Scrape Frequency

scrape_interval: 30s  # From 10s to 30s

Lower Retention Period

--storage.tsdb.retention.time=7d  # From 15d to 7d

Add Recording Rules

# Pre-compute frequently used queries (add under `rules:` in a rule group)
- record: job:hemostat_cpu_avg:5m
  expr: avg(hemostat_container_cpu_percent)

Integration with Existing Dashboard

HemoStat provides two complementary dashboards:

Streamlit Dashboard (Port 8501)

Real-time event streaming
Live container status grid
Event timeline and details
Active issues feed

Grafana Dashboard (Port 3000)

Historical metrics analysis
Performance trends over time
Alert visualization
Custom query exploration

Use Cases:

Streamlit: Real-time incident response and live monitoring
Grafana: Performance analysis, capacity planning, trend identification

Best Practices

Dashboard Design

Use appropriate time ranges - Last 1 hour for real-time, 24 hours for trends
Set meaningful thresholds - Based on SLAs and baseline performance
Add annotations - Mark deployments and incidents on graphs
Create variables - For dynamic container/service filtering

Alerting Strategy

Set alert thresholds conservatively - Avoid alert fatigue
Use multi-condition alerts - Combine metrics for context
Configure notification channels - Slack, email, PagerDuty
Test alert rules - Verify alerts trigger appropriately

Performance Optimization

Use recording rules - Pre-compute expensive queries
Limit cardinality - Avoid high-cardinality labels
Set appropriate retention - Balance storage and query needs
Monitor Prometheus itself - Track ingestion rate and query performance

Security

Change default passwords - Especially Grafana admin password
Enable authentication - For Prometheus and Grafana
Use HTTPS - In production deployments
Restrict network access - Limit to authorized networks

Advanced Topics

Adding Custom Metrics

Extend the Metrics Exporter in agents/hemostat_metrics/metrics.py:

from prometheus_client import Counter

# Define the new metric once, for example where the exporter sets up its other metrics
self.custom_metric = Counter(
    "hemostat_custom_total",
    "Description of custom metric",
    ["label1", "label2"],
)

# Update the metric in the relevant event handler
self.custom_metric.labels(label1="value1", label2="value2").inc()

Creating Custom Dashboards

Design dashboard in Grafana UI
Export as JSON: Dashboard Settings → JSON Model
Save to monitoring/grafana/provisioning/dashboards/
Restart Grafana to load

Setting Up Alertmanager

For advanced alert routing and aggregation:

# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Remote Write Configuration

For long-term storage:

# In prometheus.yml
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"

Service URLs

Service	URL	Purpose
Grafana	http://localhost:3000	Visualization dashboards
Prometheus	http://localhost:9091	Query UI and alerts
Metrics Exporter	http://localhost:9090/metrics	Raw metrics endpoint
Streamlit Dashboard	http://localhost:8501	Real-time monitoring

References

Prometheus Documentation: https://prometheus.io/docs/
Grafana Documentation: https://grafana.com/docs/
PromQL Guide: https://prometheus.io/docs/prometheus/latest/querying/basics/
Metrics Exporter Source: agents/hemostat_metrics/
Configuration Files: monitoring/prometheus/ and monitoring/grafana/