Monitoring & Observability

HemoStat includes a monitoring stack built on Prometheus and Grafana that provides visibility into system performance, container health, and agent operations.

Overview

The monitoring stack consists of three key components:

  1. Metrics Exporter Agent - Subscribes to HemoStat events and exposes Prometheus metrics

  2. Prometheus - Time-series database for metrics collection and alerting

  3. Grafana - Visualization dashboards for historical analysis

graph LR
    A[Monitor Agent] -->|events| R[Redis]
    B[Analyzer Agent] -->|events| R
    C[Responder Agent] -->|events| R
    D[Alert Agent] -->|events| R
    R -->|subscribe| M[Metrics Exporter]
    M -->|:9090/metrics| P[Prometheus]
    P -->|data source| G[Grafana Dashboards]

Architecture

Metrics Exporter Agent

The Metrics Exporter is a specialized HemoStat agent that:

  • Subscribes to all HemoStat Redis pub/sub channels

  • Converts events into Prometheus-compatible metrics

  • Exposes an HTTP endpoint at http://localhost:9090/metrics

  • Tracks container health, agent performance, and system operations

Implementation: agents/hemostat_metrics/metrics.py
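
As a rough illustration of the conversion step, the sketch below turns one container event into lines of the Prometheus text exposition format using only the standard library. The event fields (`container`, `cpu_percent`, `memory_percent`), the `container` label, and both function names are assumptions made for this example; the real event schema and the code in agents/hemostat_metrics/metrics.py may differ.

```python
def format_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format.

    Label values are not escaped here, so this only suits simple values.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


def event_to_samples(event):
    """Map a (hypothetical) container health event to gauge samples."""
    labels = {"container": event["container"]}
    return [
        format_sample("hemostat_container_cpu_percent", labels, event["cpu_percent"]),
        format_sample("hemostat_container_memory_percent", labels, event["memory_percent"]),
    ]


# Hypothetical event payload, as it might arrive from a Redis channel
event = {"container": "web", "cpu_percent": 42.5, "memory_percent": 63.0}
lines = event_to_samples(event)
```

Each resulting line has the shape `metric_name{label="value"} number`, which is what Prometheus reads when it scrapes the endpoint.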

Data Flow

  1. HemoStat agents publish events to Redis

  2. Metrics Exporter subscribes and converts events to metrics

  3. Prometheus scrapes the metrics endpoint every 10 seconds

  4. Grafana queries Prometheus for visualization

  5. Alert rules trigger notifications on anomalies

Metrics Catalog

Container Health Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_container_cpu_percent | Gauge | CPU usage percentage per container |
| hemostat_container_memory_percent | Gauge | Memory usage percentage per container |
| hemostat_container_memory_bytes | Gauge | Memory usage in bytes |
| hemostat_container_restart_count | Gauge | Container restart count |
| hemostat_container_network_rx_bytes_total | Counter | Network bytes received |
| hemostat_container_network_tx_bytes_total | Counter | Network bytes transmitted |
| hemostat_container_blkio_read_bytes_total | Counter | Block I/O read bytes |
| hemostat_container_blkio_write_bytes_total | Counter | Block I/O write bytes |

Health Alert Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_health_alerts_total | Counter | Total health alerts by severity |
| hemostat_anomalies_detected_total | Counter | Total anomalies by type |

Analysis Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_analysis_requests_total | Counter | Total analysis requests by result type |
| hemostat_analysis_duration_seconds | Histogram | Analysis response time distribution |
| hemostat_analysis_confidence | Histogram | AI confidence score distribution |

Remediation Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_remediation_attempts_total | Counter | Total attempts by action and status |
| hemostat_remediation_duration_seconds | Histogram | Remediation execution time |
| hemostat_remediation_cooldown_active | Gauge | Cooldown status per container |

Alert Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_alerts_sent_total | Counter | Total alerts sent by channel and status |
| hemostat_alerts_deduped_total | Counter | Total deduplicated alerts |

System Metrics

| Metric | Type | Description |
|--------|------|-------------|
| hemostat_agent_uptime_seconds | Gauge | Agent uptime tracking |
| hemostat_redis_operations_total | Counter | Redis operations by type |
| hemostat_time_to_detection_seconds | Histogram | Time from issue to detection |
| hemostat_time_to_remediation_seconds | Histogram | Time from detection to fix |

Quick Start

1. Start Monitoring Stack

# Start all services including monitoring
docker compose up -d

# Or start just monitoring components
docker compose up -d redis metrics prometheus grafana

2. Access Dashboards

Grafana Dashboard

  • URL: http://localhost:3000

  • Username: admin

  • Password: admin (change on first login)

Prometheus Query UI

  • URL: http://localhost:9091

  • Direct metric queries and exploration

Metrics Endpoint

  • URL: http://localhost:9090/metrics

  • Raw Prometheus metrics

3. View HemoStat Overview Dashboard

  1. Log in to Grafana

  2. Navigate to Dashboards → HemoStat folder

  3. Select HemoStat Overview

The dashboard displays:

  • Container CPU and memory usage graphs

  • Health alerts by severity

  • Analysis performance metrics (response time, confidence)

  • Remediation attempts and success rates

  • Agent uptime and system health

Prometheus Configuration

Scrape Configuration

Prometheus is configured to scrape the Metrics Exporter every 10 seconds:

scrape_configs:
  - job_name: 'hemostat-metrics'
    static_configs:
      - targets: ['metrics:9090']
    scrape_interval: 10s
    scrape_timeout: 5s

Configuration file: monitoring/prometheus/prometheus.yml

Alert Rules

Pre-configured alert rules monitor system health:

Container Health Alerts

  • HighContainerCPU - CPU > 90% for 2+ minutes

  • HighContainerMemory - Memory > 90% for 2+ minutes

  • ExcessiveContainerRestarts - Frequent restart rate

System Performance Alerts

  • SlowAnalysisResponse - p95 latency > 10 seconds

  • HighRemediationFailureRate - Failure rate > 30%

  • HighAlertFailureRate - Notification failures > 20%

System Health Alerts

  • MetricsExporterDown - Exporter unavailable for 1+ minute

  • NoHealthAlertsDetected - No alerts for 30+ minutes (possible monitor issue)

Configuration file: monitoring/prometheus/rules/hemostat_alerts.yml
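
The "for 2+ minutes" qualifier means the condition must hold at every rule evaluation across that span before the alert fires, so a single spike is not enough. The sketch below models that behaviour on a list of `(timestamp_seconds, value)` samples. The function name and the sample data are illustrative, and Prometheus's actual evaluation loop tracks more state (inactive, pending, firing) than this shows.

```python
def alert_fires(samples, threshold, for_seconds):
    """Return True if value > threshold holds continuously for at least for_seconds.

    samples: list of (timestamp_seconds, value) pairs in time order,
    standing in for successive rule evaluations.
    """
    breach_start = None  # timestamp at which the current breach began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= for_seconds:
                return True
        else:
            breach_start = None  # condition cleared, so the pending timer resets
    return False


# CPU above 90% from t=0 to t=120, evaluated every 30 seconds
sustained = [(0, 95), (30, 96), (60, 93), (90, 97), (120, 94)]
# Same data, but with one dip below the threshold at t=60
interrupted = [(0, 95), (30, 96), (60, 85), (90, 97), (120, 94)]
```

With a threshold of 90 and a 120-second window, the first series fires and the second does not, because the dip at t=60 restarts the timer.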

Grafana Dashboards

HemoStat Overview Dashboard

The main dashboard provides 11 panels across four sections:

Summary Metrics

  • Monitored containers count

  • Health alerts per minute

  • Median analysis confidence

  • Remediations per minute

Container Health Graphs

  • CPU usage time-series per container

  • Memory usage time-series per container

System Activity

  • Health alerts by severity

  • Analysis requests by result type

  • Remediation attempts by action and status

Performance Metrics

  • Analysis duration percentiles (p50, p95, p99)

  • Remediation duration percentiles (p50, p95, p99)

Auto-Provisioning

Dashboards and data sources are automatically configured on startup:

  • Data Source: monitoring/grafana/provisioning/datasources/prometheus.yml

  • Dashboard: monitoring/grafana/provisioning/dashboards/hemostat_overview.json

PromQL Query Examples

Container Metrics

# Average CPU across all containers
avg(hemostat_container_cpu_percent)

# Containers with high memory usage
hemostat_container_memory_percent > 80

# Change in restart count over 5 minutes
# (the metric is a gauge, so use delta() rather than rate())
delta(hemostat_container_restart_count[5m])

Analysis Performance

# Analysis p95 latency
histogram_quantile(0.95, rate(hemostat_analysis_duration_seconds_bucket[5m]))

# Median confidence score
histogram_quantile(0.5, rate(hemostat_analysis_confidence_bucket[5m]))

# Analysis requests per second
rate(hemostat_analysis_requests_total[1m])
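
The percentile queries are estimates: `histogram_quantile` only sees cumulative bucket counts, so it interpolates linearly inside the bucket where the target rank falls. The sketch below is a simplified model of that interpolation, not Prometheus's implementation, and the bucket boundaries and counts are made up for the example.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by upper_bound,
    with at least one finite bucket and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound
                return buckets[-2][0]
            if count == lower_count:
                return lower_bound
            # Assume observations are spread evenly inside the bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-2][0]


# Made-up cumulative counts for hemostat_analysis_duration_seconds_bucket
buckets = [(1.0, 50), (5.0, 90), (10.0, 100), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
p50 = histogram_quantile(0.5, buckets)
```

Here the 95th percentile lands halfway through the 5 to 10 second bucket, so the estimate is 7.5 seconds regardless of where the observations in that bucket actually fell. Accuracy therefore depends on how well the bucket boundaries match the latencies of interest.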

Remediation Tracking

# Remediation success rate
sum(rate(hemostat_remediation_attempts_total{status="success"}[5m])) /
sum(rate(hemostat_remediation_attempts_total[5m]))

# Failed remediations per minute
rate(hemostat_remediation_attempts_total{status="failed"}[5m]) * 60

# Remediation duration p99
histogram_quantile(0.99, rate(hemostat_remediation_duration_seconds_bucket[5m]))
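
The success-rate query divides the rate of successful attempts by the rate of all attempts. Because both rates cover the same window, the result equals the ratio of the counter increases over that window. A sketch of that arithmetic, using made-up counter readings and ignoring counter resets (the function name is illustrative):

```python
def success_ratio(start, end):
    """Compute the share of successful attempts between two counter snapshots.

    start, end: dicts mapping a status label value to the counter reading
    at the beginning and end of the window. Counter resets are ignored.
    """
    increases = {status: end[status] - start.get(status, 0) for status in end}
    total = sum(increases.values())
    if total == 0:
        return None  # no attempts in the window, so no meaningful ratio
    return increases.get("success", 0) / total


# Hypothetical readings of hemostat_remediation_attempts_total, 5 minutes apart
start = {"success": 40, "failed": 10}
end = {"success": 58, "failed": 12}
ratio = success_ratio(start, end)
```

In this example 18 of 20 attempts in the window succeeded, giving a ratio of 0.9. A window with no attempts has no meaningful ratio, which is worth keeping in mind when alerting on the failure rate.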

System Health

# Agent uptime
hemostat_agent_uptime_seconds{agent_name="metrics"}

# Total health alerts in last hour
sum(increase(hemostat_health_alerts_total[1h]))

# Alert deduplication rate
rate(hemostat_alerts_deduped_total[5m])

Configuration

Environment Variables

Metrics Exporter

METRICS_PORT=9090          # HTTP server port
REDIS_HOST=redis           # Redis hostname
REDIS_PORT=6379            # Redis port
LOG_LEVEL=INFO             # Logging level

Prometheus

  • Data retention: 15 days (configurable via --storage.tsdb.retention.time)

  • Scrape interval: 10 seconds (configurable in prometheus.yml)

Grafana

GF_SECURITY_ADMIN_USER=admin           # Admin username
GF_SECURITY_ADMIN_PASSWORD=admin       # Admin password (change this!)
GF_USERS_ALLOW_SIGN_UP=false           # Disable user signup

Customizing Scrape Intervals

Edit monitoring/prometheus/prometheus.yml:

scrape_configs:
  - job_name: 'hemostat-metrics'
    scrape_interval: 5s  # More frequent scraping
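
When lowering `scrape_interval`, keep `scrape_timeout` at or below it, since Prometheus rejects a scrape config whose timeout exceeds its interval. The sketch below checks that relationship with a deliberately simplified duration parser that accepts only single-unit values such as `5s` or `1m`. Both function names are illustrative.

```python
import re

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Parse a single-unit duration like '10s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_fits(scrape_interval, scrape_timeout):
    """Return True if the timeout does not exceed the interval."""
    return parse_duration(scrape_timeout) <= parse_duration(scrape_interval)
```

With the shipped `scrape_timeout: 5s`, an interval of `5s` passes this check, while an interval of `3s` would also need a shorter timeout.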

Adjusting Alert Thresholds

Edit monitoring/prometheus/rules/hemostat_alerts.yml:

- alert: HighContainerCPU
  expr: hemostat_container_cpu_percent > 95  # Increase threshold
  for: 5m  # Wait longer before alerting

Changing Data Retention

Edit docker-compose.yml:

prometheus:
  command:
    - '--storage.tsdb.retention.time=30d'  # Keep data for 30 days
    - '--storage.tsdb.retention.size=10GB' # Limit storage size
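
To judge whether a longer retention fits the size limit, the Prometheus documentation suggests a rough estimate: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes). The sketch below applies that formula. The series count is an assumed workload, not a measured HemoStat figure, and the function name is illustrative.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_s, bytes_per_sample=2):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * 86400
    samples_per_second = active_series / scrape_interval_s
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 1,000 active series scraped every 10 seconds
size_15d = estimate_disk_bytes(15, 1000, 10)
size_30d = estimate_disk_bytes(30, 1000, 10)
```

Under these assumptions, 15 days needs roughly 0.26 GB and 30 days roughly 0.52 GB, well under a 10GB cap. Substitute the actual series count from your Prometheus instance before relying on the numbers.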

Troubleshooting

Metrics Not Appearing

Check Metrics Exporter

# View logs
docker compose logs metrics

# Verify endpoint is accessible
curl http://localhost:9090/metrics | grep hemostat_

# Check Redis connection
docker compose exec metrics python -c "import redis; print(redis.Redis(host='redis').ping())"
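
For a scripted version of the endpoint check, the sketch below extracts the distinct `hemostat_` metric names from a `/metrics` payload. The sample payload is hand-written for illustration, the `container` label is an assumption, and the function name is hypothetical.

```python
def hemostat_metric_names(exposition_text):
    """Return the sorted, distinct hemostat_* metric names in a /metrics payload."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value)
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("hemostat_"):
            names.add(name)
    return sorted(names)


# Hand-written sample payload; real output will contain many more series
sample = """\
# HELP hemostat_container_cpu_percent CPU usage percentage per container
# TYPE hemostat_container_cpu_percent gauge
hemostat_container_cpu_percent{container="web"} 42.5
hemostat_agent_uptime_seconds{agent_name="metrics"} 120.0
python_gc_objects_collected_total{generation="0"} 311.0
"""
found = hemostat_metric_names(sample)
```

To run it against the live exporter, pass in the decoded body returned by `urllib.request.urlopen("http://localhost:9090/metrics").read()`. An empty result suggests the exporter is up but has not yet received any events.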

Prometheus Not Scraping

Verify Targets

# Check target status
curl http://localhost:9091/api/v1/targets

# View in browser (macOS; use xdg-open on Linux)
open http://localhost:9091/targets
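
The targets API returns JSON, so the health check can be scripted. The sketch below lists targets that are not `up`, together with their last scrape error. The response body shown is trimmed and hand-written for illustration, and the function name is hypothetical.

```python
import json


def unhealthy_targets(payload):
    """List (scrape_url, last_error) for active targets whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Trimmed, hand-written response body for illustration
body = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"scrapeUrl": "http://metrics:9090/metrics", "health": "down",
       "lastError": "connection refused"}
    ]
  }
}
""")
problems = unhealthy_targets(body)
```

An empty list means every active target was scraped successfully on its last attempt.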

Check Connectivity

# Test from Prometheus container
docker compose exec prometheus wget -O- http://metrics:9090/metrics

Grafana Shows “No Data”

  1. Verify time range - Use “Last 15 minutes” or “Last 1 hour”

  2. Test data source - Go to Configuration → Data Sources → Test

  3. Check Prometheus - Query metrics directly in Prometheus UI

  4. Generate activity - Start agents to create events

Test Prometheus Connection

# From Grafana container
docker compose exec grafana wget -O- 'http://prometheus:9090/api/v1/query?query=up'
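
Queries more complex than `up` contain characters such as braces and quotes that must be URL-encoded before they are sent to the query API. A minimal sketch of building such a URL with the standard library (the helper name is illustrative):

```python
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    params = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"


simple = build_query_url("http://prometheus:9090", "up")
filtered = build_query_url("http://prometheus:9090", 'up{job="hemostat-metrics"}')
```

The first URL matches the one used in the command above. The second encodes the label matcher so that it survives both the shell and the HTTP request.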

High Resource Usage

Reduce Scrape Frequency

scrape_interval: 30s  # From 10s to 30s

Lower Retention Period

--storage.tsdb.retention.time=7d  # From 15d to 7d

Add Recording Rules for expensive queries:

# Recording rule for a frequently used query
# (belongs in the rules: list of a rule group in a rules file)
- record: job:hemostat_cpu_avg:5m
  expr: avg(hemostat_container_cpu_percent)

Integration with Existing Dashboard

HemoStat provides two complementary dashboards:

Streamlit Dashboard (Port 8501)

  • Real-time event streaming

  • Live container status grid

  • Event timeline and details

  • Active issues feed

Grafana Dashboard (Port 3000)

  • Historical metrics analysis

  • Performance trends over time

  • Alert visualization

  • Custom query exploration

Use Cases:

  • Streamlit: Real-time incident response and live monitoring

  • Grafana: Performance analysis, capacity planning, trend identification

Best Practices

Dashboard Design

  1. Use appropriate time ranges - Last 1 hour for real-time, 24 hours for trends

  2. Set meaningful thresholds - Based on SLAs and baseline performance

  3. Add annotations - Mark deployments and incidents on graphs

  4. Create variables - For dynamic container/service filtering

Alerting Strategy

  1. Set alert thresholds conservatively - Avoid alert fatigue

  2. Use multi-condition alerts - Combine metrics for context

  3. Configure notification channels - Slack, email, PagerDuty

  4. Test alert rules - Verify alerts trigger appropriately

Performance Optimization

  1. Use recording rules - Pre-compute expensive queries

  2. Limit cardinality - Avoid high-cardinality labels

  3. Set appropriate retention - Balance storage and query needs

  4. Monitor Prometheus itself - Track ingestion rate and query performance

Security

  1. Change default passwords - Especially Grafana admin password

  2. Enable authentication - For Prometheus and Grafana

  3. Use HTTPS - In production deployments

  4. Restrict network access - Limit to authorized networks

Advanced Topics

Adding Custom Metrics

Extend the Metrics Exporter in agents/hemostat_metrics/metrics.py:

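from prometheus_client import Counter


class MetricsExporter:  # illustrative name; use the exporter class in metrics.py
    def __init__(self):
        # Define the new metric once, when the exporter is created
        self.custom_metric = Counter(
            "hemostat_custom_total",
            "Description of custom metric",
            ["label1", "label2"],
        )

    def handle_event(self, event):  # illustrative handler name
        # Update the metric in the event handler
        self.custom_metric.labels(label1="value1", label2="value2").inc()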

Creating Custom Dashboards

  1. Design dashboard in Grafana UI

  2. Export as JSON: Dashboard Settings → JSON Model

  3. Save to monitoring/grafana/provisioning/dashboards/

  4. Restart Grafana to load the new dashboard

Setting Up Alertmanager

For advanced alert routing and aggregation:

# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Remote Write Configuration

For long-term storage:

# In prometheus.yml
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"

Service URLs

| Service | URL | Purpose |
|---------|-----|---------|
| Grafana | http://localhost:3000 | Visualization dashboards |
| Prometheus | http://localhost:9091 | Query UI and alerts |
| Metrics Exporter | http://localhost:9090/metrics | Raw metrics endpoint |
| Streamlit Dashboard | http://localhost:8501 | Real-time monitoring |

References

  • Prometheus Documentation: https://prometheus.io/docs/

  • Grafana Documentation: https://grafana.com/docs/

  • PromQL Guide: https://prometheus.io/docs/prometheus/latest/querying/basics/

  • Metrics Exporter Source: agents/hemostat_metrics/

  • Configuration Files: monitoring/prometheus/ and monitoring/grafana/
