# System Architecture

HemoStat uses a multi-agent architecture where specialized agents communicate through Redis pub/sub to monitor, analyze, and remediate container health issues.

## System Overview

```{mermaid}
graph TD
    A["Monitor Agent<br/>Polls Docker every 30s"] -->|publishes health_alert| B["Analyzer Agent<br/>AI-powered root cause analysis"]
    B -->|publishes remediation_needed| C["Responder Agent<br/>Executes safe fixes"]
    C -->|publishes remediation_complete| D["Alert Agent<br/>Sends notifications"]
    D -->|updates Redis| E["Dashboard<br/>Real-time visualization"]
    B -->|publishes false_alarm| D
    A -->|events| M["Metrics Exporter<br/>Prometheus metrics"]
    B -->|events| M
    C -->|events| M
    D -->|events| M
    M -->|scraped by| P["Prometheus + Grafana<br/>Historical analysis"]
```

## Agent Roles and Responsibilities

### Monitor Agent

- Continuously polls Docker container metrics every 30 seconds
- Collects CPU, memory, disk, and process status
- Detects anomalies using statistical analysis
- Publishes `health_alert` events to Redis

### Analyzer Agent

- Subscribes to `health_alert` channel
- Performs AI-powered root cause analysis using Claude or GPT-4
- Distinguishes real issues from false alarms with confidence scoring
- Publishes `remediation_needed` or `false_alarm` events

### Responder Agent

- Subscribes to `remediation_needed` channel
- Executes remediation actions (restart, scale, cleanup, exec)
- Enforces comprehensive safety constraints:
  - Cooldown periods (1 hour default)
  - Circuit breakers (max 3 retries/hour)
  - Dry-run mode support
  - Audit logging for compliance
- Publishes `remediation_complete` events

### Alert Agent

- Subscribes to `remediation_complete` and `false_alarm` channels
- Sends notifications to external systems (Slack webhooks)
- Stores events in Redis for dashboard consumption
- Provides comprehensive audit trail
- Implements event deduplication to prevent notification spam

### Metrics Exporter Agent

- Subscribes to all HemoStat event channels
- Converts events into Prometheus-compatible metrics
- Exposes HTTP endpoint for Prometheus scraping (port 9090)
- Tracks container health, agent performance, and system operations
- Enables historical analysis and trend identification via Grafana

See the [Monitoring documentation](monitoring.md) for detailed information on metrics and observability.

## Communication Model

All agents communicate via Redis pub/sub channels:

```text
hemostat:health_alert          (Monitor → Analyzer)
hemostat:remediation_needed    (Analyzer → Responder)
hemostat:remediation_complete  (Responder → Alert)
hemostat:false_alarm           (Analyzer → Alert)
```

## Data Flow

1. **Monitor** collects container metrics every 30 seconds
2. **Monitor** publishes `health_alert` event if anomalies detected
3. **Analyzer** subscribes to `health_alert` channel
4. **Analyzer** processes alert, decides if real issue
5. If real, **Analyzer** publishes `remediation_needed` event
6. **Responder** subscribes to `remediation_needed` channel
7. **Responder** checks safety constraints, executes fix
8. **Responder** publishes `remediation_complete` event
9. **Alert** subscribes to `remediation_complete` channel
10. **Alert** sends Slack notification, updates Redis
11. **Dashboard** reads from Redis, displays in real-time

## Redis Key Structure

```text
hemostat:stats:        Current metrics for container
hemostat:remediation:  Action history for container
hemostat:events:       Event log by type
hemostat:containers    List of monitored containers
```

## Safety Mechanisms

### Cooldown Period

- After restart, 1 hour cooldown before next restart
- Prevents restart loops and cascading failures

### Max Actions Per Hour

- Maximum 3 restarts per hour per container
- Circuit breaker for repeated failures

### Fallback Analysis

- If Claude fails, fall back to rule-based analysis
- System continues operating without AI

### Error Handling

- All agents catch and log exceptions
- Graceful degradation, no cascading failures
- Automatic restart on failure

## Scaling Considerations

### Horizontal Scaling

- Run multiple Monitor instances (one per cluster)
- Run multiple Analyzer instances (share load via Redis)
- Run multiple Responder instances (Redis ensures atomicity)
- Keep single Alert and Dashboard

### Performance Characteristics

- Monitor: O(n) where n = number of containers
- Analyzer: O(1) per alert, limited by Claude API rate
- Responder: O(1) per remediation request
- Alert: O(1) per completion event
- Dashboard: O(1) for display updates

### Redis as Bottleneck

For large-scale deployments:

- Use Redis Cluster for horizontal scaling
- Add message queue (RabbitMQ) for very high volume
- Add persistent storage for audit logs

## Extensibility
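Because agents only interact through channels, extending the system mostly means subscribing to one channel and publishing to another. The following is a minimal sketch of that pattern, not the project's actual API: `InMemoryBus` is a hypothetical in-process stand-in for Redis pub/sub, `EscalationAgent` and the `hemostat:escalation` channel are invented for illustration, and the `status` and `container` payload fields are assumptions about the event shape. A real agent would build on the `HemoStatAgent` base class instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """Hypothetical stand-in for Redis pub/sub; delivers events synchronously."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Handler) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, event: dict) -> int:
        # Like Redis PUBLISH, report how many subscribers received the event.
        handlers = list(self._subscribers.get(channel, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


class EscalationAgent:
    """Hypothetical new agent: escalates remediations that did not succeed."""

    IN_CHANNEL = "hemostat:remediation_complete"
    OUT_CHANNEL = "hemostat:escalation"  # assumed channel, not part of HemoStat

    def __init__(self, bus: InMemoryBus) -> None:
        self.bus = bus
        bus.subscribe(self.IN_CHANNEL, self.handle)

    def handle(self, event: dict) -> None:
        # "status" and "container" are assumed payload fields.
        if event.get("status") != "success":
            self.bus.publish(
                self.OUT_CHANNEL,
                {"container": event.get("container"), "reason": "remediation_failed"},
            )
```

Keeping the agent's only dependencies a `subscribe` and a `publish` call means the same class can be exercised in tests without a Redis server, then wired to a real connection in deployment.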
### Adding New Agents

1. Create `agents/my_agent/my_agent.py`
2. Import `HemoStatAgent` from `agents.agent_base`
3. Override `run()` method
4. Subscribe to relevant Redis channels
5. Publish events to specific channels
6. Add Dockerfile and update `docker-compose.yml`

See the [API Reference](api/agents.rst) for the `HemoStatAgent` base class documentation.

### Adding New Remediation Actions

1. Edit `agents/hemostat_responder/responder.py`
2. Add new method (e.g., `scale_container()`)
3. Update `_handle_remediation_request()` to call new method
4. Update Analyzer to suggest new action

### Customizing Monitor Thresholds

Edit `agents/hemostat_monitor/monitor.py` to adjust detection thresholds:

```python
self.thresholds = {
    'memory_pct': 80,  # Change to 70 for earlier alerts
    'cpu_pct': 85      # Change to 75 for earlier alerts
}
```

## Deployment Options

### Local Docker Compose (Demo)

- Simplest setup
- All services on single machine
- Perfect for testing and development

### Kubernetes

- Horizontal scaling
- High availability
- Production-grade
- More complex setup

### AWS ECS

- Managed containers
- Auto-scaling
- Integration with CloudWatch

### Multi-Cloud

- Deploy across multiple cloud providers
- Redis cluster for centralized state
- Cloud-specific agents for remediation

## Monitoring HemoStat Itself

HemoStat includes comprehensive monitoring through Prometheus and Grafana. See the [Monitoring documentation](monitoring.md) for complete details.

### Key Metrics to Track

1. Monitor cycle time (should be ~30s)
2. Analyzer response time (should be <5s)
3. Responder execution time (should be <10s)
4. Alert notification latency
5. False alarm rate (should be low)
6. Mean time to detection (should be <30s)
7. Mean time to fix (should be ~13s)

### Available Metrics

- **Container Health**: CPU, memory, restarts, network I/O
- **Analysis Performance**: Duration, confidence scores, request rates
- **Remediation Tracking**: Attempts, success rates, execution time
- **System Health**: Agent uptime, Redis operations, alert rates

Access metrics at:

- **Grafana Dashboard**: http://localhost:3000
- **Prometheus UI**: http://localhost:9091
- **Raw Metrics**: http://localhost:9090/metrics

### Health Checks

- All agents restart automatically on failure
- Redis connectivity verified at startup
- Docker socket connectivity verified at startup
- Health check endpoints available
- Prometheus monitors agent availability

## Security Considerations

### API Keys

- Store in environment variables, not code
- Never commit `.env` file
- Rotate keys regularly

### Docker Socket

- Only accessible from within container network
- Read-only where possible
- Audit all container operations

### Redis Access

- Localhost only by default
- Add authentication for production
- Use TLS for remote access

### Logs and Audit Trail

- All actions logged
- No sensitive data in logs
- Retain logs for compliance
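One way to keep sensitive data out of logs is to mask secrets before a record reaches any handler. The sketch below uses Python's standard `logging.Filter` for this; it is an illustration rather than HemoStat's own logging setup. The `redact` helper, the `RedactingFilter` class, and the set of key names treated as secrets (`api_key`, `token`, `secret`, `password`) are all assumptions.

```python
import logging
import re

# Assumed heuristic: "name=value" or "name: value" pairs whose name looks secret.
_SECRET_PATTERN = re.compile(
    r"\b([\w-]*(?:api[_-]?key|token|secret|password)[\w-]*)(\s*[=:]\s*)(\S+)",
    re.IGNORECASE,
)


def redact(text: str) -> str:
    """Replace the value of any secret-looking key with a fixed marker."""
    return _SECRET_PATTERN.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Masks secret-looking values in every record passing through it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then mask and freeze it
        # so later formatting cannot reintroduce the raw values.
        record.msg = redact(record.getMessage())
        record.args = None
        return True
```

A filter like this can be attached with `handler.addFilter(RedactingFilter())`. Pattern-based masking only catches secrets that appear next to a recognizable key name, so it complements, rather than replaces, the practice of not logging credentials in the first place.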