System Monitoring
Real-time infrastructure health, background job monitoring, and centralized log access for platform operations.
Overview
The System Monitoring feature provides PenguinMails operations team with comprehensive visibility into platform health, background job processing, and application logs. It serves as the primary dashboard for proactive monitoring, incident response, and performance troubleshooting.
Key Capabilities
- Infrastructure Health: Real-time service status and performance metrics
- Queue Monitoring: Hybrid PostgreSQL + Redis job queue management
- Log Viewer: Centralized application log search and analysis
- IP Reputation: Email deliverability monitoring
- Alerting Integration: PagerDuty and Slack notifications
Sub-Features
Queue Monitoring
Monitor and manage background jobs across the hybrid PostgreSQL + Redis queue system.
Infrastructure Monitoring
Track server health, IP reputation, and service availability.
Log Viewer
Search and analyze application logs for debugging and troubleshooting.
Metrics Glossary
Reference guide for all available metrics, their sources, and update frequencies.
User Stories
Operations Engineer (Queue Management)
“As an ops engineer, I need to check if the email sending queue is backing up during a high-volume campaign launch.”
Workflow:
- Navigate to
/dashboard/system/queues- Check
queue:email-sendingactive jobs- Review queue depth and processing rate
- If backed up, scale workers or pause low-priority queues
- Monitor until queue drains
DevOps Engineer (Incident Response)
“As a DevOps engineer, I need to investigate SMTP service degradation reported by monitoring alerts.”
Workflow:
- Navigate to
/dashboard/system/infrastructure- Check SMTP service status card (shows “Degraded”)
- View infrastructure alerts for root cause
- Navigate to log viewer to search SMTP errors
- Identify and resolve issue
Developer (Debugging)
“As a developer, I need to trace a production error to its source using request ID from customer support ticket.”
Workflow:
- Navigate to
/dashboard/system/logs- Search by request ID
- Filter to Error level logs
- Expand error entry for stack trace
- Identify code path and deploy fix
Integration Architecture
Queue System Integration
- Hybrid Architecture: PostgreSQL for durability + Redis for speed
- Queue Types: High, Normal, Low priority
- Job Categories: Email sending, analytics aggregation, warmup processing
- Monitoring: Real-time job counts and failure tracking
See Queue System Implementation for technical details.
Monitoring Stack
[!NOTE] For launch, PostHog handles basic monitoring and alerting. The tools below are planned for evaluation in a 2026 spike.
- Metrics Collection: Prometheus for time-series data (2026 Spike)
- Visualization: Grafana dashboards (2026 Spike)
- Historical Storage: ClickHouse for long-term metrics (2026 Spike)
- Alerting: PostHog for launch, PagerDuty integration planned
Logging Infrastructure
[!NOTE] For launch, use structured JSON logging to standard output. Centralized log aggregation is planned for the 2026 observability spike.
- Centralized Logs: Elasticsearch for aggregation (2026 Spike)
- Application Logs: Structured JSON logging ✅
- Error Tracking: Sentry integration (2026 Spike)
- Retention: 90 days for operational logs, 7 years for audit logs
Technical Considerations
Performance
- Real-Time Polling: Queue dashboard updates every 5 seconds
- Log Search: Elasticsearch indexes ~1TB logs with <1s query time (2026 Spike)
- Dashboard Load: <500ms with cached infrastructure metrics
Scalability
- Queue Depth: Handles millions of jobs per day
- Log Volume: Processes ~100GB logs daily
- Metric Cardinality: 10,000+ time-series tracked
Reliability
- Monitoring Uptime: 99.9% availability target
- Alert Latency: <1 minute for critical issues
- Data Retention: 90 days operational, 1 year compliance
Related Documentation
Reference
- Metrics Glossary - Definitions of all system metrics
Route Specifications
- Platform Admin Routes - Queue Monitoring - Queue UI spec
- Platform Admin Routes - Infrastructure - Infrastructure UI spec
- Platform Admin Routes - Logs - Log viewer UI spec
API Documentation
- System Monitoring API - Health and queue endpoints
- Queue API - Job management
Implementation
- Queue System Architecture - Queue design
- Logging Standards - Log format and best practices
Operations
- Incident Response - Troubleshooting procedures
- Monitoring Setup - Alerting configuration
Future Enhancements
Q2 2026
- Automated Remediation: Auto-retry failed jobs based on error type
- Predictive Alerting: ML-based anomaly detection
- Custom Dashboards: Per-team monitoring views
Monitoring & Compliance Spike
- Websocket Investigation: Evaluate websocket architecture for real-time compliance alerts and monitoring updates.
- Stakeholder Requirements: Gather detailed needs for automated compliance reporting and real-time dashboards.
Q3 2026
- Cost Monitoring: Infrastructure cost tracking per tenant
- Performance Profiling: APM integration for slow requests
- Capacity Planning: Automated resource scaling recommendations