System Monitoring

Real-time infrastructure health, background job monitoring, and centralized log access for platform operations.

Overview

The System Monitoring feature gives the PenguinMails operations team visibility into platform health, background job processing, and application logs. It is the primary dashboard for proactive monitoring, incident response, and performance troubleshooting.

Key Capabilities

Infrastructure Health: Real-time service status and performance metrics
Queue Monitoring: Hybrid PostgreSQL + Redis job queue management
Log Viewer: Centralized application log search and analysis
IP Reputation: Email deliverability monitoring
Alerting Integration: PagerDuty and Slack notifications (PostHog handles alerting at launch; PagerDuty integration is planned)

Sub-Features

Queue Monitoring

Monitor and manage background jobs across the hybrid PostgreSQL + Redis queue system.

Infrastructure Monitoring

Track server health, IP reputation, and service availability.

Log Viewer

Search and analyze application logs for debugging and troubleshooting.

Metrics Glossary

Reference guide for all available metrics, their sources, and update frequencies.

User Stories

Operations Engineer (Queue Management)

“As an ops engineer, I need to check if the email sending queue is backing up during a high-volume campaign launch.”

Workflow:

Navigate to /dashboard/system/queues

Check the active job count for queue:email-sending

Review queue depth and processing rate

If backed up, scale workers or pause low-priority queues

Monitor until queue drains
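
The "is it backing up?" decision in this workflow can be sketched as a small calculation over queue depth and processing rate. This is a minimal illustration, not the dashboard's actual logic: `QueueSnapshot`, its fields, and the 15-minute drain limit are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time reading for one queue."""
    name: str
    waiting_jobs: int        # jobs not yet picked up by a worker
    enqueue_rate: float      # jobs added per second
    processing_rate: float   # jobs completed per second


def seconds_to_drain(snapshot: QueueSnapshot) -> Optional[float]:
    """Estimate time until the queue is empty, or None if it is not draining."""
    if snapshot.waiting_jobs == 0:
        return 0.0
    net_rate = snapshot.processing_rate - snapshot.enqueue_rate
    if net_rate <= 0:
        return None  # backlog is stable or growing
    return snapshot.waiting_jobs / net_rate


def is_backed_up(snapshot: QueueSnapshot, max_drain_seconds: float = 900.0) -> bool:
    """Treat a queue as backed up if it will not drain within the assumed limit."""
    eta = seconds_to_drain(snapshot)
    return eta is None or eta > max_drain_seconds
```

For example, a snapshot with 12,000 waiting jobs, 40 jobs/s arriving, and 50 jobs/s completing drains in an estimated 1,200 seconds, which exceeds the assumed 15-minute limit and would suggest scaling workers or pausing low-priority queues.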

DevOps Engineer (Incident Response)

“As a DevOps engineer, I need to investigate SMTP service degradation reported by monitoring alerts.”

Workflow:

Navigate to /dashboard/system/infrastructure

Check SMTP service status card (shows “Degraded”)

View infrastructure alerts for root cause

Open the log viewer and search for SMTP errors

Identify and resolve issue

Developer (Debugging)

“As a developer, I need to trace a production error to its source using the request ID from a customer support ticket.”

Workflow:

Navigate to /dashboard/system/logs

Search by request ID

Filter to Error-level logs

Expand error entry for stack trace

Identify code path and deploy fix
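
Because launch logging is structured JSON on standard output, the same search can be approximated outside the log viewer. The sketch below filters JSON-lines output by request ID and error level. The field names `request_id`, `level`, and `message` are assumptions for illustration; the real schema is defined in Logging Standards.

```python
import json
from typing import Iterable, Iterator


def find_errors_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield error-level entries for one request from JSON-lines log output."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines rather than fail the whole search
        if not isinstance(entry, dict):
            continue
        level = str(entry.get("level", "")).lower()
        # Field names here are assumed, not taken from the logging standard.
        if entry.get("request_id") == request_id and level == "error":
            yield entry
```

A typical use would be `for entry in find_errors_for_request(open("app.log"), "req-42"): print(entry)`, where `app.log` and `req-42` are placeholder values.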

Integration Architecture

Queue System Integration

Hybrid Architecture: PostgreSQL for durability + Redis for speed
Queue Types: High, Normal, Low priority
Job Categories: Email sending, analytics aggregation, warmup processing
Monitoring: Real-time job counts and failure tracking

See Queue System Implementation for technical details.
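
To illustrate the durability-plus-speed split, here is a minimal in-memory sketch. It does not use PostgreSQL or Redis: a dict stands in for the durable store and per-priority deques stand in for the fast queue. The class, method, and field names are hypothetical, and the actual design is the one described in Queue System Implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional

PRIORITIES = ("high", "normal", "low")  # queue types, highest priority first


def _empty_fast_queues() -> Dict[str, Deque[int]]:
    return {priority: deque() for priority in PRIORITIES}


@dataclass
class HybridQueue:
    # Stand-in for the durable store (PostgreSQL in the real system).
    durable: Dict[int, dict] = field(default_factory=dict)
    # Stand-in for the fast queue (Redis in the real system).
    fast: Dict[str, Deque[int]] = field(default_factory=_empty_fast_queues)
    _next_id: int = 1

    def enqueue(self, category: str, payload: dict, priority: str = "normal") -> int:
        if priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {priority}")
        job_id = self._next_id
        self._next_id += 1
        # Record the job durably first, then publish it to the fast queue.
        self.durable[job_id] = {
            "category": category,
            "payload": payload,
            "priority": priority,
            "status": "pending",
        }
        self.fast[priority].append(job_id)
        return job_id

    def dequeue(self) -> Optional[int]:
        """Pop the oldest job from the highest-priority non-empty queue."""
        for priority in PRIORITIES:
            if self.fast[priority]:
                job_id = self.fast[priority].popleft()
                self.durable[job_id]["status"] = "active"
                return job_id
        return None

    def rebuild_fast_queue(self) -> None:
        """Repopulate the fast queue from pending durable records."""
        self.fast = _empty_fast_queues()
        for job_id in sorted(self.durable):
            job = self.durable[job_id]
            if job["status"] == "pending":
                self.fast[job["priority"]].append(job_id)
```

The point of the sketch is the ordering: because each job is recorded durably before it is published, losing the fast queue loses no pending work, and `rebuild_fast_queue` can restore it from the durable records.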

Monitoring Stack

Note: For launch, PostHog handles basic monitoring and alerting. The tools below are planned for evaluation in a 2026 spike.

Metrics Collection: Prometheus for time-series data (2026 Spike)
Visualization: Grafana dashboards (2026 Spike)
Historical Storage: ClickHouse for long-term metrics (2026 Spike)
Alerting: PostHog for launch, PagerDuty integration planned

Logging Infrastructure

Note: For launch, use structured JSON logging to standard output. Centralized log aggregation is planned for the 2026 observability spike.

Centralized Logs: Elasticsearch for aggregation (2026 Spike)
Application Logs: Structured JSON logging ✅
Error Tracking: Sentry integration (2026 Spike)
Retention: 90 days for operational logs, 7 years for audit logs
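
As an illustration of structured JSON logging to standard output, the following sketch uses only Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`, `request_id`, `stack`) and the helper `build_logger` are assumptions for the example; the authoritative format is the one in Logging Standards.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "logger": record.name,
            "message": record.getMessage(),
        }
        # request_id is an assumed field, passed via logging's `extra` argument.
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            entry["request_id"] = request_id
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout (or a given stream)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger().error("SMTP connection refused", extra={"request_id": "req-123"})` would emit one JSON line that a log collector can later index without parsing free text.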

Technical Considerations

Performance

Real-Time Polling: Queue dashboard updates every 5 seconds
Log Search: Elasticsearch target of <1s query time across ~1TB of indexed logs (2026 Spike)
Dashboard Load: <500ms with cached infrastructure metrics
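
One way to keep dashboard loads fast while the UI polls frequently is a short-lived cache in front of the metrics source. The sketch below is a generic time-to-live cache, not the platform's implementation; `TTLCache`, `get_or_load`, and the choice of a 5-second lifetime to match the polling interval are assumptions.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Cache values for a fixed number of seconds, then reload on demand."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: skip the expensive lookup
        value = loader()
        self._entries[key] = (now, value)
        return value
```

With `TTLCache(ttl_seconds=5.0)`, any number of dashboard requests inside a 5-second window trigger at most one call to the underlying metrics loader per key.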

Scalability

Queue Throughput: Handles millions of jobs per day
Log Volume: Processes ~100GB of logs daily
Metric Cardinality: 10,000+ time-series tracked
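
Daily totals are easier to reason about as average per-second rates. The helper below is a simple unit conversion, assuming decimal gigabytes (1 GB = 10^9 bytes). Averages understate peak load, for example during a campaign launch.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(daily_total: float) -> float:
    """Convert a per-day total into an average per-second rate."""
    return daily_total / SECONDS_PER_DAY


jobs_per_second = per_second(1_000_000)       # one million jobs per day
log_bytes_per_second = per_second(100 * 10**9)  # ~100GB of logs per day
```

Each million jobs per day works out to roughly 11.6 jobs per second on average, and ~100GB of logs per day to roughly 1.16 MB per second.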

Reliability

Monitoring Uptime: 99.9% availability target
Alert Latency: <1 minute for critical issues
Data Retention: 90 days for operational data, 1 year for compliance data (audit logs are retained for 7 years; see Logging Infrastructure)
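
An availability target translates directly into a downtime budget. The function below does that arithmetic; the 30-day default period is an assumption, since the document does not state the measurement window.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)
```

Under that assumption, a 99.9% target allows about 43.2 minutes of monitoring downtime per 30-day period.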

Reference

Metrics Glossary - Definitions of all system metrics

Route Specifications

Platform Admin Routes - Queue Monitoring - Queue UI spec
Platform Admin Routes - Infrastructure - Infrastructure UI spec
Platform Admin Routes - Logs - Log viewer UI spec

API Documentation

System Monitoring API - Health and queue endpoints
Queue API - Job management

Implementation

Queue System Architecture - Queue design
Logging Standards - Log format and best practices

Operations

Incident Response - Troubleshooting procedures
Monitoring Setup - Alerting configuration

Future Enhancements

Q2 2026

Automated Remediation: Auto-retry failed jobs based on error type
Predictive Alerting: ML-based anomaly detection
Custom Dashboards: Per-team monitoring views
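
Since automated remediation is still planned, the following is only a sketch of what retrying by error type could look like. The error type names, attempt limits, and delays are all hypothetical, and exponential backoff is one common choice rather than a decided design.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int            # retries allowed after the first failure
    base_delay_seconds: float    # delay before the first retry


# Hypothetical error types and limits, for illustration only.
POLICIES: Dict[str, RetryPolicy] = {
    "smtp_timeout": RetryPolicy(max_attempts=5, base_delay_seconds=30.0),
    "rate_limited": RetryPolicy(max_attempts=8, base_delay_seconds=60.0),
    "invalid_recipient": RetryPolicy(max_attempts=0, base_delay_seconds=0.0),  # permanent
}
DEFAULT_POLICY = RetryPolicy(max_attempts=3, base_delay_seconds=10.0)


def next_retry_delay(error_type: str, attempts_so_far: int) -> Optional[float]:
    """Return seconds to wait before the next retry, or None to stop retrying."""
    policy = POLICIES.get(error_type, DEFAULT_POLICY)
    if attempts_so_far >= policy.max_attempts:
        return None
    # Exponential backoff: the delay doubles with each retry already made.
    return policy.base_delay_seconds * (2 ** attempts_so_far)
```

Separating transient errors (retry with backoff) from permanent ones (never retry) keeps failed jobs from being retried when a retry cannot succeed.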

Monitoring & Compliance Spike

WebSocket Investigation: Evaluate a WebSocket architecture for real-time compliance alerts and monitoring updates.
Stakeholder Requirements: Gather detailed needs for automated compliance reporting and real-time dashboards.

Q3 2026

Cost Monitoring: Infrastructure cost tracking per tenant
Performance Profiling: APM integration for slow requests
Capacity Planning: Automated resource scaling recommendations