System Monitoring API

Parent API: Platform API
URL Prefix: /api/v1/platform/system

Overview

The System Monitoring API provides endpoints for infrastructure health monitoring, background job queue management, centralized log access, and IP reputation tracking.

Queue Endpoints

List Queues

Method: GET
URL: /api/v1/platform/system/queues
Purpose: Get status of all background job queues

Response:

{
  "success": true,
  "data": {
    "queues": [
      {
        "name": "queue:email-sending:high",
        "priority": "high",
        "active_jobs": 12,
        "waiting_jobs": 45,
        "completed_last_hour": 1234,
        "failed_jobs": 2,
        "status": "healthy"
      },
      {
        "name": "queue:email-sending",
        "priority": "normal",
        "active_jobs": 50,
        "waiting_jobs": 2500,
        "completed_last_hour": 5678,
        "failed_jobs": 8,
        "status": "warning"
      }
    ]
  }
}

Status Values:

healthy: waiting < 100
warning: waiting 100-1000
critical: waiting > 1000 or failures > 10

Get Queue Jobs

Method: GET
URL: /api/v1/platform/system/queues/{queue_name}/jobs
Purpose: List jobs in a specific queue

Query Parameters:

status: active waiting failed completed
limit (optional): Max results (default: 50, max: 100)
offset (optional): Pagination offset

Response:

{
  "success": true,
  "data": {
    "jobs": [
      {
        "id": "job_123",
        "queue": "queue:email-sending",
        "status": "failed",
        "payload": {
          "campaign_id": "cmp_abc",
          "recipient_count": 1000
        },
        "error": {
          "message": "SMTP connection timeout",
          "stack_trace": "..."
        },
        "attempts": 3,
        "created_at": "2025-12-04T10:00:00Z",
        "updated_at": "2025-12-04T10:15:00Z"
      }
    ],
    "pagination": {
      "total": 125,
      "limit": 50,
      "offset": 0
    }
  }
}

Pause Queue

Method: POST
URL: /api/v1/platform/system/queues/{queue_name}/pause
Purpose: Emergency stop for troubleshooting

Response:

{
  "success": true,
  "data": {
    "queue": "queue:email-sending",
    "status": "paused",
    "timestamp": "2025-12-04T12:00:00Z"
  }
}

Resume Queue

Method: POST
URL: /api/v1/platform/system/queues/{queue_name}/resume
Purpose: Re-enable queue processing

Retry Failed Job

Method: POST
URL: /api/v1/platform/system/jobs/{job_id}/retry
Purpose: Re-queue a failed job

Response:

{
  "success": true,
  "data": {
    "job_id": "job_123",
    "status": "queued",
    "attempts_reset": true
  }
}

Delete Job

Method: DELETE
URL: /api/v1/platform/system/jobs/{job_id}
Purpose: Permanently remove a job from queue

Infrastructure Endpoints

Get Infrastructure Health

Method: GET
URL: /api/v1/platform/system/infrastructure/health
Purpose: Real-time service health status

Response:

{
  "success": true,
  "data": {
    "services": [
      {
        "name": "API Server",
        "status": "healthy",
        "response_time_ms": {
          "avg": 120,
          "p95": 250,
          "p99": 450
        },
        "error_rate_percent": 0.1
      },
      {
        "name": "SMTP Service",
        "status": "degraded",
        "queue_backlog": 5000,
        "reason": "High volume during campaign launch"
      },
      {
        "name": "OLTP Database",
        "status": "healthy",
        "connection_pool_usage_percent": 45,
        "query_latency_ms": 8
      }
    ]
  }
}

Status Values:

healthy: All metrics normal
degraded: Performance impact
down: Service unavailable

Get IP Reputation

Method: GET
URL: /api/v1/platform/system/infrastructure/ip-reputation
Purpose: Monitor email sending reputation

Response:

{
  "success": true,
  "data": {
    "ips": [
      {
        "ip_address": "203.0.113.12",
        "provider": "AWS",
        "reputation_score": 92,
        "blacklists": ["spamhaus"],
        "daily_volume": 50000,
        "status": "critical"
      },
      {
        "ip_address": "203.0.113.13",
        "provider": "SendGrid",
        "reputation_score": 98,
        "blacklists": [],
        "daily_volume": 75000,
        "status": "good"
      }
    ]
  }
}

Get Infrastructure Alerts

Method: GET
URL: /api/v1/platform/system/infrastructure/alerts
Purpose: Recent infrastructure alerts

Query Parameters:

hours (optional): Hours of history (default: 24, max: 168)
severity (optional): error warning info

Response:

{
  "success": true,
  "data": {
    "alerts": [
      {
        "id": "alert_123",
        "timestamp": "2025-12-04T11:30:00Z",
        "severity": "error",
        "service": "SMTP",
        "message": "Queue size exceeded 10k threshold",
        "acknowledged": false
      }
    ]
  }
}

Log Endpoints

Search Logs

Method: GET
URL: /api/v1/platform/system/logs
Purpose: Search and filter application logs

Query Parameters:

query: Search term (full-text or regex)
level: error warning info debug
service: Filter by service name
start_time: ISO 8601 timestamp
end_time: ISO 8601 timestamp
request_id (optional): Filter by request ID
user_id (optional): Filter by user ID
limit (optional): Max results (default: 50, max: 100)
offset (optional): Pagination offset

Response:

{
  "success": true,
  "data": {
    "logs": [
      {
        "id": "log_123",
        "timestamp": "2025-12-04T12:15:30Z",
        "level": "error",
        "service": "api",
        "message": "Payment processing failed",
        "request_id": "req_abc123",
        "user_id": "usr_xyz",
        "metadata": {
          "error_type": "PaymentError",
          "stack_trace": "..."
        }
      }
    ],
    "pagination": {
      "total": 345,
      "limit": 50,
      "offset": 0
    }
  }
}

Export Logs

Method: GET
URL: /api/v1/platform/system/logs/export
Purpose: Download filtered logs

Query Parameters: Same as search logs, plus:

format: csv json txt

Response (Async job):

{
  "success": true,
  "data": {
    "export_id": "export_logs_abc",
    "status": "processing",
    "download_url": null
  }
}

Error Handling

400 Bad Request: Invalid parameters
401 Unauthorized: Missing or invalid token
403 Forbidden: User does not have Ops role
404 Not Found: Queue, job, or export not found
429 Too Many Requests: Rate limit exceeded

System Monitoring Feature - Feature overview
Queue Monitoring - Queue management details
Platform Admin Routes - Queues - UI specification
Queue System Implementation - Architecture

System Monitoring API

Overview

Queue Endpoints

List Queues

Get Queue Jobs

Pause Queue

Resume Queue

Retry Failed Job

Delete Job

Infrastructure Endpoints

Get Infrastructure Health

Get IP Reputation

Get Infrastructure Alerts

Log Endpoints

Search Logs

Export Logs

Error Handling

Related Documentation