Content Database Architecture: Analysis, Gotchas, and Proposals

This document analyzes the current Content DB design (as described in the content-database schema guide) and provides a pragmatic, implementation-ready refinement that aligns with:

4-tier architecture:
- OLTP: operational entities and message metadata.
- Content DB: heavy content (bodies, attachments, archives).
- OLAP: aggregates and analytics (no heavy bodies).
- Queue: pipelines and async processing.
Separation of concerns:
- Content DB is NOT a logging, monitoring, or analytics sink.
- External analytics/logging handles telemetry per the external analytics logging documentation.

This is the canonical interpretation to use when reconciling legacy content DB docs.

1. Current Model (from legacy guide)

Key constructs from the content-database schema guide:

email_messages
- Message-level metadata and traces, keyed by storage_key, referencing email_content.
email_content
- Email bodies (text), headers, compression, retention metadata.
content_objects
- Alternative).
attachments
- Binary blob storage keyed to content storage_key.
transactional_emails, notifications, system_notifications
- System messaging and alerts.
content_access_log, content_retention_stats, content_search_index
- Access logging, retention metrics, search support.
retention_policies, content_encryption_policies, content_security_alerts, content_access_audit
Aggressive lifecycle, deduplication, compression, and metrics functions.

The snippet you shared:

%% Content Database Schema
content_objects {
    varchar(500) storage_key PK "Matches content_storage_key from Primary DB's 'inbox_message_refs' table"
    uuid tenant_id FK "Denormalized for data isolation and filtering"
    text content_text "Plain text version of the email body"
    text content_html "HTML version of the email body"
    jsonb headers "Full email headers"
    integer raw_size_bytes "Total size of the object including text and HTML"
    timestamp created_at
    timestamp expires_at "For implementing data retention policies"
}

attachments {
    uuid id PK
    varchar(500) parent_storage_key FK "Links to the email content object"
    varchar(255) filename "Original filename of the attachment"
    varchar(100) mime_type "Content type of the file (e.g., 'application)"
    integer size_bytes "Size of the attachment file"
    bytea content "The raw binary data of the attachment file"
    varchar(50) storage_disposition "inline or attachment"
    timestamp created_at
}

%% Relationships
content_objects ||--o{ attachments : "can have"

captures the core idea correctly: content_objects as body store; attachments referencing via parent_storage_key.

2. Gotchas in the Existing Content DB Design

Summarized issues and risks based on the content_database_schema_guide:

1) Overlap and confusion: email_content vs content_objects vs email_messages

Problem:
- email_content and content_objects both model stored bodies + headers + retention.
- email_messages in Content DB overlaps with analytics/metadata that belongs either:
  - In OLTP (operational messages), or
  - In OLAP (aggregated analytics), but not as a heavy-content table.
Risk:
- Duplicate responsibilities.
- Confusing cross-tier references (OLTP ↔ Content DB ↔ OLAP).
Principle:
- Content DB should be a pure blob/document store keyed by storage_key, with minimal metadata for retrieval and retention.

1) Cross-database foreign keys and tight coupling

Problem:
- Legacy doc suggests direct FKs or implied strong coupling to:
  - OLTP campaigns, leads, email_accounts.
  - inbox_message_refs or similar bridging tables.
Risk:
- Cross-DB FK is not feasible in practice; leads to migration and deploy headaches.
- Violates tier independence.
Principle:
- Use:
  - Storage keys and denormalized tenant_id/company_id.
  - Application-level validation.
  - No hard cross-DB FKs.

1) Overloading Content DB with logging, analytics, and monitoring

Problem:
- Tables like:
  - content_access_log
  - content_retention_stats
  - content_security_alerts
  - content_connection_pools, content_pool_metrics
  - Extensive lifecycle/system_notifications usage
- Represent:
  - High-volume logs.
  - Monitoring and security analytics.
Risk:
- Recreates the “logs in the primary DB” anti-pattern we just removed from OLAP.
- Bloats Content DB; mixes infra concerns with storage.
Principle:
- Detailed logging / monitoring → external analytics/logging stack.
- Content DB: minimal, targeted audit/retention metadata only where legally required.

1) Functions with heavy in-DB logic and embedded infra semantics

Problem:
- Lifecycle management functions:
  - run_daily_content_lifecycle, run_weekly_content_optimization, deduplicate_content, optimize_content_compression, monitor_content_lifecycle_health, etc.
- Encode:
  - Specific compression choices.
  - Dedup algorithms.
  - Rich alerting flows.
Risk:
- Tight coupling to implementation details.
- Hard to maintain/iterate; too prescriptive for a schema guide.
Principle:
- Keep:
  - Conceptual guidance and minimal reference functions.
- Move:
  - Complex monitoring and tuning to application and infra orchestration.

1) Confused separation between Content DB, Notifications, and System Alerts

Problem:
- system_notifications and similar tables live in the content doc, but:
  - Are really operational/infra constructs.
  - Overlap with admin_system_events and external logging responsibilities.
Risk:
- Multi-place definitions for “alerts.”
Principle:
- If you need durable alert summaries:
  - Use admin_system_events (OLAP) or a dedicated ops DB, not Content DB.
- For internal runtime alerts:
  - Use logging/observability stack.

3. Proposed Content DB Model (Lean, Tiered, Practical)

Target: Small, coherent schema focused on heavy content storage.

A. Core tables

1) content_objects

Purpose:
- Store email/message bodies and related structured content.
Model (normalized from the snippet):

CREATE TABLE content_objects (
    storage_key VARCHAR(500) PRIMARY KEY,
    tenant_id UUID NOT NULL,
    content_text TEXT,
    content_html TEXT,
    headers JSONB,
    raw_size_bytes INTEGER,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    expires_at TIMESTAMP WITH TIME ZONE, -- optional retention hint
    -- Optional fields (if truly needed):
    content_hash VARCHAR(64),
    compression_algorithm VARCHAR(20),
    compressed_size_bytes INTEGER,
    last_accessed_at TIMESTAMP WITH TIME ZONE
);

CREATE INDEX idx_content_objects_tenant ON content_objects(tenant_id);
CREATE INDEX idx_content_objects_expires ON content_objects(expires_at) WHERE expires_at IS NOT NULL;

Key points:

Single canonical body store.
No direct FK to OLTP; tie via storage_key from OLTP tables (e.g., inbox_message_refs).
Optional compression/hash fields allowed, but keep semantics minimal.

1) attachments

Purpose:
- Store binary attachments linked to content_objects.
Model:

CREATE TABLE attachments (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    parent_storage_key VARCHAR(500) NOT NULL REFERENCES content_objects(storage_key) ON DELETE CASCADE,
    filename VARCHAR(255) NOT NULL,
    mime_type VARCHAR(100) NOT NULL,
    size_bytes INTEGER NOT NULL,
    content BYTEA NOT NULL,
    storage_disposition VARCHAR(50) CHECK (storage_disposition IN ('inline', 'attachment')),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

CREATE INDEX idx_attachments_parent ON attachments(parent_storage_key);

Key points:

No analytics/metrics here.
Only what’s needed to fetch and render files.

B. Relationship to OLTP

OLTP holds:
- Message metadata, campaign linkage, lead linkage, status, etc.
- A column like content_storage_key or body_storage_key.
Content DB holds:
- The heavy blob keyed by that storage_key.
No cross-DB FK:
- All linking done by application and jobs.
If you still have inbox_message_refs or email_messages in Content DB:
- Treat as transitional; canonical pattern is:
  - OLTP: email_messages metadata.
  - Content DB: content_objects + attachments.

C. What to drop or externalize (relative to legacy guide)

Remove from Content DB responsibilities:
- content_connection_pools, content_pool_metrics, deep pool tuning functions.
- Rich system_notifications used as monitoring bus.
- Detailed content_access_log (use external analytics).
- content_security_alerts as a heavy event store.
Keep only:
- Lightweight, targeted retention_policies if you want DB-driven retention.
- Optional:
  - content_access_audit with strict scope and sane volume, only if legally required.
- Otherwise:
  - Push security/audit streaming to external stack as per external-analytics-logging.

4. Concrete Gotchas and How to Handle Them

1) “Matches content_storage_key from Primary DB”

Gotcha:
- Implies FK from content_objects.storage_key to an OLTP table.
Guidance:
- Keep this as a documented contract, not a DB constraint.
- Implement:
  - One-way reference: OLTP row has content_storage_key; Content DB trusts it.
  - Background validation job (in app) to detect orphaned content.

1) “Email message analytics” inside Content DB

Gotcha:
- email_messages described as analytics/traces in Content DB.
Guidance:
- Move message analytics concerns:
  - To OLTP for operational message state.
  - To OLAP for aggregates (campaign_analytics, mailbox_analytics, etc.).
- Content DB:
  - Only bodies and attachments.

1) Lifecycle / compression / dedup in SQL

Gotcha:
- Hard-coded algorithms and thresholds in DB functions.
Guidance:
- Treat those blocks as examples, not required schema.
- Implement lifecycle in:
  - A worker service with:
    - Configurable policies.
    - Metrics shipped to observability, not content DB.

1) Access logs and search index in Content DB

Gotcha:
- content_access_log and content_search_index can grow very large.
Guidance:
- Prefer:
  - Search: external search engine (OpenSearch/Meilisearch) or OLAP partial indexing.
  - Access logs: security/event pipeline or external logging.
- If you keep content_search_index:
  - Keep it lean, rolling, and clearly non-canonical; treat as projection.

5. Recommended Final Shape (for main docs)

When you codify this into the canonical docs:

Content DB:
- Tables:
  - content_objects
  - attachments
  - (optional) transactional_emails, notifications (if you explicitly choose to host them here; or move to OLTP)
  - (optional, lean) retention_policies
Responsibilities:
- Heavy content storage.
- Retention and lifecycle metadata.
- Multi-tenant isolation via tenant_id + RLS (if DB supports).
Non-responsibilities:
- No general-purpose logging store.
- No infra metrics (pools, CPU, etc.).
- No rich security).
- No campaign).

This analysis should be used to:

Clean up any Mermaid diagrams or schema guides that still show:
- content_objects + attachments + cross-DB constraints as tightly coupled.
- Logging/metrics tables inside the Content DB.
Align content DB docs with the same principles applied to OLAP:
- Keep it lean.
- Model durable state, not noisy telemetry.