Monitoring System
Production-hardened monitoring with tiered thresholds, sliding windows, and ratio-based domain protection
1. Tiered Mailbox Thresholds
We use a two-tier system to catch issues before they cause irreversible damage:
WARNING State
3 bounces / 60 sends
5% bounce rate
Action: Mailbox transitions to warning state. Logging intensifies. Operators alerted.
PAUSE State
5 bounces / 100 sends
5% bounce rate threshold
Action: Mailbox paused immediately. No new emails sent. Cooldown period begins.
Why Two Tiers?
Early warning at 3/60 gives operators time to investigate before the mailbox is paused at 5/100. This prevents surprises and allows manual intervention.
1b. Minimum Volume Requirement
Enforcement thresholds only activate after a mailbox reaches a minimum send volume. This prevents the system from overreacting to statistically insignificant data.
Why minimum volume matters
A mailbox that sent 4 emails and received 1 bounce has a 25% bounce rate — but this is not statistically meaningful. One bounce at low volume does not indicate a deliverability problem. Pausing mailboxes based on tiny sample sizes would constantly churn accounts during early warmup.
The three enforcement triggers:
| Trigger | Condition | Purpose |
|---|---|---|
| Percentage | 3%+ bounce rate AND 60+ sends | Catches sustained problems after sufficient data |
| Absolute | 5+ bounces in sliding window | Safety net regardless of send volume |
| Early warning | 3+ bounces within first 60 sends | Flags risk before percentage trigger activates |
Example: Low Volume vs High Volume
4 sends, 1 bounce (25%)
Below 60-send minimum. No enforcement action. The system monitors but waits for more data before making decisions.
Result: Mailbox stays active
80 sends, 3 bounces (3.75%)
Above 60-send minimum and above 3% threshold. The bounce pattern is statistically significant.
Result: Mailbox paused, removed from campaigns, enters healing
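The trigger table and the two examples above can be sketched as a single check. This is a minimal illustration, not the system's actual code: the function name `fired_triggers`, the constant names, and the reading of "within first 60 sends" as `sends <= 60` are assumptions made for the example.

```python
from __future__ import annotations

# Values taken from the trigger table; the names are hypothetical.
MIN_SENDS = 60               # minimum volume before the percentage trigger applies
PERCENT_THRESHOLD = 0.03     # 3% bounce rate
ABSOLUTE_BOUNCES = 5         # safety net, regardless of send volume
EARLY_WARNING_BOUNCES = 3    # bounces within the first 60 sends


def fired_triggers(sends: int, bounces: int) -> list[str]:
    """Return the names of the enforcement triggers that fire for a window."""
    fired = []
    # Percentage: only meaningful once enough sends have accumulated.
    if sends >= MIN_SENDS and bounces / sends >= PERCENT_THRESHOLD:
        fired.append("percentage")
    # Absolute: fires on bounce count alone.
    if bounces >= ABSOLUTE_BOUNCES:
        fired.append("absolute")
    # Early warning: assumed to mean "while still within the first 60 sends".
    if sends <= MIN_SENDS and bounces >= EARLY_WARNING_BOUNCES:
        fired.append("early_warning")
    return fired


low_volume = fired_triggers(4, 1)     # [] -> no enforcement, mailbox stays active
high_volume = fired_triggers(80, 3)   # ["percentage"] -> 3.75% with 80 sends
```

Returning the list of fired triggers, rather than a final state, keeps the sketch neutral about how each trigger maps to a warning or a pause.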
2. Sliding Window Logic
Instead of hard resetting stats to 0/0 after 100 sends, we use a sliding window that keeps 50% of past data.
Old Behavior (Hard Reset)
100 sent, 6 bounces → reset → 0 sent, 0 bounces
Problem: Volatility patterns erased
New Behavior (Sliding Window)
100 sent, 6 bounces → slide → 50 sent, 3 bounces
Benefit: Volatility preserved, reputation context maintained
Impact
A mailbox with a history of bounces won't suddenly appear "clean" after 100 sends. The sliding window ensures reputation tracking reflects reality.
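The slide step can be sketched in a few lines. The function name `slide_window`, its parameters, and the choice to round down when halving an odd count are assumptions for illustration; the source only specifies the 100-send limit and the 50% retention.

```python
from __future__ import annotations


def slide_window(sent: int, bounces: int,
                 limit: int = 100, keep: float = 0.5) -> tuple[int, int]:
    """Shrink the window once it reaches `limit` sends, keeping a share of history.

    Rounding down on odd counts is an assumption; the source does not say
    how fractional values are handled.
    """
    if sent < limit:
        # Window is not full yet: nothing to slide.
        return sent, bounces
    return int(sent * keep), int(bounces * keep)


after_slide = slide_window(100, 6)   # (50, 3), instead of a hard reset to (0, 0)
```

Because half of the bounces carry over, a mailbox that just had 6 bounces starts the next window already at 3 rather than at zero.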
3. Ratio-Based Domain Protection
Domain health is calculated from the percentage of unhealthy mailboxes, not from absolute counts. This lets the same thresholds work for small teams (3 mailboxes) and large agencies (200+ mailboxes).
| Threshold | Percentage | Action |
|---|---|---|
| WARNING | 30% unhealthy | Domain enters warning |
| PAUSE | 50% unhealthy | Domain paused, all mailboxes blocked |
Scaling Example
| Total Mailboxes | Unhealthy | Percentage | Status |
|---|---|---|---|
| 3 | 1 | 33% | ⚠️ Warning |
| 10 | 2 | 20% | ✅ Healthy |
| 10 | 5 | 50% | 🛑 Paused |
| 30 | 10 | 33% | ⚠️ Warning |
| 200 | 110 | 55% | 🛑 Paused |
Why Ratios?
Absolute thresholds don't scale:
- With 3 mailboxes, losing 2 is catastrophic (67% failure)
- With 30 mailboxes, losing 2 is negligible (7% failure)
- Ratio-based logic adapts automatically as infrastructure grows
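As a rough sketch, the ratio check reduces to one division and two comparisons. The function name `domain_status`, the constant names, and the handling of a domain with zero mailboxes are assumptions; the 30% and 50% thresholds come from the table above.

```python
WARNING_RATIO = 0.30   # 30% unhealthy -> domain enters warning
PAUSE_RATIO = 0.50     # 50% unhealthy -> domain paused


def domain_status(total_mailboxes: int, unhealthy: int) -> str:
    """Classify a domain by its share of unhealthy mailboxes."""
    if total_mailboxes <= 0:
        # Assumption: a domain with no mailboxes is treated as healthy.
        return "healthy"
    ratio = unhealthy / total_mailboxes
    if ratio >= PAUSE_RATIO:
        return "paused"
    if ratio >= WARNING_RATIO:
        return "warning"
    return "healthy"


# Rows from the scaling example table.
small_team = domain_status(3, 1)        # "warning" (33%)
large_agency = domain_status(200, 110)  # "paused" (55%)
```

The same two constants cover every row of the scaling table, which is the point of using ratios instead of absolute counts.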
Monitoring Dashboard
The monitoring dashboard provides real-time visibility into:
Mailbox Metrics
- Current status (healthy, warning, paused)
- Window bounce count (e.g., 3/60)
- Total sends and bounces
- Cooldown expiry time
Domain Aggregations
- Total mailboxes vs unhealthy count
- Unhealthy percentage
- Domain status (healthy, warning, paused)
- Average risk score across mailboxes
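One way to picture how the domain aggregations derive from per-mailbox metrics is the sketch below. Everything in it is hypothetical: the `MailboxMetrics` fields, the `summarize_domain` function, the risk score scale, and the assumption that any mailbox not in the healthy state counts as unhealthy.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass
class MailboxMetrics:
    address: str
    status: str           # "healthy", "warning", or "paused"
    window_sends: int
    window_bounces: int
    risk_score: float     # scale is not specified in the source


def summarize_domain(mailboxes: list[MailboxMetrics]) -> dict:
    """Roll per-mailbox metrics up into domain-level aggregates."""
    total = len(mailboxes)
    # Assumption: "unhealthy" means any status other than "healthy".
    unhealthy = sum(1 for m in mailboxes if m.status != "healthy")
    return {
        "total_mailboxes": total,
        "unhealthy_count": unhealthy,
        "unhealthy_pct": round(100 * unhealthy / total, 1) if total else 0.0,
        "avg_risk_score": mean(m.risk_score for m in mailboxes) if total else 0.0,
    }


summary = summarize_domain([
    MailboxMetrics("a@example.com", "healthy", 60, 0, 10.0),
    MailboxMetrics("b@example.com", "warning", 60, 3, 20.0),
    MailboxMetrics("c@example.com", "paused", 100, 5, 30.0),
])
```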
🎯 Production-Hardened
These monitoring refinements are based on real-world outbound operations:
- Tiered thresholds prevent surprise pauses
- Sliding windows maintain reputation context
- Ratio-based domains scale with infrastructure
- All thresholds are tunable in Configuration