Async Processing: Retries, Idempotency, and DLQ Design
How to design reliable async processing with proper retry semantics, idempotency guarantees, and dead letter queue handling.
Context
A payment processing system needed to handle webhook notifications from multiple payment providers. Each webhook triggered downstream processing (ledger updates, notifications, reconciliation). The system processed 10M+ events/day and, for financial accuracy, required exactly-once effects: each event must be applied once and only once, no matter how many times a provider delivers it.
Business context:
- 10M+ payment events/day
- Financial accuracy: zero duplicate or missed transactions
- Provider SLAs: webhooks may be delivered multiple times
- Regulatory requirement: full audit trail of all processing
Constraints
| Constraint | Impact |
|---|---|
| Exactly-once semantics | Cannot double-charge or miss payments |
| Provider retries | Must handle duplicate webhook delivery |
| Ordering | Some events must be processed in order (charge → refund) |
| Latency | P99 processing time under 5 seconds |
| Audit | Every event and outcome must be logged |
Options Considered
| Option | Pros | Cons |
|---|---|---|
| Synchronous processing | Simple, immediate feedback | No retry, blocks provider |
| At-least-once + dedup | Common pattern, proven | Requires idempotency keys |
| Exactly-once (transactional outbox) | Strongest guarantees | Complex, performance overhead |
| Event sourcing | Natural audit, replay capability | Significant architecture change |
Decision
We chose at-least-once delivery with application-level idempotency:
- Idempotency key: Provider webhook ID + event type, stored in Redis with 7-day TTL
- Retry strategy: Exponential backoff (1s, 2s, 4s, 8s, 16s) with jitter, max 5 retries
- Dead letter queue: Failed events after max retries go to DLQ with full context
- Ordering: Partition by entity ID (user/merchant), process partitions serially
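The idempotency check in the first bullet can be sketched as an atomic claim on the key. This is a minimal sketch, not the production code: an in-memory stand-in replaces the real Redis client, and the key format and function names are illustrative.

```python
IDEMPOTENCY_TTL = 7 * 24 * 3600  # 7-day retention, matching the Redis TTL above


class InMemoryStore:
    """Illustrative stand-in for a Redis client supporting SET NX EX."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self._data:
            return None  # mirrors redis-py: SET NX on an existing key returns None
        self._data[key] = value
        return True


def idempotency_key(webhook_id: str, event_type: str) -> str:
    # Provider webhook ID + event type, as described above
    return f"idem:{webhook_id}:{event_type}"


def claim_event(store, webhook_id: str, event_type: str) -> bool:
    """Atomically claim the event; False means a duplicate delivery."""
    key = idempotency_key(webhook_id, event_type)
    return bool(store.set(key, "1", nx=True, ex=IDEMPOTENCY_TTL))
```

Because SET NX is a single atomic operation, two workers racing on the same webhook cannot both claim it; the loser sees a duplicate and drops the event.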
Why this approach:
- Idempotency keys are natural (providers include unique event IDs)
- At-least-once is simpler to operate than exactly-once
- DLQ allows manual intervention without blocking the pipeline
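A minimal sketch of the retry loop with backoff, jitter, and DLQ routing. The handler, the DLQ sink, and the specific jitter scheme (equal jitter) are assumptions, not the production implementation.

```python
import random

BASE_DELAY_S = 1.0
MAX_RETRIES = 5


def backoff_delay(attempt: int) -> float:
    """Exponential backoff (1s, 2s, 4s, 8s, 16s) with equal jitter."""
    cap = BASE_DELAY_S * (2 ** attempt)
    return cap / 2 + random.uniform(0, cap / 2)


def process_with_retry(handler, event, send_to_dlq, sleep):
    for attempt in range(MAX_RETRIES):
        try:
            return handler(event)
        except Exception as exc:
            if attempt == MAX_RETRIES - 1:
                # Exhausted retries: park the event with full context for review.
                send_to_dlq({"event": event, "error": repr(exc), "attempts": MAX_RETRIES})
                return None
            sleep(backoff_delay(attempt))
```

Jitter matters at this volume: without it, a downstream blip causes thousands of events to retry in lockstep, which is exactly the retry-storm pattern described under failure modes below.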
Trade-offs Accepted
- Storage cost: 7-day idempotency key retention consumes ~50 GB of Redis memory
- Manual intervention: DLQ events require on-call engineer review
- Ordering constraints: Partitioning limits parallelism for high-volume merchants
Second-order Effects
- Positive: Idempotency infrastructure reused for API request deduplication
- Unexpected: DLQ analysis revealed provider bugs (duplicate webhooks with different IDs)
- Operational: DLQ processing became a daily ritual requiring dedicated time
Failure Modes
| Failure Mode | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Redis idempotency store failure | Low | Duplicate processing | Fallback to DB check, accept latency |
| Poison message | Medium | Blocks partition | Max retry limit, auto-DLQ |
| DLQ overflow | Low | Lost events | Alert on DLQ depth, auto-scale review |
| Clock skew (TTL issues) | Very Low | Premature key expiry | Use logical clocks, extend TTL margin |
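The first mitigation in the table (fall back to a database check when the Redis idempotency store fails) might look like the sketch below; the DB lookup is a hypothetical callable, and the error handling is simplified to a single exception type.

```python
def is_duplicate(store, db_has_event, key, ttl=7 * 24 * 3600) -> bool:
    """Prefer the fast Redis claim; on failure, accept latency and ask the DB."""
    try:
        claimed = store.set(key, "1", nx=True, ex=ttl)
        return not claimed
    except ConnectionError:
        # Redis down: slower, but still prevents duplicate processing.
        return db_has_event(key)
```

This degrades latency rather than correctness, which matches the stated priority: a slow payment event is recoverable, a double-charged one is not.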
Observability & SLOs
- SLI: Percentage of events processed successfully within 5 seconds
- SLO: 99.9% success rate, P99 latency under 5 seconds
- Dashboard: Processing latency, retry rate, DLQ depth, idempotency cache hit rate
- Alerts: DLQ depth > 100, retry rate > 5%, Redis latency > 10ms
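The alert conditions above can be collapsed into a single paging predicate; the thresholds are copied from the alert list, while the function and parameter names are illustrative.

```python
def should_page(dlq_depth: int, retry_rate: float, redis_p99_ms: float) -> bool:
    """Fire when any alert condition from the list above is breached."""
    return (
        dlq_depth > 100
        or retry_rate > 0.05    # retry rate > 5%
        or redis_p99_ms > 10.0  # Redis latency > 10ms
    )
```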
Lessons Learned
- Idempotency key collision: Different event types arrived with the same provider ID. Fixed by including the event type in the key.
- Retry amplification: Transient failures caused retry storms that overwhelmed downstream services. Fixed by adding circuit breakers and adaptive rate limiting.
- DLQ rot: Events sat in the DLQ for weeks without review. Fixed by adding a 24-hour SLA on DLQ processing and automated escalation.
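The circuit-breaker fix for retry amplification can be sketched as follows; the thresholds and the half-open behavior are assumptions, not the values used in production.

```python
import time


class CircuitBreaker:
    """Stops calling a failing downstream until a cool-off period passes."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if self.clock() - self.opened_at >= self.reset_timeout_s:
            return True  # half-open: let one trial request through
        return False     # open: shed load instead of amplifying retries

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Combined with the backoff jitter in the retry loop, this bounds the extra load a struggling downstream sees: retries spread out in time, and once the breaker opens they stop entirely until the cool-off elapses.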
When to Revisit
Revisit if:
- Event volume grows 10x (consider exactly-once with Kafka transactions)
- Adding new providers with different retry semantics
- DLQ processing becomes more than 1 hour/day of engineering time
- Financial audit requirements change