
Async Processing: Retries, Idempotency, and DLQ Design

How to design reliable async processing with proper retry semantics, idempotency guarantees, and dead letter queue handling.

Context

A payment processing system needed to handle webhook notifications from multiple payment providers. Each webhook triggered downstream processing (ledger updates, notifications, reconciliation). The system processed 10M+ events/day with strict exactly-once processing requirements for financial accuracy.

Business context:

  • 10M+ payment events/day
  • Financial accuracy: zero duplicate or missed transactions
  • Provider SLAs: webhooks may be delivered multiple times
  • Regulatory requirement: full audit trail of all processing

Constraints

| Constraint | Impact |
| --- | --- |
| Exactly-once semantics | Cannot double-charge or miss payments |
| Provider retries | Must handle duplicate webhook delivery |
| Ordering | Some events must be processed in order (charge → refund) |
| Latency | P99 processing time under 5 seconds |
| Audit | Every event and outcome must be logged |

Options Considered

| Option | Pros | Cons |
| --- | --- | --- |
| Synchronous processing | Simple, immediate feedback | No retry, blocks provider |
| At-least-once + dedup | Common pattern, proven | Requires idempotency keys |
| Exactly-once (transactional outbox) | Strongest guarantees | Complex, performance overhead |
| Event sourcing | Natural audit, replay capability | Significant architecture change |

Decision

We chose at-least-once delivery with application-level idempotency:

  1. Idempotency key: Provider webhook ID + event type, stored in Redis with 7-day TTL
  2. Retry strategy: Exponential backoff (1s, 2s, 4s, 8s, 16s) with jitter, max 5 retries
  3. Dead letter queue: Failed events after max retries go to DLQ with full context
  4. Ordering: Partition by entity ID (user/merchant), process partitions serially
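The idempotency check (step 1) can be sketched as follows. This is a minimal in-memory stand-in: in the actual system the key would be claimed atomically in Redis with something like `SET key 1 NX EX 604800` (7-day TTL). Class and method names here are illustrative, not from the source.

```python
import time

class IdempotencyStore:
    """In-memory sketch of the dedup store; production uses Redis with a TTL."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600):
        self.ttl = ttl_seconds
        self._seen: dict[str, float] = {}  # key -> expiry timestamp

    def claim(self, provider_event_id: str, event_type: str) -> bool:
        """Return True if this event has not been processed yet, and claim it.

        The key combines the provider's webhook ID with the event type,
        so a `charge` and a `refund` sharing one provider ID stay distinct.
        """
        key = f"{provider_event_id}:{event_type}"
        now = time.time()
        expiry = self._seen.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate delivery; skip processing
        self._seen[key] = now + self.ttl
        return True
```

A worker calls `claim()` before touching the ledger; a `False` result means the webhook is a redelivery and can be acknowledged without side effects.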

Why this approach:

  • Idempotency keys are natural (providers include unique event IDs)
  • At-least-once is simpler to operate than exactly-once
  • DLQ allows manual intervention without blocking the pipeline
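The retry and DLQ behavior (steps 2 and 3) can be sketched as one loop: exponential backoff with full jitter, then routing to a dead letter queue with context after the fifth failure. The `dlq` here is a plain list standing in for a real queue, and `handler` is any hypothetical downstream processing function.

```python
import random
import time

MAX_RETRIES = 5
BASE_DELAY_S = 1.0  # doubles per attempt: 1s, 2s, 4s, 8s, 16s (before jitter)

def process_with_retries(event, handler, dlq, sleep=time.sleep) -> bool:
    """At-least-once processing with backoff; DLQ after MAX_RETRIES failures."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            handler(event)
            return True
        except Exception as exc:
            if attempt == MAX_RETRIES:
                # Preserve full context so an on-call engineer can replay it.
                dlq.append({"event": event, "error": str(exc), "attempts": attempt})
                return False
            # Full jitter: a uniform draw up to the exponential cap avoids
            # synchronized retry storms across workers.
            sleep(random.uniform(0, BASE_DELAY_S * 2 ** (attempt - 1)))
    return False
```

The injectable `sleep` keeps the loop testable; in production it is the real `time.sleep` (or the scheduler's delayed redelivery).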

Trade-offs Accepted

  • Storage cost: 7-day idempotency key retention uses ~50GB Redis
  • Manual intervention: DLQ events require on-call engineer review
  • Ordering constraints: Partitioning limits parallelism for high-volume merchants
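The ordering scheme behind that last trade-off is a stable hash from entity ID to partition: every event for one user or merchant lands on the same partition, which a single worker drains serially. A sketch, with the partition count chosen arbitrarily for illustration:

```python
import hashlib

NUM_PARTITIONS = 32  # illustrative; the real count is a capacity decision

def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a user/merchant ID to a partition deterministically.

    Using a cryptographic hash (rather than Python's salted `hash()`)
    keeps the assignment stable across processes and restarts.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

This is also where the parallelism limit comes from: a single high-volume merchant can only ever be processed as fast as one partition's worker.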

Second-order Effects

  • Positive: Idempotency infrastructure reused for API request deduplication
  • Unexpected: DLQ analysis revealed provider bugs (duplicate webhooks with different IDs)
  • Operational: DLQ processing became a daily ritual requiring dedicated time

Failure Modes

| Failure Mode | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Redis idempotency store failure | Low | Duplicate processing | Fallback to DB check, accept latency |
| Poison message | Medium | Blocks partition | Max retry limit, auto-DLQ |
| DLQ overflow | Low | Lost events | Alert on DLQ depth, auto-scale review |
| Clock skew (TTL issues) | Very Low | Premature key expiry | Use logical clocks, extend TTL margin |
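The first mitigation, falling back to a database check when the Redis store fails, can be sketched as a wrapper that prefers correctness over latency. Both stores below are hypothetical stand-ins exposing an assumed `exists(key)` method:

```python
class DedupWithFallback:
    """Check the fast key store first; on failure, fall back to the DB.

    A slower authoritative lookup beats the alternative of double-processing
    a payment when the cache is unavailable.
    """

    def __init__(self, primary, db):
        self.primary = primary  # Redis-like store (fast path)
        self.db = db            # authoritative processed-events table

    def already_processed(self, key: str) -> bool:
        try:
            return self.primary.exists(key)
        except ConnectionError:
            # Accept extra latency rather than risk a duplicate ledger entry.
            return self.db.exists(key)
```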

Observability & SLOs

  • SLI: Percentage of events processed successfully within 5 seconds
  • SLO: 99.9% success rate, P99 latency under 5 seconds
  • Dashboard: Processing latency, retry rate, DLQ depth, idempotency cache hit rate
  • Alerts: DLQ depth > 100, retry rate > 5%, Redis latency > 10ms

Failure Modes Seen in Production

  1. Idempotency key collision: Different event types with same provider ID. Fixed by including event type in the key.

  2. Retry amplification: Transient failures caused retry storms that overwhelmed downstream services. Fixed by adding circuit breakers and adaptive rate limiting.

  3. DLQ rot: Events sat in DLQ for weeks without review. Fixed by adding SLA on DLQ processing (24h) and automated escalation.
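The circuit breaker used against retry amplification (item 2) can be sketched as a minimal count-based breaker. This is illustrative, not any particular library's API: after `threshold` consecutive failures it rejects calls for `cooldown` seconds so retries stop hammering a struggling downstream service, then lets a probe through.

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; half-open after a cooldown."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Half-open: reset and let one attempt probe the downstream.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

Workers consult `allow()` before each retry attempt; a closed breaker turns a retry storm into a quiet backoff period for the whole fleet.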

When to Revisit

Revisit if:

  • Event volume grows 10x (consider exactly-once with Kafka transactions)
  • Adding new providers with different retry semantics
  • DLQ processing becomes more than 1 hour/day of engineering time
  • Financial audit requirements change