Async Processing: Retries, Idempotency, and DLQ

How to build async processing that doesn’t lose messages, doesn’t process them twice, and doesn’t silently drop failures.

Context

Async processing is essential for decoupling services and handling variable load, but introduces three fundamental problems:

Messages can fail: Network issues, downstream outages, bugs
Retries can duplicate: Same message processed multiple times
Failures can hide: Messages silently dropped or stuck in limbo

Most async architectures handle the happy path well. Real systems need the sad paths covered.

Constraints

Compliance: Financial transactions must have exactly-once semantics (or audit-safe at-least-once with idempotency)
Timeline: Black Friday in 4 months; must handle 10x normal load
Team: 3 engineers, limited Kafka expertise; currently using RabbitMQ
Dependencies: Payment provider has 99.9% SLA; occasionally returns 5xx errors for 2-3 minutes at a time
Budget: No budget for a commercial Kafka distribution or managed Kafka service (Apache Kafka itself is open source); must use open-source or existing infrastructure

Options Considered

Option	Pros	Cons	Effort
A: RabbitMQ with manual retry logic	Uses existing infra, well-understood	Complex retry logic, DLQ handling manual	Medium
B: Migrate to Kafka	Better guarantees, industry standard	Learning curve, operational complexity, timeline risk	High
C: AWS SQS + Lambda	Managed, built-in DLQ, auto-scaling	Vendor lock-in, cold start latency, cost at scale	Medium
D: RabbitMQ + idempotency layer + custom DLQ handler	Builds on existing, explicit guarantees	Requires idempotency key storage, monitoring effort	Medium

Decision

Option D: RabbitMQ with explicit idempotency layer and custom DLQ handling

We keep RabbitMQ but add:

Idempotency keys stored in Redis with 7-day TTL
Exponential backoff retry with jitter (3 retries, then DLQ)
DLQ processor that alerts, stores context, and enables manual replay
Transactional outbox for critical paths (payment events)

Trade-offs Accepted

Redis as idempotency store: Single point of failure; mitigated by Redis Sentinel
Manual DLQ processing: Not fully automated; requires on-call runbook
Outbox pattern complexity: Avoids two-phase commit, but leaves an eventual-consistency window between the database commit and the publish

These are acceptable because:

Redis failure is recoverable (worst case: duplicate processing, caught by the secondary check in the database transaction)
DLQ volume is low (under 0.1% of messages); manual review is feasible
Eventual consistency window is under 5 seconds; acceptable for our use case

Second-Order Effects

Observability requirements: Need distributed tracing to follow message through retries
Testing complexity: Integration tests must cover retry scenarios
Capacity planning: Redis needs sizing for idempotency key storage
On-call runbooks: DLQ handling procedures required

Failure Modes

Failure	Impact	Mitigation
Idempotency key storage fails	Duplicate processing possible	Secondary check in database transaction
DLQ fills up unnoticed	Customer-facing issues undetected	Alert when DLQ depth > 100
Retry storm during outage	Amplifies downstream failure	Circuit breaker + exponential backoff with jitter
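Message ordering lost	Business logic errors	Route by entity ID so each entity's messages share one queue (RabbitMQ has no partitions; a consistent-hash exchange is one option); accept eventual consistency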
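
The circuit breaker named in the retry-storm row can be sketched as follows. This is a minimal illustration, not the team's implementation: the class name, the failure threshold of 5, and the 30-second cool-down are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after = reset_after              # assumed cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("downstream call skipped")
            # Cool-down elapsed: half-open, so let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, a consumer would typically requeue or delay the message rather than spend one of its three retries on a call that is known to fail.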

Common Failure Modes in Practice

Example 1: The retry storm

Payment provider returns 503 for 2 minutes. All in-flight payment messages retry simultaneously. Provider rate-limits us. Retries fail. Messages go to DLQ. Customers see failed payments. Manual intervention required for 500 orders.

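Fix: Exponential backoff with jitter. First retry at 1s ± 500ms, second at 4s ± 2s, third at 16s ± 8s (each delay is 4x the previous one, with ±50% jitter). This spreads retry load over time instead of hitting the provider in synchronized waves. The three retries span only about 21 seconds, which is shorter than the 2-minute outage, so backoff alone reduces load but does not keep messages out of the DLQ; it is paired with the circuit breaker listed under Failure Modes.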
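
The schedule above can be computed with a small helper. This is a sketch under the stated numbers (1s base, 4x growth, ±50% jitter, 3 retries); the function and parameter names are illustrative.

```python
import random
from typing import Optional

MAX_RETRIES = 3  # after the third failed retry the message goes to the DLQ


def retry_delay(attempt: int, base_s: float = 1.0, factor: float = 4.0,
                jitter: float = 0.5, rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if not 1 <= attempt <= MAX_RETRIES:
        raise ValueError(f"attempt must be between 1 and {MAX_RETRIES}")
    source = rng if rng is not None else random
    nominal = base_s * factor ** (attempt - 1)  # 1s, 4s, 16s
    spread = nominal * jitter                   # 0.5s, 2s, 8s
    return nominal + source.uniform(-spread, spread)
```

Passing a seeded random.Random as rng makes the delays reproducible in tests; in production the default module-level generator is enough.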

Example 2: The silent duplicate

Order service publishes “OrderCreated” event. Consumer processes it, creates invoice. Network blip causes RabbitMQ to not receive ACK. Message redelivered. Second invoice created. Customer charged twice.

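Fix: Idempotency key (order_id + event_type) checked before processing. If the key already exists, ACK immediately without processing. The check and the write must be one atomic set-if-absent operation; with a separate read followed by a write, two concurrent deliveries can both pass the check.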
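
A minimal sketch of that check follows. To stay runnable without a Redis server it uses an in-memory stand-in whose set(name, value, ex=..., nx=True) call mirrors the shape of redis-py's Redis.set, which returns True when the key was set and None when NX prevented it. The key prefix, handle_once, and InMemoryStore are hypothetical names for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

SEVEN_DAYS_S = 7 * 24 * 3600  # TTL from the Decision section


class InMemoryStore:
    """Illustrative stand-in for Redis; not safe across processes."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, name: str, value: str, ex: Optional[int] = None,
            nx: bool = False) -> Optional[bool]:
        now = time.monotonic()
        entry = self._data.get(name)
        if nx and entry is not None and entry[1] > now:
            return None  # key exists and has not expired
        expires_at = (now + ex) if ex else float("inf")
        self._data[name] = (value, expires_at)
        return True

    def delete(self, name: str) -> None:
        self._data.pop(name, None)


def handle_once(store, order_id: str, event_type: str,
                process: Callable[[], None]) -> bool:
    """Run process() at most once per key. Returns False for a duplicate."""
    key = f"idem:{order_id}:{event_type}"
    # Atomic claim: only the first delivery gets a truthy result.
    if not store.set(key, "1", nx=True, ex=SEVEN_DAYS_S):
        return False  # duplicate: caller should ACK without processing
    try:
        process()
    except Exception:
        store.delete(key)  # release the claim so a retry can run
        raise
    return True
```

Claiming the key before processing has a known gap: if the consumer crashes hard between the claim and the end of processing, the key stays set and the redelivery is skipped. The secondary check inside the database transaction, listed under Failure Modes, is what covers that case.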

Observability & SLOs

Key Metrics:

Message processing latency (p50, p95, p99)
Retry rate by queue
DLQ depth by queue
Idempotency hit rate (duplicate detection)
End-to-end message delivery time

SLO Targets:

99.9% of messages processed within 30 seconds
DLQ depth < 100 messages (sustained)
Zero undetected DLQ messages > 1 hour old

Alerting:

Page if DLQ depth > 100 for 5 minutes
Warn if retry rate > 5% for 15 minutes
Page if processing latency p99 > 60 seconds
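
Each of these rules has the form "metric above a threshold for a sustained period". In practice the monitoring system's own rule language would express this; the sketch below only illustrates the logic, assuming samples arrive as (timestamp in seconds, value) pairs in ascending time order. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def sustained_breach(samples: Sequence[Tuple[float, float]],
                     threshold: float, window_s: float) -> bool:
    """True if the most recent unbroken run above threshold spans window_s."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    run_start = None
    # Walk backwards until the first sample at or below the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        run_start = ts
    return run_start is not None and latest_ts - run_start >= window_s
```

For the DLQ rule, threshold would be 100 and window_s would be 300. A single dip below the threshold resets the run, which keeps a briefly spiking queue from paging anyone.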

Rollout Plan

Phase 1 (Week 1): Implement idempotency layer; deploy to staging
Phase 2 (Week 2): Add retry logic with exponential backoff; chaos test
Phase 3 (Week 3): Deploy DLQ handler; create runbooks
Phase 4 (Week 4): Canary to 10% production traffic
Phase 5 (Week 5): Full rollout; load test at 10x

Rollback Criteria:

Message loss detected (messages consumed < messages published, after accounting for in-flight and DLQ messages)
DLQ rate > 1% of messages
Processing latency p99 > 2 minutes

Ownership

DRI: Backend Platform Team
Reviewers: Payments Team, SRE

When to Revisit

Message volume exceeds RabbitMQ cluster capacity (>50k msg/sec sustained)
Need for exactly-once semantics without idempotency (consider Kafka transactions)
Multi-region deployment requiring geo-replication
Compliance requirement for message retention > 7 days