Rate Limiting: Fairness, Abuse, and Dependency Protection
How to design rate limiting that’s fair to good actors, stops bad actors, and protects your downstream dependencies from your own traffic.
Context
Rate limiting serves three distinct purposes that often conflict:
- Fairness: Ensure one customer can’t starve others (multi-tenant SaaS)
- Abuse prevention: Stop bad actors before they cause damage
- Dependency protection: Shield downstream services from traffic spikes you generate
Most rate limiting tutorials focus on simple token buckets. Real systems need layered strategies.
Constraints
- Compliance: SLA guarantees specific request quotas per pricing tier
- Timeline: API launch in 8 weeks; partners already onboarded with rate limit expectations
- Team: 2 engineers; no dedicated infrastructure team
- Dependencies: Third-party payment API with 100 req/sec global limit; shared across all customers
- Business: Can’t reject legitimate customers during peak; premium tier expects “unlimited” access
Options Considered
| Option | Pros | Cons | Effort |
|---|---|---|---|
| A: Global token bucket | Simple, protects backend | Not fair; one customer can consume all capacity | Low |
| B: Per-customer token bucket | Fair per customer | Doesn’t protect shared dependencies | Low |
| C: Layered limiting (customer + global + dependency) | Comprehensive protection | Complex to configure and debug | Medium |
| D: Adaptive rate limiting | Self-tuning, handles variable load | Complex, unpredictable behavior | High |
| E: Priority queuing with rate limiting | Protects premium customers | Queue management complexity | High |
Decision
Option C: Layered rate limiting with three tiers
Layer 1: Per-customer limits (fairness)
↓
Layer 2: Global API limits (backend protection)
↓
Layer 3: Dependency-specific limits (downstream protection)
Implementation:
- Per-customer: Sliding window counter in Redis; limits based on pricing tier
- Global: Token bucket for overall API capacity; if the global limiter is unavailable, fail open and rely on per-customer limits
- Dependency: Semaphore for external API calls; queue excess requests
Premium tier gets 10x base limits but still subject to global and dependency limits.
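The three layers above can be sketched in-process as follows. This is a minimal illustration under stated assumptions, not the production design: it swaps the Redis-backed sliding window counter for an in-memory sliding window log, it is not thread-safe for layers 1 and 2, and every name and number in it (`BASE_LIMIT_PER_MINUTE`, `check`, `DependencyGate`, the 600/min base quota) is a hypothetical placeholder.

```python
import threading
import time
from collections import deque

# Hypothetical base quota; the 10x premium multiplier comes from the decision above.
BASE_LIMIT_PER_MINUTE = 600
TIER_MULTIPLIER = {"standard": 1, "premium": 10}


class SlidingWindowLog:
    """Layer 1: per-customer limit over the trailing `window` seconds."""

    def __init__(self, window, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.hits = {}  # customer -> deque of request timestamps

    def allow(self, customer, limit):
        now = self.clock()
        q = self.hits.setdefault(customer, deque())
        while q and q[0] <= now - self.window:  # drop timestamps outside the window
            q.popleft()
        if len(q) >= limit:
            return False
        q.append(now)
        return True


class TokenBucket:
    """Layer 2: global capacity, refilled at `rate` tokens/sec up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate, self.burst, self.clock = rate, burst, clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class DependencyGate:
    """Layer 3: cap in-flight dependency calls; excess callers wait up to `max_wait` seconds."""

    def __init__(self, max_in_flight, max_wait):
        self.sem = threading.BoundedSemaphore(max_in_flight)
        self.max_wait = max_wait

    def call(self, fn):
        if not self.sem.acquire(timeout=self.max_wait):
            raise TimeoutError("dependency queue wait exceeded")
        try:
            return fn()
        finally:
            self.sem.release()


def check(customer, tier, per_customer, global_bucket, base_limit=BASE_LIMIT_PER_MINUTE):
    """Return the name of the rejecting layer, or None if the request is admitted."""
    if not per_customer.allow(customer, base_limit * TIER_MULTIPLIER[tier]):
        return "customer"
    if not global_bucket.allow():
        return "global"
    return None
```

Returning the rejecting layer's name makes it straightforward to label rejections by layer. Note that a semaphore bounds concurrency rather than requests per second, so honoring a hard req/sec quota on a dependency may also need a rate-based limiter in front of the gate.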
Trade-offs Accepted
- Premium isn’t truly “unlimited”: Marketing says “unlimited” but engineering enforces 10x the standard limit
- Complexity: Three layers means three places to debug
- Latency: Each layer adds ~1ms; total overhead ~3ms
These are acceptable because:
- “Unlimited” premium customers are under 1% of traffic; 10x handles their needs
- Layered approach is necessary; single layer can’t solve all three problems
- 3ms overhead is negligible compared to 200ms typical request latency
Second-Order Effects
- Pricing tier changes: Rate limits must be updated when tiers change
- Monitoring complexity: Need visibility into which layer triggered rejection
- Customer support: Need tooling to check customer’s current rate limit status
- Dependency onboarding: New external APIs need their own limit configuration
Failure Modes
| Failure | Impact | Mitigation |
|---|---|---|
| Redis failure | Rate limits stop working | Fail-open with local fallback limits |
| Limit misconfiguration | Legitimate traffic rejected | Canary deployment; shadow mode testing |
| Dependency limit too aggressive | Requests pile up, timeout | Circuit breaker for dependency queue |
| One customer exploits limit reset | Spikes at window boundary | Sliding window instead of fixed window |
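The Redis-failure mitigation in the table can be sketched as a wrapper that falls back to an in-process limiter when the shared store is unreachable. This is a sketch under assumptions: `FailOpenLimiter`, `LocalWindow`, and `primary_allow` are hypothetical names, the primary check is injected as a plain callable rather than a real Redis client, and the caught exception types are placeholders for whatever the store client actually raises.

```python
import time
from collections import deque


class LocalWindow:
    """In-process sliding window, used only while the shared store is unreachable."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit, self.window, self.clock = limit, window, clock
        self.hits = {}  # customer -> deque of request timestamps

    def allow(self, customer):
        now = self.clock()
        q = self.hits.setdefault(customer, deque())
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


class FailOpenLimiter:
    """Use the shared limiter when it answers; fall back to local limits when it errors."""

    def __init__(self, primary_allow, fallback, errors=(ConnectionError, TimeoutError)):
        self.primary_allow = primary_allow  # callable(customer) -> bool, e.g. a Redis-backed check
        self.fallback = fallback
        self.errors = errors                # assumption: replace with the store client's exception types
        self.fallback_decisions = 0         # worth exporting as a metric

    def allow(self, customer):
        try:
            return self.primary_allow(customer)
        except self.errors:
            self.fallback_decisions += 1
            return self.fallback.allow(customer)
```

Because each API instance keeps its own local counts, a fleet of N instances can admit up to N times the local limit per customer during an outage, so the fallback limit likely needs to be set with the instance count in mind.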
Common Failure Modes in Practice
Example 1: The synchronized spike
All customers hit the API at the minute boundary (cron jobs). Fixed-window rate limits reset simultaneously, so every customer gets a full quota at :00 and the backend is overwhelmed for the first 10 seconds of every minute.
Fix: Sliding window rate limiting. Quota is always “last 60 seconds” not “this minute.” Smooths traffic.
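The boundary behavior can be illustrated for a single customer by comparing a fixed window with a sliding window counter, which estimates the trailing 60 seconds by weighting the previous bucket's count. This is a minimal sketch: `WINDOW`, `LIMIT`, the class names, and the customer ID are hypothetical, and timestamps are passed in explicitly so the comparison is deterministic.

```python
from collections import defaultdict

WINDOW = 60   # seconds
LIMIT = 100   # hypothetical per-customer quota per window


class FixedWindow:
    """Quota resets at every window boundary."""

    def __init__(self):
        self.counts = defaultdict(int)

    def allow(self, customer, now):
        bucket = int(now // WINDOW)
        if self.counts[(customer, bucket)] >= LIMIT:
            return False
        self.counts[(customer, bucket)] += 1
        return True


class SlidingWindowCounter:
    """Estimate the trailing window from the current and previous bucket counts."""

    def __init__(self):
        self.counts = defaultdict(int)

    def allow(self, customer, now):
        bucket = int(now // WINDOW)
        elapsed = (now % WINDOW) / WINDOW  # fraction of the current bucket already elapsed
        prev = self.counts[(customer, bucket - 1)]
        curr = self.counts[(customer, bucket)]
        estimate = prev * (1 - elapsed) + curr
        if estimate >= LIMIT:
            return False
        self.counts[(customer, bucket)] += 1
        return True


def admitted(limiter, times):
    """Count how many requests at the given timestamps are admitted."""
    return sum(limiter.allow("acme", t) for t in times)


# 100 requests just before the boundary, then 100 just after it.
burst = [59.0] * 100 + [60.0] * 100
```

With this burst, `admitted(FixedWindow(), burst)` is 200 (a double quota within two seconds), while `admitted(SlidingWindowCounter(), burst)` is 100, because the previous bucket still carries full weight at the boundary.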
Example 2: The “legitimate” abuse
Customer builds polling integration. Hits API 100 times/second checking for updates. Technically within their “unlimited” tier. Consumes 80% of backend capacity. Other customers suffer.
Fix: Enforce a hard per-customer ceiling regardless of tier. Even “unlimited” means “very high,” not “infinite.” Publish a fair use policy.
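One way to express such a ceiling is to cap every tier's limit at a fixed share of global capacity. The sketch below is illustrative only: `TIER_LIMIT_RPS`, `MAX_SHARE_OF_GLOBAL`, and `effective_limit_rps` are hypothetical, and the 125 req/sec capacity used in the usage note is merely what Example 2's numbers imply (100 req/sec being 80% of capacity).

```python
# Hypothetical tier limits in requests/sec; premium is 10x standard.
TIER_LIMIT_RPS = {"standard": 10.0, "premium": 100.0}

# Hypothetical fair-use ceiling: no single customer may take more than this share.
MAX_SHARE_OF_GLOBAL = 0.25


def effective_limit_rps(tier, global_capacity_rps):
    """Tier limit, capped at a fixed share of total backend capacity."""
    ceiling = MAX_SHARE_OF_GLOBAL * global_capacity_rps
    return min(TIER_LIMIT_RPS[tier], ceiling)
```

Under these assumed numbers, `effective_limit_rps("premium", 125.0)` is 31.25 req/sec, so the polling customer in Example 2 would be held well below the 100 req/sec that starved other customers, while a standard customer keeps its full 10 req/sec.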
Observability & SLOs
Key Metrics:
- Rejection rate by layer (customer, global, dependency)
- Rejection rate by customer tier
- Dependency queue depth and wait time
- 429 response rate (overall and per customer)
- Time to rate limit recovery after spike
SLO Targets:
- Under 1% of standard tier requests rejected due to global limits
- Under 0.1% of premium tier requests rejected due to any limit
- Dependency queue wait time p99 under 100ms
Alerting:
- Warn if any customer rejection rate > 10%
- Page if global rejection rate > 5%
- Page if dependency queue depth > 1000
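The alert rules above can be written as a small evaluation function. This is a sketch, assuming rejection rates arrive as fractions (0.10 for 10%) from whatever metrics pipeline is in use; `evaluate_alerts` and its parameter names are hypothetical.

```python
def evaluate_alerts(customer_rejection_rates, global_rejection_rate, dependency_queue_depth):
    """Return (severity, message) tuples for the three alert rules."""
    alerts = []
    for customer, rate in sorted(customer_rejection_rates.items()):
        if rate > 0.10:  # warn: any single customer above 10%
            alerts.append(("warn", f"{customer} rejection rate {rate:.1%}"))
    if global_rejection_rate > 0.05:  # page: global rejections above 5%
        alerts.append(("page", f"global rejection rate {global_rejection_rate:.1%}"))
    if dependency_queue_depth > 1000:  # page: dependency queue backing up
        alerts.append(("page", f"dependency queue depth {dependency_queue_depth}"))
    return alerts
```

For example, `evaluate_alerts({"acme": 0.12, "globex": 0.02}, 0.01, 10)` yields a single warning for `acme` and no pages.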
Rollout Plan
- Phase 1 (Week 1-2): Implement per-customer limits; shadow mode (log, don’t reject)
- Phase 2 (Week 3): Analyze shadow mode data; tune limits based on actual traffic
- Phase 3 (Week 4-5): Enable per-customer enforcement; monitor rejection rates
- Phase 4 (Week 6): Add global and dependency layers in shadow mode
- Phase 5 (Week 7-8): Full enforcement; partner notification of rate limit behavior
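Shadow mode, as used in Phases 1 and 4, can be sketched as a wrapper that records what the limiter would have rejected while still admitting the request. The names here (`ShadowLimiter`, `allow_fn`, the logger name) are hypothetical; flipping `enforce` to `True` corresponds to moving a layer from shadow mode to enforcement.

```python
import logging

logger = logging.getLogger("ratelimit.shadow")  # hypothetical logger name


class ShadowLimiter:
    """Wrap a limiter decision; in shadow mode, log would-be rejections but admit them."""

    def __init__(self, allow_fn, enforce=False):
        self.allow_fn = allow_fn  # callable(customer) -> bool, the real limiter decision
        self.enforce = enforce
        self.would_reject = 0     # count of requests the limiter flagged

    def allow(self, customer):
        if self.allow_fn(customer):
            return True
        self.would_reject += 1
        logger.info("rate limit exceeded customer=%s enforced=%s", customer, self.enforce)
        return not self.enforce   # shadow mode admits; enforcement rejects
```

Comparing `would_reject` against total traffic during the shadow phases gives the data needed to tune limits before any customer sees a rejection.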
Rollback Criteria:
- Rejection rate > 5% for standard tier customers
- Any premium tier customer seeing rejections
- Latency increase > 10ms attributable to rate limiting
Ownership
- DRI: API Platform Team
- Reviewers: Customer Success, Product, SRE
When to Revisit
- Pricing tier changes affecting rate limit expectations
- New external dependency with different limiting requirements
- Geographic expansion requiring regional rate limits
- Traffic pattern shift (e.g., new high-volume use case)