Iteration: v1 — Basic Design Next: Distributed rate limiting, multi-tier strategies, adaptive throttling
1. Problem Statement
A rate limiter controls the number of requests a client can send to a server within a defined time window. Without one, a single client (malicious or buggy) can overwhelm the system, degrading service for everyone.
When do you need a rate limiter?
| Scenario | Example |
|---|---|
| Prevent abuse / DDoS | Bot hammering login endpoint |
| Protect shared resources | Database connection pool exhaustion |
| Enforce billing tiers | Free tier: 100 req/min, Pro: 10K req/min |
| Ensure fairness | One tenant on a multi-tenant platform shouldn’t starve others |
| Cost control | Upstream API charges per call (e.g., OpenAI, Twilio) |
2. Requirements
2.1 Functional Requirements
- FR1: Limit requests per client (identified by user ID, API key, or IP) to N requests per time window.
- FR2: Return a clear rejection response (HTTP 429) when the limit is exceeded.
- FR3: Support configurable limits — different endpoints or user tiers can have different limits.
- FR4: Include rate limit metadata in response headers (
X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Reset).
2.2 Non-Functional Requirements (Basic)
| NFR | Target | Why it matters |
|---|---|---|
| Latency | < 1ms per check | Rate limiter sits in the hot path of every request |
| Accuracy | Best-effort (slight over-count is acceptable) | Exact counting is expensive; a few extra requests slipping through is fine |
| Availability | Should not become a single point of failure | If the limiter goes down, we need a fallback policy (fail-open vs fail-closed) |
| Memory | Bounded, proportional to active clients | Can’t allocate unbounded memory per IP/user |
Parking for v2: Strong consistency across nodes, geo-distributed limiting, sub-second window granularity.
3. High-Level Architecture (v1 — Single Node)
┌─────────────────────────┐
│ Client │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ API Gateway / │
│ Reverse Proxy │
│ (e.g., Nginx, Envoy) │
└────────────┬────────────┘
│
▼
┌─────────────────────────┐
│ Rate Limiter │
│ Middleware │
│ │
│ ┌───────────────────┐ │
│ │ In-Memory Store │ │
│ │ (HashMap/Cache) │ │
│ └───────────────────┘ │
└────────────┬────────────┘
│
┌──────┴──────┐
│ │
ALLOWED REJECTED
│ │
▼ ▼
┌───────────┐ ┌──────────┐
│ App │ │ HTTP 429 │
│ Server │ │ Too Many │
│ │ │ Requests │
└───────────┘ └──────────┘Where does the rate limiter live?
| Option | Pros | Cons |
|---|---|---|
| Client-side | Reduces unnecessary network calls | Can be bypassed; not trustworthy |
| Server middleware | Simple to implement; colocated with app logic | Coupled to app; doesn’t protect upstream |
| API Gateway | Centralized; language-agnostic; protects all services | Another infra component to manage |
v1 decision: Server middleware (simplest to start). We’ll move to API Gateway in v2.
4. Rate Limiting Algorithms
This is the core of the design. There are 4 well-known algorithms. Let’s understand each and pick one for v1.
4.1 Fixed Window Counter
How it works: Divide time into fixed windows (e.g., each minute). Maintain a counter per client per window. Increment on each request. Reject if counter > limit.
Window: [12:00:00 — 12:01:00] Counter: 0 → 1 → 2 → ... → 100 → REJECT
Window: [12:01:00 — 12:02:00] Counter resets to 0Data structure:
Key: "{client_id}:{window_start_timestamp}"
Value: count (integer)| Pros | Cons |
|---|---|
| Dead simple | Boundary burst problem: 100 requests at 12:00:59 + 100 at 12:01:00 = 200 in 2 seconds |
| Low memory (1 entry per client per window) | Unfair at window edges |
4.2 Sliding Window Log
How it works: Store the timestamp of every request. On a new request, remove timestamps older than now - window_size. If remaining count >= limit, reject.
Timestamps: [12:00:01, 12:00:05, 12:00:12, ..., 12:00:58]
New request at 12:01:02 → remove everything before 12:00:02 → count remainingData structure:
Key: "{client_id}"
Value: sorted set of timestamps| Pros | Cons |
|---|---|
| Perfectly accurate sliding window | High memory — stores every timestamp |
| No boundary burst problem | Cleanup overhead on every request |
4.3 Sliding Window Counter (Hybrid)
How it works: Combine fixed window counter with a weighted overlap from the previous window.
Previous window count: 80 (window was 60s, 45s have passed into current)
Current window count: 20 (15s into current window)
Estimated count = 80 × (45/60) + 20 = 60 + 20 = 80| Pros | Cons |
|---|---|
| Low memory (two counters per client) | Approximate (but very close in practice) |
| Smooths out boundary bursts | Slightly more complex logic |
4.4 Token Bucket
How it works: Each client has a bucket with capacity B tokens. Tokens are added at rate R per second. Each request consumes 1 token. If the bucket is empty, reject.
Bucket capacity: 10
Refill rate: 2 tokens/sec
[t=0] Bucket: 10 → Request arrives → Bucket: 9 ✓
[t=0] Bucket: 9 → Request arrives → Bucket: 8 ✓
...
[t=0] Bucket: 0 → Request arrives → REJECTED ✗
[t=1] Bucket: 2 (refilled) → Request arrives → Bucket: 1 ✓Data structure:
Key: "{client_id}"
Value: { tokens: float, last_refill_timestamp: epoch }No background thread needed — compute tokens lazily on each request:
elapsed = now - last_refill_timestamp
tokens = min(capacity, tokens + elapsed × refill_rate)| Pros | Cons |
|---|---|
| Allows controlled bursts (up to bucket size) | Two parameters to tune (capacity + rate) |
| Smooth rate enforcement | Slightly harder to reason about than counters |
| Very low memory (2 values per client) | |
| Industry standard (AWS, Stripe use this) |
Algorithm Comparison Summary
| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
|---|---|---|---|---|
| Fixed Window | Low | Weak at edges | Poor | Very Low |
| Sliding Window Log | High | Exact | Good | Medium |
| Sliding Window Counter | Low | Approximate | Good | Low |
| Token Bucket | Low | Good | Controlled bursts | Medium |
v1 Decision: Token Bucket
Why: It’s the industry standard. Low memory. Allows controlled bursts (which is realistic — users often send a few requests quickly, then pause). Two values per client. No background threads.
5. Detailed Design (v1)
5.1 Core Data Model
RateLimitEntry {
tokens: float64 // current tokens available
last_refill_time: int64 // unix timestamp (milliseconds)
}
RateLimitConfig {
bucket_capacity: int // max burst size
refill_rate: float64 // tokens per second
}5.2 Algorithm Pseudocode
function is_allowed(client_id, config):
entry = store.get(client_id)
if entry is null:
entry = { tokens: config.bucket_capacity, last_refill_time: now() }
// Lazy refill
elapsed = now() - entry.last_refill_time
entry.tokens = min(config.bucket_capacity, entry.tokens + elapsed * config.refill_rate)
entry.last_refill_time = now()
if entry.tokens >= 1:
entry.tokens -= 1
store.set(client_id, entry)
return ALLOWED // remaining = floor(entry.tokens)
else:
store.set(client_id, entry)
return REJECTED // retry_after = (1 - entry.tokens) / config.refill_rate5.3 HTTP Response Design
Allowed (200/2xx):
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1708425600Rejected (429):
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708425600
Retry-After: 3
Content-Type: application/json
{
"error": "rate_limit_exceeded",
"message": "Too many requests. Please retry after 3 seconds.",
"retry_after_seconds": 3
}5.4 Client Identification Strategy
| Method | Use Case | Limitation |
|---|---|---|
| API Key | Authenticated APIs | Doesn’t work for unauthenticated endpoints |
| User ID | Per-user limits after auth | Same — requires authentication |
| IP Address | Unauthenticated / pre-auth endpoints | Shared IPs (NAT, corporate proxies) punish all users behind them |
| Composite Key | "{user_id}:{endpoint}" | Best for per-endpoint-per-user limiting |
v1 decision: Use API Key as primary identifier. Fall back to IP for unauthenticated endpoints.
5.5 Configuration Example
rate_limits:
default:
bucket_capacity: 100
refill_rate: 10 # 10 requests/sec sustained, 100 burst
tiers:
free:
bucket_capacity: 20
refill_rate: 2
pro:
bucket_capacity: 200
refill_rate: 50
enterprise:
bucket_capacity: 1000
refill_rate: 200
endpoint_overrides:
"/api/v1/login":
bucket_capacity: 5 # Strict — prevent brute force
refill_rate: 0.1 # 1 attempt per 10 seconds sustained
"/api/v1/search":
bucket_capacity: 30
refill_rate: 56. Storage Choice (v1)
| Option | Latency | Persistence | Fit for v1? |
|---|---|---|---|
| In-process HashMap | ~nanoseconds | None (lost on restart) | Yes — simplest |
| Redis | ~0.5ms | Optional | Overkill for single node |
| Memcached | ~0.5ms | None | Overkill for single node |
v1 decision: In-process HashMap (or ConcurrentHashMap in Java / sync.Map in Go).
Eviction: Use a TTL equal to bucket_capacity / refill_rate (time to fully refill). Clients who stop sending requests get cleaned up automatically. Alternatively, run a periodic cleanup goroutine/thread every 60 seconds.
7. Failure Mode: Fail-Open vs Fail-Closed
This is a critical design decision even for v1.
| Strategy | Behavior when limiter fails | Risk |
|---|---|---|
| Fail-open | Allow all requests through | System can be overwhelmed |
| Fail-closed | Reject all requests | Legitimate users blocked |
v1 decision: Fail-open with alerting. Rationale: The rate limiter protects, but it should never become the reason the service is down. Log aggressively when the limiter is unavailable so ops can react.
8. What This Design Handles
- Per-client rate limiting (by API key or IP)
- Configurable limits per tier and endpoint
- Controlled bursts via token bucket
- Clear rejection response with retry guidance
- Low latency (in-memory, no network hop)
- Bounded memory with TTL-based eviction
9. What This Design Does NOT Handle (Yet)
These are intentionally deferred to keep v1 simple. Each becomes a future iteration.
| Gap | Why it matters | Future iteration |
|---|---|---|
| Distributed rate limiting | Multiple app servers each have their own counters — a client gets N × limit | v2: Centralized store (Redis) |
| Race conditions | Concurrent requests on same node can read stale token count | v2: Atomic operations / Lua scripts |
| Global rate limiting | Limit across all clients (e.g., total 10K req/s to protect DB) | v3: Global token bucket |
| Sliding window precision | Token bucket allows bursts; some use cases need strict per-second caps | v3: Hybrid algorithm |
| Multi-region | Users hitting different data centers get separate limits | v4: Cross-DC synchronization |
| Adaptive / dynamic limits | Adjusting limits based on system health (CPU, queue depth) | v4: Feedback loop |
| Request prioritization | Not all requests are equal — health checks vs writes | v3: Priority queues |
| Analytics & observability | Who’s being throttled? How often? | v2: Metrics + dashboards |
10. Quick Estimation (Back-of-Envelope)
Assumptions: 1M active users, rate limit check on every request.
| Dimension | Calculation | Result |
|---|---|---|
| Memory per entry | client_id (64B) + tokens (8B) + timestamp (8B) | ~80 bytes |
| Total memory (1M users) | 80B × 1,000,000 | ~80 MB |
| Latency per check | HashMap lookup + arithmetic | < 1 microsecond |
| Throughput | CPU-bound; single core can do millions of lookups/sec | Not a bottleneck |
This fits comfortably in a single server’s memory. The rate limiter will never be the bottleneck at this scale.
11. Interview Discussion Points
When presenting this design, a staff/principal engineer would highlight:
- “I chose token bucket because…” — shows you evaluated alternatives and made a reasoned tradeoff.
- Fail-open vs fail-closed — shows you think about failure modes proactively.
- “This breaks down when…” — acknowledging the single-node limitation and having a clear v2 plan shows architectural maturity.
- Response headers — shows you think about developer experience, not just backend plumbing.
- Per-endpoint overrides — shows you understand that not all traffic is equal (login vs search).
Next: v2 — Distributed Rate Limiting
When we’re ready, the next iteration will tackle:
- Moving from in-memory to Redis (centralized store)
- Handling race conditions with atomic operations (Redis Lua scripts /
MULTI/EXEC) - Consistency vs availability tradeoffs in the rate limiter itself
- Observability: metrics, dashboards, alerting on throttle rates