
Iteration: v1 — Basic Design
Next: Distributed rate limiting, multi-tier strategies, adaptive throttling


1. Problem Statement

A rate limiter controls the number of requests a client can send to a server within a defined time window. Without one, a single client (malicious or buggy) can overwhelm the system, degrading service for everyone.

When do you need a rate limiter?

| Scenario | Example |
| --- | --- |
| Prevent abuse / DDoS | Bot hammering login endpoint |
| Protect shared resources | Database connection pool exhaustion |
| Enforce billing tiers | Free tier: 100 req/min, Pro: 10K req/min |
| Ensure fairness | One tenant on a multi-tenant platform shouldn’t starve others |
| Cost control | Upstream API charges per call (e.g., OpenAI, Twilio) |

2. Requirements

2.1 Functional Requirements

  • FR1: Limit requests per client (identified by user ID, API key, or IP) to N requests per time window.
  • FR2: Return a clear rejection response (HTTP 429) when the limit is exceeded.
  • FR3: Support configurable limits — different endpoints or user tiers can have different limits.
  • FR4: Include rate limit metadata in response headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).

2.2 Non-Functional Requirements (Basic)

| NFR | Target | Why it matters |
| --- | --- | --- |
| Latency | < 1ms per check | Rate limiter sits in the hot path of every request |
| Accuracy | Best-effort (slight over-count is acceptable) | Exact counting is expensive; a few extra requests slipping through is fine |
| Availability | Should not become a single point of failure | If the limiter goes down, we need a fallback policy (fail-open vs fail-closed) |
| Memory | Bounded, proportional to active clients | Can’t allocate unbounded memory per IP/user |

Parking for v2: Strong consistency across nodes, geo-distributed limiting, sub-second window granularity.


3. High-Level Architecture (v1 — Single Node)

```
┌─────────────────────────┐
│         Client          │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      API Gateway /      │
│      Reverse Proxy      │
│  (e.g., Nginx, Envoy)   │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      Rate Limiter       │
│       Middleware        │
│                         │
│  ┌───────────────────┐  │
│  │  In-Memory Store  │  │
│  │  (HashMap/Cache)  │  │
│  └───────────────────┘  │
└────────────┬────────────┘
             │
      ┌──────┴──────┐
      │             │
   ALLOWED       REJECTED
      │             │
      ▼             ▼
┌───────────┐  ┌──────────┐
│    App    │  │ HTTP 429 │
│  Server   │  │ Too Many │
│           │  │ Requests │
└───────────┘  └──────────┘
```

Where does the rate limiter live?

| Option | Pros | Cons |
| --- | --- | --- |
| Client-side | Reduces unnecessary network calls | Can be bypassed; not trustworthy |
| Server middleware | Simple to implement; colocated with app logic | Coupled to app; doesn’t protect upstream |
| API Gateway | Centralized; language-agnostic; protects all services | Another infra component to manage |

v1 decision: Server middleware (simplest to start). We’ll move to API Gateway in v2.


4. Rate Limiting Algorithms

This is the core of the design. There are four well-known algorithms. Let’s understand each and pick one for v1.

4.1 Fixed Window Counter

How it works: Divide time into fixed windows (e.g., each minute). Maintain a counter per client per window. Increment on each request. Reject if counter > limit.

```
Window: [12:00:00 — 12:01:00]   Counter: 0 → 1 → 2 → ... → 100 → REJECT
Window: [12:01:00 — 12:02:00]   Counter resets to 0
```

Data structure:

```
Key:   "{client_id}:{window_start_timestamp}"
Value: count (integer)
```

| Pros | Cons |
| --- | --- |
| Dead simple | Boundary burst problem: 100 requests at 12:00:59 + 100 at 12:01:00 = 200 in 2 seconds |
| Low memory (1 entry per client per window) | Unfair at window edges |
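The fixed window counter fits in a few lines. Here is an illustrative single-node sketch (class and parameter names are my own, not from any particular library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        # {(client_id, window_start): count} — one entry per client per window
        self.counters = defaultdict(int)

    def is_allowed(self, client_id, now=None):
        now = time.time() if now is None else now
        # Bucket the timestamp into its fixed window start.
        window_start = int(now // self.window) * self.window
        key = (client_id, window_start)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True
```

Note that nothing here prevents the boundary burst described in the table: the counter simply resets when `window_start` changes.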

4.2 Sliding Window Log

How it works: Store the timestamp of every request. On a new request, remove timestamps older than now - window_size. If remaining count >= limit, reject.

```
Timestamps: [12:00:01, 12:00:05, 12:00:12, ..., 12:00:58]
New request at 12:01:02 → remove everything before 12:00:02 → count remaining
```

Data structure:

```
Key:   "{client_id}"
Value: sorted set of timestamps
```

| Pros | Cons |
| --- | --- |
| Perfectly accurate sliding window | High memory — stores every timestamp |
| No boundary burst problem | Cleanup overhead on every request |
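A minimal sketch of the log approach. It assumes requests arrive in timestamp order, so a deque can stand in for the sorted set (names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client_id -> request timestamps

    def is_allowed(self, client_id, now=None):
        now = time.time() if now is None else now
        log = self.logs[client_id]
        # Cleanup on every request: drop timestamps outside the sliding window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible directly: every allowed request adds one timestamp per client.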

4.3 Sliding Window Counter (Hybrid)

How it works: Combine fixed window counter with a weighted overlap from the previous window.

```
Previous window count: 80
Current window count:  20   (15s into the current 60s window, so 45s of the
                             sliding window still overlap the previous one)
Estimated count = 80 × (45/60) + 20 = 60 + 20 = 80
```

| Pros | Cons |
| --- | --- |
| Low memory (two counters per client) | Approximate (but very close in practice) |
| Smooths out boundary bursts | Slightly more complex logic |
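The weighted estimate is one line of arithmetic. A sketch reproducing the worked example above (function and parameter names are illustrative):

```python
def estimated_count(prev_count, curr_count, window_seconds, elapsed_in_current):
    # Fraction of the sliding window that still overlaps the previous window.
    overlap = (window_seconds - elapsed_in_current) / window_seconds
    return prev_count * overlap + curr_count

# 15s into the current 60s window, previous window saw 80, current saw 20:
print(estimated_count(80, 20, 60, 15))  # 80 × (45/60) + 20 = 80.0
```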

4.4 Token Bucket

How it works: Each client has a bucket with capacity B tokens. Tokens are added at rate R per second. Each request consumes 1 token. If the bucket is empty, reject.

```
Bucket capacity: 10
Refill rate: 2 tokens/sec

[t=0] Bucket: 10 → Request arrives → Bucket: 9 ✓
[t=0] Bucket: 9  → Request arrives → Bucket: 8 ✓
...
[t=0] Bucket: 0  → Request arrives → REJECTED ✗
[t=1] Bucket: 2 (refilled) → Request arrives → Bucket: 1 ✓
```

Data structure:

```
Key:   "{client_id}"
Value: { tokens: float, last_refill_timestamp: epoch }
```

No background thread needed — compute tokens lazily on each request:

```
elapsed = now - last_refill_timestamp
tokens  = min(capacity, tokens + elapsed × refill_rate)
```
| Pros | Cons |
| --- | --- |
| Allows controlled bursts (up to bucket size) | Two parameters to tune (capacity + rate) |
| Smooth rate enforcement | Slightly harder to reason about than counters |
| Very low memory (2 values per client) | |
| Industry standard (AWS, Stripe use this) | |

Algorithm Comparison Summary

| Algorithm | Memory | Accuracy | Burst Handling | Complexity |
| --- | --- | --- | --- | --- |
| Fixed Window | Low | Weak at edges | Poor | Very Low |
| Sliding Window Log | High | Exact | Good | Medium |
| Sliding Window Counter | Low | Approximate | Good | Low |
| Token Bucket | Low | Good | Controlled bursts | Medium |

v1 Decision: Token Bucket

Why: It’s the industry standard. Low memory. Allows controlled bursts (which is realistic — users often send a few requests quickly, then pause). Two values per client. No background threads.


5. Detailed Design (v1)

5.1 Core Data Model

```
RateLimitEntry {
    tokens: float64            // current tokens available
    last_refill_time: int64    // unix timestamp (milliseconds)
}

RateLimitConfig {
    bucket_capacity: int       // max burst size
    refill_rate: float64       // tokens per second
}
```

5.2 Algorithm Pseudocode

```
function is_allowed(client_id, config):
    entry = store.get(client_id)
    if entry is null:
        entry = { tokens: config.bucket_capacity, last_refill_time: now() }

    // Lazy refill
    elapsed = now() - entry.last_refill_time
    entry.tokens = min(config.bucket_capacity,
                       entry.tokens + elapsed * config.refill_rate)
    entry.last_refill_time = now()

    if entry.tokens >= 1:
        entry.tokens -= 1
        store.set(client_id, entry)
        return ALLOWED   // remaining = floor(entry.tokens)
    else:
        store.set(client_id, entry)
        return REJECTED  // retry_after = (1 - entry.tokens) / config.refill_rate
```
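The pseudocode translates almost line-for-line into Python. A single-threaded sketch with illustrative names (a real deployment needs locking or atomic operations, per the race-condition gap deferred to v2):

```python
import time

class TokenBucketLimiter:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens per second
        self.entries = {}               # client_id -> (tokens, last_refill)

    def is_allowed(self, client_id, now=None):
        now = time.time() if now is None else now
        # New clients start with a full bucket.
        tokens, last = self.entries.get(client_id, (self.capacity, now))
        # Lazy refill: no background thread needed.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        if tokens >= 1:
            self.entries[client_id] = (tokens - 1, now)
            return True
        self.entries[client_id] = (tokens, now)
        return False
```

Because tokens are a float, partial refills accumulate correctly between requests even when `refill_rate` is below 1 token/sec (as in the login override later in this section).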

5.3 HTTP Response Design

Allowed (200/2xx):

```
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 42
X-RateLimit-Reset: 1708425600
```

Rejected (429):

```
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1708425600
Retry-After: 3
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 3 seconds.",
  "retry_after_seconds": 3
}
```
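The `Retry-After` value falls out of the bucket state using the formula noted in the pseudocode comment: `(1 - tokens) / refill_rate`. An illustrative helper (function and field names are my own, not a real framework API):

```python
import math

def rejection_response(limit, tokens, refill_rate, reset_epoch):
    # Seconds until one full token is available, rounded up for the client.
    retry_after = math.ceil((1 - tokens) / refill_rate)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_epoch),
        "Retry-After": str(retry_after),
    }
    body = {
        "error": "rate_limit_exceeded",
        "message": f"Too many requests. Please retry after {retry_after} seconds.",
        "retry_after_seconds": retry_after,
    }
    return 429, headers, body
```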

5.4 Client Identification Strategy

| Method | Use Case | Limitation |
| --- | --- | --- |
| API Key | Authenticated APIs | Doesn’t work for unauthenticated endpoints |
| User ID | Per-user limits after auth | Same — requires authentication |
| IP Address | Unauthenticated / pre-auth endpoints | Shared IPs (NAT, corporate proxies) punish all users behind them |
| Composite Key ("{user_id}:{endpoint}") | Best for per-endpoint-per-user limiting | — |

v1 decision: Use API Key as primary identifier. Fall back to IP for unauthenticated endpoints.
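That decision can be sketched as a key-derivation function. The header name `X-API-Key` is an assumption for illustration; use whatever your auth layer provides:

```python
def rate_limit_key(headers, remote_ip, endpoint=None):
    # Prefer the API key when present (header name is illustrative);
    # fall back to client IP for unauthenticated endpoints.
    api_key = headers.get("X-API-Key")
    identity = f"key:{api_key}" if api_key else f"ip:{remote_ip}"
    # Composite-key variant for per-endpoint-per-client limits.
    return f"{identity}:{endpoint}" if endpoint else identity
```

Prefixing with `key:` / `ip:` keeps the two namespaces from colliding in the shared store.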

5.5 Configuration Example

```yaml
rate_limits:
  default:
    bucket_capacity: 100
    refill_rate: 10        # 10 requests/sec sustained, 100 burst
  tiers:
    free:
      bucket_capacity: 20
      refill_rate: 2
    pro:
      bucket_capacity: 200
      refill_rate: 50
    enterprise:
      bucket_capacity: 1000
      refill_rate: 200
  endpoint_overrides:
    "/api/v1/login":
      bucket_capacity: 5   # Strict — prevent brute force
      refill_rate: 0.1     # 1 attempt per 10 seconds sustained
    "/api/v1/search":
      bucket_capacity: 30
      refill_rate: 5
```
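One question the config leaves open is precedence. A plausible resolution order, sketched here as an assumption (endpoint override wins over tier, tier over default), would look like:

```python
def resolve_limit(config, tier=None, endpoint=None):
    # Precedence (illustrative): endpoint override > tier > default.
    if endpoint and endpoint in config.get("endpoint_overrides", {}):
        return config["endpoint_overrides"][endpoint]
    if tier and tier in config.get("tiers", {}):
        return config["tiers"][tier]
    return config["default"]

# A trimmed-down version of the YAML above, loaded as a dict:
config = {
    "default": {"bucket_capacity": 100, "refill_rate": 10},
    "tiers": {"free": {"bucket_capacity": 20, "refill_rate": 2}},
    "endpoint_overrides": {
        "/api/v1/login": {"bucket_capacity": 5, "refill_rate": 0.1},
    },
}
```

Whether a login override should apply the strictest of the two matching limits instead is a product decision worth making explicit.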

6. Storage Choice (v1)

| Option | Latency | Persistence | Fit for v1? |
| --- | --- | --- | --- |
| In-process HashMap | ~nanoseconds | None (lost on restart) | Yes — simplest |
| Redis | ~0.5ms | Optional | Overkill for single node |
| Memcached | ~0.5ms | None | Overkill for single node |

v1 decision: In-process HashMap (or ConcurrentHashMap in Java / sync.Map in Go).

Eviction: Use a TTL equal to bucket_capacity / refill_rate (time to fully refill). Clients who stop sending requests get cleaned up automatically. Alternatively, run a periodic cleanup goroutine/thread every 60 seconds.
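The periodic-cleanup variant is a few lines. An illustrative sketch assuming entries are stored as `(tokens, last_refill_timestamp)` tuples:

```python
import time

def evict_idle(entries, capacity, refill_rate, now=None):
    """Drop entries idle longer than the full-refill time; a returning
    client would start with a full bucket anyway, so nothing is lost."""
    now = time.time() if now is None else now
    ttl = capacity / refill_rate  # seconds to refill an empty bucket
    stale = [cid for cid, (_, last) in entries.items() if now - last > ttl]
    for cid in stale:
        del entries[cid]
    return len(stale)
```

Run this from a background thread or goroutine on the 60-second cadence mentioned above; because eviction only removes fully-refillable entries, it never changes an allow/reject decision.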


7. Failure Mode: Fail-Open vs Fail-Closed

This is a critical design decision even for v1.

| Strategy | Behavior when limiter fails | Risk |
| --- | --- | --- |
| Fail-open | Allow all requests through | System can be overwhelmed |
| Fail-closed | Reject all requests | Legitimate users blocked |

v1 decision: Fail-open with alerting. Rationale: The rate limiter protects, but it should never become the reason the service is down. Log aggressively when the limiter is unavailable so ops can react.
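In code, fail-open is a thin wrapper around the limiter check. A sketch (names are illustrative; the logging call stands in for whatever alerting pipeline you have):

```python
import logging

def check_with_fail_open(limiter, client_id):
    try:
        return limiter.is_allowed(client_id)
    except Exception:
        # Log loudly so ops can react, but never let the limiter
        # itself be the reason the service is down.
        logging.exception("rate limiter unavailable; failing open")
        return True
```

Flipping the `return True` to `return False` in the except branch is the entire difference between fail-open and fail-closed, which is why the policy deserves an explicit, documented decision rather than an accident of the code path.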


8. What This Design Handles

  • Per-client rate limiting (by API key or IP)
  • Configurable limits per tier and endpoint
  • Controlled bursts via token bucket
  • Clear rejection response with retry guidance
  • Low latency (in-memory, no network hop)
  • Bounded memory with TTL-based eviction

9. What This Design Does NOT Handle (Yet)

These are intentionally deferred to keep v1 simple. Each becomes a future iteration.

| Gap | Why it matters | Future iteration |
| --- | --- | --- |
| Distributed rate limiting | Multiple app servers each have their own counters — a client gets N × limit | v2: Centralized store (Redis) |
| Race conditions | Concurrent requests on same node can read stale token count | v2: Atomic operations / Lua scripts |
| Global rate limiting | Limit across all clients (e.g., total 10K req/s to protect DB) | v3: Global token bucket |
| Sliding window precision | Token bucket allows bursts; some use cases need strict per-second caps | v3: Hybrid algorithm |
| Multi-region | Users hitting different data centers get separate limits | v4: Cross-DC synchronization |
| Adaptive / dynamic limits | Adjusting limits based on system health (CPU, queue depth) | v4: Feedback loop |
| Request prioritization | Not all requests are equal — health checks vs writes | v3: Priority queues |
| Analytics & observability | Who’s being throttled? How often? | v2: Metrics + dashboards |

10. Quick Estimation (Back-of-Envelope)

Assumptions: 1M active users, rate limit check on every request.

| Dimension | Calculation | Result |
| --- | --- | --- |
| Memory per entry | client_id (64B) + tokens (8B) + timestamp (8B) | ~80 bytes |
| Total memory (1M users) | 80B × 1,000,000 | ~80 MB |
| Latency per check | HashMap lookup + arithmetic | < 1 microsecond |
| Throughput | CPU-bound; single core can do millions of lookups/sec | Not a bottleneck |

This fits comfortably in a single server’s memory. The rate limiter will never be the bottleneck at this scale.


11. Interview Discussion Points

When presenting this design, a staff/principal engineer would highlight:

  1. “I chose token bucket because…” — shows you evaluated alternatives and made a reasoned tradeoff.
  2. Fail-open vs fail-closed — shows you think about failure modes proactively.
  3. “This breaks down when…” — acknowledging the single-node limitation and having a clear v2 plan shows architectural maturity.
  4. Response headers — shows you think about developer experience, not just backend plumbing.
  5. Per-endpoint overrides — shows you understand that not all traffic is equal (login vs search).

Next: v2 — Distributed Rate Limiting

When we’re ready, the next iteration will tackle:

  • Moving from in-memory to Redis (centralized store)
  • Handling race conditions with atomic operations (Redis Lua scripts / MULTI/EXEC)
  • Consistency vs availability tradeoffs in the rate limiter itself
  • Observability: metrics, dashboards, alerting on throttle rates