Production Concerns

Rate Limiting

● Intermediate ⏱ 13 min read production

Rate limiting controls how many requests a client can make to an API within a time window. Without it, a single misbehaving client — or a misconfigured script, a denial-of-service attack, or a traffic spike — can saturate your backend and degrade service for all users. Rate limiting is a fundamental protection layer for any publicly accessible API.

Why Rate Limit

Rate limiting serves several distinct purposes:

DoS protection: Prevents a single IP or client from flooding the API with enough requests to exhaust server capacity.
Fair usage: In a multi-tenant API, one tenant’s burst shouldn’t degrade other tenants. Limits ensure equitable resource distribution.
Cost control: APIs that call expensive downstream services (LLM inference, SMS, payment processors) need limits to prevent runaway costs from bugs or abuse.
Backpressure: Limits enforce a maximum load your backend is designed to handle, preventing the overload conditions that lead to cascading failures.
SLA tiers: Free-tier clients get 100 req/min; paid clients get 1,000 req/min. Rate limiting is the enforcement mechanism for pricing tiers.

Four rate limiting algorithms: token bucket, leaky bucket, fixed window, and sliding window

Token Bucket

The token bucket algorithm is the most common choice. Each client has a bucket with a maximum capacity of N tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket has tokens, the request is allowed. If the bucket is empty, the request is rejected (HTTP 429).

bucket.capacity = 100        // max tokens
bucket.refill_rate = 10/sec  // tokens added per second
bucket.tokens = 100          // starts full

on each request:
  refill tokens based on elapsed time
  if bucket.tokens >= 1:
    bucket.tokens -= 1
    allow request
  else:
    reject with 429

Key property: Allows bursting. A client that hasn’t made any requests for 10 seconds accumulates 100 tokens (up to capacity) and can then make 100 requests in rapid succession. This is correct behavior for real clients — humans don’t send requests at a perfectly steady rate.

Implementation: Store (tokens, last_refill_timestamp) per client. On each request, compute tokens added since last_refill, add them (capped at capacity), then check and decrement. Atomic operations in Redis (INCR, Lua scripts, or Redis modules) make this concurrency-safe.

Leaky Bucket

The leaky bucket models a bucket with a hole at the bottom: requests enter at the top and leave (are processed) at a constant rate. If requests arrive faster than they drain, the bucket fills. When full, new requests overflow (are rejected).

Unlike the token bucket, the leaky bucket enforces a strictly constant output rate — no bursting allowed. The queue absorbs small spikes, but sustained bursts fill the queue and trigger rejections.

Use case: Systems where you want to guarantee a smooth, constant request rate to a downstream service — for example, protecting a rate-limited third-party API. If you can only call an SMS gateway 10 times per second, a leaky bucket ensures you never exceed that rate even during traffic spikes.

Fixed Window Counter

Divide time into fixed windows (e.g., each minute from :00 to :59). Track a request count per client per window. If the count exceeds the limit, reject. At window rollover, reset the counter.

window_key = client_id + ":" + current_minute  // e.g., "user_42:2024-01-15T14:37"
count = INCR(window_key)
EXPIRE(window_key, 60)
if count > limit:
  reject with 429

Problem — boundary burst: A client allowed 100 requests/minute can make 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 2 seconds, all within their per-minute limits. This boundary burst can double the effective request rate at window edges.

Advantages: Extremely simple. Two Redis operations per request. Easy to understand and debug. Suitable when the boundary burst is acceptable (e.g., batch jobs, non-real-time APIs).

Sliding Window

The sliding window solves the boundary burst problem by tracking requests over a rolling time window rather than a fixed calendar window.

Sliding window log: Store the timestamp of every request in a sorted set. On each new request, remove timestamps older than the window duration, count remaining entries, and allow or reject.

// Redis sorted set: key = client_id, score = timestamp
ZREMRANGEBYSCORE client_id 0 (now - window_duration)
count = ZCARD client_id
if count < limit:
  ZADD client_id now now
  allow
else:
  reject with 429

This is accurate but memory-intensive — storing every request timestamp for every client. With a 1-minute window and 1,000 requests/minute limit, each client occupies up to 1,000 sorted set entries.

Sliding window counter (approximation): A memory-efficient alternative. Keep counters for the current and previous fixed windows. Estimate the rolling window count as:

count = prev_count × (time_remaining_in_prev_window / window_duration)
      + curr_count

This approximation is accurate to within ~0.1% for most traffic patterns and requires only two counters per client — practical for production at scale.

Algorithm Comparison

	Token Bucket	Leaky Bucket	Fixed Window	Sliding Window
Burst allowed	Yes (up to capacity)	No (constant rate)	Yes (at boundaries)	No (rolling)
Accuracy	High	High	Low (boundary burst)	High (log) / Good (approx)
Memory	O(1) per client	O(queue size)	O(1) per client	O(requests) or O(1) approx
Complexity	Low	Low	Very low	Medium
Best for	General API limiting	Smooth downstream rate	Simple, non-critical	Strict per-user enforcement

Distributed Rate Limiting

In-process rate limiting (a counter in application memory) breaks when you run multiple server instances — each server tracks its own counter, so a client can send N requests to each of K servers for a total of N×K requests. Distributed rate limiting stores state in a shared store visible to all instances.

Redis-based rate limiting is the industry standard. Redis is single-threaded (no race conditions), sub-millisecond latency, and supports atomic operations via Lua scripts or Redis modules.

-- Lua script for atomic token bucket in Redis
local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
local now = tonumber(ARGV[2])
local last = tonumber(redis.call('GET', KEYS[2]) or now)
local rate = tonumber(ARGV[3])
local capacity = tonumber(ARGV[4])

local elapsed = now - last
tokens = math.min(capacity, tokens + elapsed * rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('SET', KEYS[1], tokens)
  redis.call('SET', KEYS[2], now)
  return 1  -- allowed
else
  return 0  -- rejected
end

Redis rate limit modules: redis-cell implements token bucket natively as a Redis command (CL.THROTTLE). It handles all the atomic math server-side in a single command.

Rate limiting at the API gateway: Kong, AWS API Gateway, nginx, and most API gateways support rate limiting natively, backed by their own Redis or in-memory stores. This offloads the implementation from application code and centralizes enforcement.

Design Considerations

Return informative headers. Always include X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After in 429 responses. Well-behaved clients use these to back off gracefully. Without them, clients retry immediately and make the problem worse.
Limit on multiple dimensions. A single global limit is easy to bypass. Layer limits: per IP (for unauthenticated), per API key (for authenticated), per endpoint (expensive endpoints get tighter limits), and per tenant. A client within their per-key limit may still be rate-limited per IP if they share an IP with other abusers.
Distinguish rate limiting from throttling. Rate limiting rejects excess requests immediately (HTTP 429). Throttling queues them and serves them at a reduced rate. Use rate limiting for protection against abuse; use throttling for smoothing bursty legitimate traffic before a capacity-constrained downstream.
Whitelist internal traffic. Internal service-to-service calls shouldn’t hit the same rate limits as external clients. Identify internal traffic by a shared secret header or IP range and bypass or apply higher limits.
Graceful degradation under limit. For user-facing features, consider serving cached or degraded responses rather than raw 429 errors. A search that returns slightly stale results is better than an error page.