Production Concerns

Rate Limiting

● Intermediate ⏱ 13 min read production

Rate limiting controls how many requests a client can make to an API within a time window. Without it, a single misbehaving client — or a misconfigured script, a denial-of-service attack, or a traffic spike — can saturate your backend and degrade service for all users. Rate limiting is a fundamental protection layer for any publicly accessible API.

Why Rate Limit

Rate limiting serves several distinct purposes:

Four rate limiting algorithms: token bucket, leaky bucket, fixed window, and sliding window

Token Bucket

The token bucket algorithm is the most common choice. Each client has a bucket with a maximum capacity of N tokens. Tokens are added at a fixed rate (e.g., 10 tokens/second). Each request consumes one token. If the bucket has tokens, the request is allowed. If the bucket is empty, the request is rejected (HTTP 429).

bucket.capacity = 100        // max tokens
bucket.refill_rate = 10/sec  // tokens added per second
bucket.tokens = 100          // starts full

on each request:
  refill tokens based on elapsed time
  if bucket.tokens >= 1:
    bucket.tokens -= 1
    allow request
  else:
    reject with 429

Key property: Allows bursting. A client that hasn’t made any requests for 10 seconds accumulates 100 tokens (up to capacity) and can then make 100 requests in rapid succession. This is correct behavior for real clients — humans don’t send requests at a perfectly steady rate.

Implementation: Store (tokens, last_refill_timestamp) per client. On each request, compute tokens added since last_refill, add them (capped at capacity), then check and decrement. Atomic operations in Redis (INCR, Lua scripts, or Redis modules) make this concurrency-safe.

Leaky Bucket

The leaky bucket models a bucket with a hole at the bottom: requests enter at the top and leave (are processed) at a constant rate. If requests arrive faster than they drain, the bucket fills. When full, new requests overflow (are rejected).

Unlike the token bucket, the leaky bucket enforces a strictly constant output rate — no bursting allowed. The queue absorbs small spikes, but sustained bursts fill the queue and trigger rejections.

Use case: Systems where you want to guarantee a smooth, constant request rate to a downstream service — for example, protecting a rate-limited third-party API. If you can only call an SMS gateway 10 times per second, a leaky bucket ensures you never exceed that rate even during traffic spikes.

Fixed Window Counter

Divide time into fixed windows (e.g., each minute from :00 to :59). Track a request count per client per window. If the count exceeds the limit, reject. At window rollover, reset the counter.

window_key = client_id + ":" + current_minute  // e.g., "user_42:2024-01-15T14:37"
count = INCR(window_key)
EXPIRE(window_key, 60)
if count > limit:
  reject with 429

Problem — boundary burst: A client allowed 100 requests/minute can make 100 requests at 12:00:59 and another 100 at 12:01:00 — 200 requests in 2 seconds, all within their per-minute limits. This boundary burst can double the effective request rate at window edges.

Advantages: Extremely simple. Two Redis operations per request. Easy to understand and debug. Suitable when the boundary burst is acceptable (e.g., batch jobs, non-real-time APIs).

Sliding Window

The sliding window solves the boundary burst problem by tracking requests over a rolling time window rather than a fixed calendar window.

Sliding window log: Store the timestamp of every request in a sorted set. On each new request, remove timestamps older than the window duration, count remaining entries, and allow or reject.

// Redis sorted set: key = client_id, score = timestamp
ZREMRANGEBYSCORE client_id 0 (now - window_duration)
count = ZCARD client_id
if count < limit:
  ZADD client_id now now
  allow
else:
  reject with 429

This is accurate but memory-intensive — storing every request timestamp for every client. With a 1-minute window and 1,000 requests/minute limit, each client occupies up to 1,000 sorted set entries.

Sliding window counter (approximation): A memory-efficient alternative. Keep counters for the current and previous fixed windows. Estimate the rolling window count as:

count = prev_count × (time_remaining_in_prev_window / window_duration)
      + curr_count

This approximation is accurate to within ~0.1% for most traffic patterns and requires only two counters per client — practical for production at scale.

Algorithm Comparison

Token BucketLeaky BucketFixed WindowSliding Window
Burst allowedYes (up to capacity)No (constant rate)Yes (at boundaries)No (rolling)
AccuracyHighHighLow (boundary burst)High (log) / Good (approx)
MemoryO(1) per clientO(queue size)O(1) per clientO(requests) or O(1) approx
ComplexityLowLowVery lowMedium
Best forGeneral API limitingSmooth downstream rateSimple, non-criticalStrict per-user enforcement

Distributed Rate Limiting

In-process rate limiting (a counter in application memory) breaks when you run multiple server instances — each server tracks its own counter, so a client can send N requests to each of K servers for a total of N×K requests. Distributed rate limiting stores state in a shared store visible to all instances.

Redis-based rate limiting is the industry standard. Redis is single-threaded (no race conditions), sub-millisecond latency, and supports atomic operations via Lua scripts or Redis modules.

-- Lua script for atomic token bucket in Redis
local tokens = tonumber(redis.call('GET', KEYS[1]) or ARGV[1])
local now = tonumber(ARGV[2])
local last = tonumber(redis.call('GET', KEYS[2]) or now)
local rate = tonumber(ARGV[3])
local capacity = tonumber(ARGV[4])

local elapsed = now - last
tokens = math.min(capacity, tokens + elapsed * rate)

if tokens >= 1 then
  tokens = tokens - 1
  redis.call('SET', KEYS[1], tokens)
  redis.call('SET', KEYS[2], now)
  return 1  -- allowed
else
  return 0  -- rejected
end

Redis rate limit modules: redis-cell implements token bucket natively as a Redis command (CL.THROTTLE). It handles all the atomic math server-side in a single command.

Rate limiting at the API gateway: Kong, AWS API Gateway, nginx, and most API gateways support rate limiting natively, backed by their own Redis or in-memory stores. This offloads the implementation from application code and centralizes enforcement.

Design Considerations