Message Brokers, Queues & Pub-Sub
Asynchronous messaging lets services communicate without waiting for each other. Instead of Service A calling Service B directly and blocking until B responds, A drops a message into a broker and continues. B picks up the message when it’s ready. This decoupling is one of the most powerful tools in distributed system design — it absorbs traffic spikes, tolerates downstream failures, and eliminates tight coupling between services.
Why Asynchronous Messaging
Synchronous communication (HTTP, gRPC) is simple and intuitive, but it creates temporal coupling: both services must be available at the same moment. If Service B is slow, Service A waits. If B is down, A fails. If traffic spikes, B gets overwhelmed before A can back off.
Asynchronous messaging breaks this coupling:
- Temporal decoupling: Producer and consumer don’t need to be online simultaneously. Messages wait in the broker until the consumer is ready.
- Load leveling: A producer can burst thousands of messages into the queue; consumers process at their own sustainable rate. The queue absorbs the spike.
- Failure isolation: If a consumer crashes, messages accumulate in the broker. When the consumer recovers, it resumes from where it left off — no messages lost.
- Rate limiting at the producer: Backpressure from a full queue signals the producer to slow down, preventing cascading overload.
Message Queues
A message queue is a point-to-point channel: one producer sends a message, exactly one consumer receives it. The queue holds messages until a consumer acknowledges receipt. This is the work-queue pattern — distributing tasks across a pool of workers.
How it works:
- Producer sends a message to the queue.
- The broker stores the message durably.
- One consumer from the pool receives the message (the broker decides which).
- Consumer processes the message and sends an acknowledgment (ack).
- The broker deletes the message after ack. If no ack arrives within a timeout (visibility timeout), the broker re-delivers to another consumer.
Use cases: background job processing (image resizing, email sending), task offloading from web servers, work distribution across worker pools, retry logic for unreliable operations.
Key properties:
- Competing consumers: Multiple consumers read from the same queue; each message is processed by exactly one. Add more consumers to scale throughput.
- Dead letter queue (DLQ): After N failed delivery attempts, the broker moves the message to a DLQ for inspection. Prevents a poison message from blocking the queue forever.
- Visibility timeout: After a consumer receives a message, it becomes invisible to others for a configurable period. If the consumer doesn’t ack within that window, the message reappears for redelivery.
Publish-Subscribe
In the pub-sub pattern, producers publish messages to a topic. Multiple independent subscribers each receive a copy of every message. One event fans out to many consumers.
How it works:
- Producer publishes an event to a topic (e.g.,
order.placed). - The broker delivers a copy to every subscriber of that topic.
- Each subscriber processes its copy independently.
Use cases: event notifications (an order placed triggers inventory, email, analytics, and fraud detection simultaneously), real-time feeds, cache invalidation across multiple services, audit logging.
Subscriptions and filtering: Most brokers support subscription filters so a subscriber only receives messages matching certain attributes (e.g., only orders over $500, or only events from a specific region), reducing processing overhead.
Queues vs Pub-Sub
| Message Queue | Pub-Sub | |
|---|---|---|
| Consumers per message | One (competing consumers) | Many (each subscriber gets a copy) |
| Pattern | Point-to-point (task distribution) | Fan-out (event notification) |
| Message retention | Deleted after ack | Retained per subscriber until each acks |
| Producer awareness | Producer targets a specific queue | Producer targets a topic; doesn’t know subscribers |
| Use case | Job queues, background tasks | Event notifications, fan-out |
Many brokers support both patterns. Kafka uses consumer groups: within a group, only one consumer reads each partition (queue-like); multiple independent groups each read all messages (pub-sub-like). This makes Kafka extremely flexible.
Delivery Guarantees
Brokers differ in what they promise about message delivery:
At-most-once: The broker delivers the message once and forgets it. If the consumer crashes before processing, the message is lost. Fast, but suitable only when occasional loss is acceptable (metrics, telemetry).
At-least-once: The broker retains the message until it receives an ack. If no ack arrives, it redelivers. Messages may be delivered multiple times (if the consumer crashes after processing but before acking). Consumers must be idempotent — processing the same message twice must produce the same result as processing it once.
Exactly-once: Each message is processed exactly once. Hard to achieve in distributed systems. Kafka supports exactly-once semantics within its own ecosystem using transactions and idempotent producers. Cross-system exactly-once (e.g., Kafka to a database) still requires idempotent consumer logic.
At-least-once delivery with idempotent consumers is the practical standard. Exactly-once is expensive to implement end-to-end and often provides false confidence. Build consumers that check whether a message has already been processed (using a unique message ID and a processed-IDs store) — this gives you exactly-once behavior at the application level with at-least-once delivery at the broker level.
Popular Brokers
Apache Kafka: A distributed log optimized for high throughput and durability. Messages are written to partitioned, replicated logs and retained for a configurable period (not just until ack). Consumers maintain their own offset — they can replay old messages. Ideal for event streaming, audit logs, data pipelines, and cases where multiple independent consumers need the same event stream. Throughput: millions of messages/second.
RabbitMQ: A traditional message broker implementing AMQP. Excellent for task queues, routing, and complex message flow patterns (fan-out, topic routing, header-based routing). Messages are deleted after consumption. Better ergonomics for job queues; less suited for high-throughput event streaming.
Amazon SQS: Managed queue service. Standard queues offer at-least-once delivery with best-effort ordering. FIFO queues guarantee exactly-once delivery and strict ordering within a message group. Fully managed — no servers to operate. Pairs naturally with SNS for pub-sub fan-out: SNS topic fans out to multiple SQS queues.
Amazon SNS: Managed pub-sub service. A topic fans out to SQS queues, Lambda functions, HTTP endpoints, email, or SMS. Each subscriber receives every message. Used for notifications and fan-out.
Google Pub/Sub: Managed pub-sub with at-least-once delivery, global scale, and pull or push delivery modes. Integrates tightly with Google Cloud services.
Redis Streams: A lightweight append-only log within Redis. Suitable for moderate-throughput event streaming when you already run Redis and don’t need Kafka’s scale or durability guarantees.
Enterprise Service Bus
An Enterprise Service Bus (ESB) is a centralized integration middleware that routes, transforms, and orchestrates messages between services. It handles protocol translation (HTTP to AMQP), message transformation (XML to JSON), content-based routing, and workflow orchestration.
ESBs were dominant in the 2000s for enterprise SOA (Service-Oriented Architecture). Their central, monolithic nature became a bottleneck as services proliferated. Modern microservices architectures favor "smart endpoints, dumb pipes" — keeping brokers simple and putting routing and transformation logic in the services themselves or in lightweight API gateways. ESBs are still present in legacy enterprise environments but are largely replaced by Kafka, API gateways, and service meshes in greenfield systems.
Design Considerations
- Idempotent consumers are non-negotiable. With at-least-once delivery (the practical default), any message can arrive twice. Every consumer must handle duplicates safely — use unique message IDs and a deduplication store, or make operations naturally idempotent (SET instead of INCREMENT).
- Message schema evolution. Once messages are in production, consumers may be on older or newer versions of the schema. Use a schema registry (Confluent Schema Registry for Kafka, Protobuf, or Avro) and design schemas for backward and forward compatibility. Never remove required fields without a versioned migration.
- Monitor queue depth. A growing queue means consumers are falling behind. Alert before the queue grows so large that catch-up time exceeds your SLA. Queue depth is one of the most important operational metrics for async systems.
- Poison messages. A message that always causes consumer errors will be retried indefinitely, blocking the queue. Always configure a DLQ with a max-retry limit. Monitor the DLQ and alert on any messages there — they represent data that requires human investigation.
- Ordering. Standard queues and most pub-sub systems do not guarantee message order across multiple consumers. If ordering matters (e.g., user events must be processed in the order they occurred), use a single-partition topic per entity (e.g., one Kafka partition per user ID) or a FIFO queue.