Production Concerns

Service Discovery


In a microservices system, services need to find and communicate with each other. In a static environment with fixed IP addresses, you can hardcode endpoints. In a dynamic environment — where containers start and stop, services scale up and down, and instances are replaced after failures — hardcoded addresses don’t work. Service discovery is the mechanism by which services locate each other dynamically.

The Problem

A monolith calls a function. A microservice calls a network endpoint. That endpoint has an IP address and port that can change every time a container restarts, scales, or is rescheduled by an orchestrator. Kubernetes might reschedule a pod from the node at 10.0.0.5 to the node at 10.0.0.12, where the replacement pod gets a new IP address. An autoscaler might bring up 20 new instances of the Orders service during a traffic spike and tear them down an hour later.

Service discovery solves two sub-problems:

[Figure: client-side discovery (the caller queries the registry) vs. server-side discovery (the load balancer queries the registry)]

Service Registry

A service registry is a database of live service instances — their names, addresses, ports, health status, and metadata. It is a critical piece of infrastructure: if the registry is down, services can’t find each other.

Registries must be:

Popular service registries: Consul (HashiCorp; also provides health checking, a KV store, and a service mesh), etcd (used internally by Kubernetes), ZooKeeper (older, widely used in the Kafka ecosystem), and Eureka (Netflix; eventually consistent, designed for AWS).
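To make the idea of "a database of live service instances" concrete, here is a minimal in-memory sketch. It is not how Consul, etcd, ZooKeeper, or Eureka are implemented; the class name `ServiceRegistry`, its methods, and the heartbeat TTL mechanism are assumptions chosen for illustration. Real registries also replicate this state across several nodes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Instance:
    service: str
    host: str
    port: int
    metadata: dict = field(default_factory=dict)
    last_heartbeat: float = 0.0


class ServiceRegistry:
    """Toy single-process registry (hypothetical API, for illustration only)."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._instances = {}  # (service, host, port) -> Instance

    def register(self, service, host, port, metadata=None):
        inst = Instance(service, host, port, metadata or {}, self._clock())
        self._instances[(service, host, port)] = inst
        return inst

    def heartbeat(self, service, host, port):
        """Refresh an instance's liveness; returns False if it is not registered."""
        inst = self._instances.get((service, host, port))
        if inst is None:
            return False
        inst.last_heartbeat = self._clock()
        return True

    def deregister(self, service, host, port):
        self._instances.pop((service, host, port), None)

    def lookup(self, service):
        """Return (host, port) pairs whose last heartbeat is within the TTL."""
        now = self._clock()
        return [
            (i.host, i.port)
            for i in self._instances.values()
            if i.service == service and now - i.last_heartbeat <= self._ttl
        ]
```

An instance that stops sending heartbeats simply drops out of `lookup` results once its TTL lapses, which is one common way to keep stale entries from being handed to callers.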

Client-Side Discovery

In client-side discovery, the calling service (the client) is responsible for finding its target:

  1. The client queries the registry: “Give me all healthy instances of OrderService.”
  2. The registry returns a list of addresses.
  3. The client picks one (using round-robin, least-connections, or random selection) and sends the request directly.
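The three steps above can be sketched in a few lines. This is an illustrative outline, not a specific library's API: `RoundRobinClient` and the `lookup` callable (which stands in for a registry query) are hypothetical names.

```python
import itertools


class RoundRobinClient:
    """Client-side discovery sketch: query the registry, then pick an instance."""

    def __init__(self, lookup, service):
        self._lookup = lookup    # callable: service name -> list of (host, port)
        self._service = service
        self._counter = itertools.count()

    def choose(self):
        # Steps 1-2: ask the registry for the current healthy instances.
        instances = self._lookup(self._service)
        if not instances:
            raise LookupError(f"no healthy instances of {self._service}")
        # Step 3: rotate through whatever list the registry returned this time.
        return instances[next(self._counter) % len(instances)]
```

Because the list is fetched on every call, instances that appear or disappear in the registry are picked up on the next request. A production client would usually cache the list briefly rather than query the registry each time.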

Advantages: The client controls load balancing — it can implement sophisticated routing (latency-based, zone-aware). No extra network hop through a load balancer.

Disadvantages: Every client must implement registry integration and load balancing logic. Changing the load balancing strategy requires deploying new client code. In a polyglot system (Go, Java, Python, Node.js services), each language needs its own registry client library.

Netflix Ribbon (for Java/Spring) is a classic client-side load-balancing library. It integrates with Eureka for discovery and handles load balancing and retries in the client; circuit breaking in the same stack is typically provided by a separate library, Hystrix.

Server-Side Discovery

In server-side discovery, a router (load balancer or API gateway) sits between the client and the service. The client sends requests to the router; the router queries the registry and forwards to a healthy instance.

  1. The client sends a request to a stable address (the router).
  2. The router queries the registry for healthy instances of the target service.
  3. The router picks an instance and forwards the request.
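Seen from the router's side, the same flow might look like the sketch below. The `Router` class and its `lookup` and `forward` callables are hypothetical stand-ins for a registry query and a real proxying step; random selection is just one possible strategy.

```python
import random


class Router:
    """Server-side discovery sketch: clients call the router, never the registry."""

    def __init__(self, lookup, forward, rng=None):
        self._lookup = lookup      # callable: service name -> list of (host, port)
        self._forward = forward    # callable: ((host, port), request) -> response
        self._rng = rng or random.Random()

    def handle(self, service, request):
        # Step 2: the router, not the client, queries the registry.
        instances = self._lookup(service)
        if not instances:
            return (503, f"no healthy instances of {service}")
        # Step 3: pick an instance and forward the request to it.
        target = self._rng.choice(instances)
        return self._forward(target, request)
```

Swapping `self._rng.choice` for another strategy changes load balancing for every caller at once, which is the centralization benefit this approach offers.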

Advantages: Clients need no knowledge of the registry or load balancing. Any client that can make an HTTP call works — no special libraries. Load balancing strategy is centralized and can be changed without touching client code.

Disadvantages: The router is a critical-path component that must be highly available. An extra network hop for every request.

AWS ALB / ECS service discovery, Traefik with Consul, and nginx with dynamic upstream configuration are all server-side discovery implementations.

DNS-Based Discovery

DNS is the internet’s original service discovery system. Each service gets a DNS name; the DNS server returns the addresses of healthy instances as A records. Clients use standard DNS lookups — no special library needed.
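A standard lookup of this kind needs nothing beyond the language's socket library. The sketch below uses Python's `socket.getaddrinfo`; the function name `resolve_instances` and the `resolver` parameter (there only so the lookup can be swapped out in tests) are assumptions for illustration.

```python
import socket


def resolve_instances(hostname, port, resolver=socket.getaddrinfo):
    """Return the unique IPv4 addresses (A records) behind a DNS name."""
    infos = resolver(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # sockaddr is (ip, port) for AF_INET
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

The caller still has to choose one of the returned addresses, and the result is only as fresh as whatever the resolvers along the way have cached.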

DNS-based discovery is simple but has limitations:

Despite these limitations, DNS-based discovery is widely used — it’s the basis of Kubernetes service discovery and cloud load balancer DNS names (AWS ALB, GCP Load Balancer).

Kubernetes Service Discovery

Kubernetes has built-in service discovery. When you create a Service object, Kubernetes assigns it a stable ClusterIP and a DNS name (service-name.namespace.svc.cluster.local). kube-proxy maintains iptables or IPVS rules to route traffic to healthy pods behind the service.
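The DNS name follows a fixed pattern, so a caller can build it from the Service name and namespace. A small sketch, assuming the default cluster domain `cluster.local` (clusters can be configured with a different one); the helper name `service_dns_name` and the example service names are hypothetical.

```python
def service_dns_name(service, namespace="default", cluster_domain="cluster.local"):
    """Build the fully qualified in-cluster DNS name for a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


# Example: an "orders" Service in a "shop" namespace (both names are made up).
orders_url = f"http://{service_dns_name('orders', 'shop')}:8080/health"
```

Callers use this stable name and never need to know which pod IPs currently sit behind it.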

Pods are automatically registered when they start and deregistered when they stop. Readiness probes gate registration — a pod isn’t added to the service until its readiness probe passes. This prevents traffic from being sent to pods that haven’t finished initializing.

For more advanced needs (traffic splitting, mTLS, circuit breaking), a service mesh (Istio, Linkerd) adds a sidecar proxy to each pod. The sidecar handles service discovery, load balancing, retries, and observability transparently — without changing application code.

Health Checks and Deregistration

A registry is only useful if it reflects reality. Stale entries — instances that are registered but unhealthy — cause failed requests. Health checks keep the registry accurate:

⚠️
The Registration-Health-Check Gap

When a new instance starts, it may register immediately but not be ready to serve yet (still warming up, loading caches). When an instance stops, it may deregister while in-flight requests are still being served, or while callers with a stale view keep sending new ones. Use readiness probes to delay registration until the instance is ready. On shutdown, deregister first, then keep serving for a few seconds to drain in-flight requests before exiting.
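One way to close both halves of this gap is sketched below, under stated assumptions: the instance registers only after a readiness check passes, and on shutdown it deregisters first so no new traffic is routed to it, then waits for in-flight requests to finish. The function names and the `is_ready`, `register`, `deregister`, and `in_flight` callables are hypothetical hooks, not part of any particular framework.

```python
import time


def start_instance(is_ready, register, sleep=time.sleep,
                   poll_interval=0.5, max_checks=60):
    """Register only once the readiness check passes; return True if registered."""
    for _ in range(max_checks):
        if is_ready():
            register()
            return True
        sleep(poll_interval)
    return False  # never became ready, so it was never advertised


def stop_instance(deregister, in_flight, sleep=time.sleep,
                  poll_interval=0.5, max_checks=10):
    """Deregister first, then drain; return True if all requests finished."""
    deregister()  # stop new traffic from being routed here
    for _ in range(max_checks):
        if in_flight() == 0:
            return True
        sleep(poll_interval)
    return in_flight() == 0  # drain window elapsed; caller decides whether to force exit
```

The bounded `max_checks` keeps a stuck request from blocking shutdown forever; how long to wait is a judgment call that depends on typical request duration.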

Design Considerations