Production Concerns

Service Discovery


In a microservices system, services need to find and communicate with each other. In a static environment with fixed IP addresses, you can hardcode endpoints. In a dynamic environment — where containers start and stop, services scale up and down, and instances are replaced after failures — hardcoded addresses don’t work. Service discovery is the mechanism by which services locate each other dynamically.

The Problem

A monolith calls a function. A microservice calls a network endpoint. That endpoint has an IP address and port that can change whenever a container restarts, the service scales, or an orchestrator reschedules the workload. Kubernetes might replace a pod on node 10.0.0.5 with a new pod on node 10.0.0.12, and the new pod gets a different IP address. An autoscaler might bring up 20 new instances of the Orders service during a traffic spike and tear them down an hour later.

Service discovery solves two sub-problems:

Registration: When a service instance starts, it registers its address and port with a central registry. When it stops, it deregisters.
Lookup: When Service A needs to call Service B, it queries the registry to find a healthy instance of B and gets its current address.
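The two sub-problems above can be sketched as a toy in-memory registry. This is an illustration only: the Instance record and the InMemoryRegistry class are hypothetical names, and a real registry would add replication, health checks, and network access.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Instance:
    """One running copy of a service (hypothetical record shape)."""
    service: str
    host: str
    port: int


@dataclass
class InMemoryRegistry:
    """Toy registry: maps a service name to its registered instances."""
    _instances: Dict[str, List[Instance]] = field(default_factory=dict)

    def register(self, instance: Instance) -> None:
        # Registration: called by an instance when it starts.
        bucket = self._instances.setdefault(instance.service, [])
        if instance not in bucket:
            bucket.append(instance)

    def deregister(self, instance: Instance) -> None:
        # Called by an instance when it stops.
        bucket = self._instances.get(instance.service, [])
        if instance in bucket:
            bucket.remove(instance)

    def lookup(self, service: str) -> List[Instance]:
        # Lookup: called by a client that wants to reach `service`.
        return list(self._instances.get(service, []))
```

A caller would register Instance("orders", "10.0.0.5", 8080) at startup and later call lookup("orders") to get the current list.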

Figure: client-side discovery (the caller queries the registry) versus server-side discovery (a load balancer queries the registry).

Service Registry

A service registry is a database of live service instances — their names, addresses, ports, health status, and metadata. It is a critical piece of infrastructure: if the registry is down, services can’t find each other.

Registries must be:

Highly available: Deployed as a cluster with replication. Consul and etcd use Raft for consensus; ZooKeeper uses Zab. No single point of failure.
Accurate: Whether the registry is strongly consistent (Consul, etcd, ZooKeeper) or eventually consistent (Eureka), stale entries (a dead instance still registered) cause failed requests. Registries use health checks and TTLs to expire stale entries automatically.
Low-latency reads: Every service call potentially involves a registry lookup. Clients typically cache results locally with a short TTL (5–30 seconds) to avoid a registry query on every request.
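The local caching mentioned in the last item might look like the sketch below. The fetch callable is a hypothetical stand-in for whatever call actually queries the registry, and the clock is injectable only so the behaviour can be checked without sleeping. Falling back to the last known list when the registry call fails is a design choice of this sketch, not something every client library does.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedLookup:
    """Caches registry answers per service for `ttl` seconds."""

    def __init__(self, fetch: Callable[[str], List[str]], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical function that queries the registry
        self._ttl = ttl
        self._clock = clock    # injectable so tests need not sleep
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def instances(self, service: str) -> List[str]:
        now = self._clock()
        cached = self._cache.get(service)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]          # fresh enough: no registry call
        try:
            fresh = self._fetch(service)
        except Exception:
            if cached is not None:
                return cached[1]      # registry unreachable: serve the stale list
            raise                     # nothing cached yet: surface the failure
        self._cache[service] = (now, fresh)
        return fresh
```

This version refreshes on the calling thread when an entry expires; a production client would more likely refresh in the background so that no request waits on the registry.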

Popular service registries: Consul (HashiCorp; also provides health checking, a KV store, and a service mesh), etcd (used internally by Kubernetes), ZooKeeper (older, widely used in the Kafka ecosystem), and Eureka (Netflix; eventually consistent, designed for AWS).

Client-Side Discovery

In client-side discovery, the calling service (the client) is responsible for finding its target:

1. The client queries the registry: “Give me all healthy instances of OrderService.”
2. The registry returns a list of addresses.
3. The client picks one (using round-robin, least-connections, or random selection) and sends the request directly.
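As a rough illustration of the three steps above, here is a minimal round-robin picker. The lookup callable is an assumption standing in for a real registry client, and addresses are plain "host:port" strings for brevity.

```python
import itertools
from typing import Callable, Dict, Iterator, List


class RoundRobinClient:
    """Client-side discovery: query the registry, then pick an instance locally."""

    def __init__(self, lookup: Callable[[str], List[str]]) -> None:
        self._lookup = lookup    # hypothetical registry query function
        self._counters: Dict[str, Iterator[int]] = {}

    def pick(self, service: str) -> str:
        instances = self._lookup(service)    # ask the registry for healthy instances
        if not instances:
            raise LookupError(f"no healthy instances of {service}")
        counter = self._counters.setdefault(service, itertools.count())
        # Rotate through the list; the caller then sends the request to this address.
        return instances[next(counter) % len(instances)]
```

Because the list can change between calls, the rotation is only approximately even, which is usually acceptable for round-robin.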

Advantages: The client controls load balancing — it can implement sophisticated routing (latency-based, zone-aware). No extra network hop through a load balancer.

Disadvantages: Every client must implement registry integration and load balancing logic. Changing the load balancing strategy requires deploying new client code. In a polyglot system (Go, Java, Python, Node.js services), each language needs its own registry client library.

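Netflix Ribbon (for Java/Spring) is a classic client-side discovery library. It integrates with Eureka and handles load balancing and retries on the client side; circuit breaking in that stack came from a separate Netflix library, Hystrix. Both libraries are now in maintenance mode.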

Server-Side Discovery

In server-side discovery, a router (load balancer or API gateway) sits between the client and the service. The client sends requests to the router; the router queries the registry and forwards to a healthy instance.

1. The client sends a request to a stable address (the router).
2. The router queries the registry for healthy instances of the target service.
3. The router picks an instance and forwards the request.
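The same flow can be sketched from the router's side. Everything here is illustrative: lookup stands in for a registry query, forward for the actual proxying of the request, and requests and responses are plain strings.

```python
import random
from typing import Callable, List


class Router:
    """Stands in for a load balancer or API gateway doing server-side discovery."""

    def __init__(self,
                 lookup: Callable[[str], List[str]],
                 forward: Callable[[str, str], str],
                 choose: Callable[[List[str]], str] = random.choice) -> None:
        self._lookup = lookup      # hypothetical registry query
        self._forward = forward    # hypothetical function that proxies the request
        self._choose = choose      # load balancing policy lives here, not in clients

    def handle(self, service: str, request: str) -> str:
        instances = self._lookup(service)
        if not instances:
            return "503 no healthy upstream"
        target = self._choose(instances)
        return self._forward(target, request)
```

Swapping the choose function changes the load balancing strategy for every caller at once, which is the centralization benefit this model offers.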

Advantages: Clients need no knowledge of the registry or load balancing. Any client that can make an HTTP call works — no special libraries. Load balancing strategy is centralized and can be changed without touching client code.

Disadvantages: The router is a critical-path component that must be highly available. Every request takes an extra network hop.

AWS ALB / ECS service discovery, Traefik with Consul, and nginx with dynamic upstream configuration are all server-side discovery implementations.

DNS-Based Discovery

DNS is the internet’s original service discovery system. Each service gets a DNS name; the DNS server returns the addresses of the registered instances as A records (AAAA for IPv6). Clients use standard DNS lookups, so no special library is needed.

DNS-based discovery is simple but has limitations:

TTL and caching: DNS responses are cached by resolvers and OS libraries for the duration of the TTL, and some client runtimes cache for longer than that. If an instance fails, its address remains in cache until the entry expires. Short TTLs (1–5 seconds) reduce the stale window but increase DNS query load.
No health-aware load balancing: Plain DNS can rotate the order of A records, but it knows nothing about instance health or current load. Unhealthy instances remain in rotation until something removes them from DNS.
No port information: A records carry IP addresses, not ports. SRV records carry ports, but many clients and HTTP libraries do not query them.
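A lookup through the system resolver can be sketched with Python's standard socket module. The function name resolve_ipv4 is made up for this example. Note that the port is supplied by the caller rather than learned from DNS, which illustrates the last limitation above.

```python
import socket
from typing import List


def resolve_ipv4(hostname: str, port: int = 80) -> List[str]:
    """Return the distinct IPv4 addresses the system resolver gives for `hostname`."""
    infos = socket.getaddrinfo(hostname, port,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is an (ip, port) pair.
    addresses: List[str] = []
    for *_, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

The result reflects whatever the resolver and its caches currently hold, so a recently failed instance may still appear in the list.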

Despite these limitations, DNS-based discovery is widely used — it’s the basis of Kubernetes service discovery and cloud load balancer DNS names (AWS ALB, GCP Load Balancer).

Kubernetes Service Discovery

Kubernetes has built-in service discovery. When you create a Service object, Kubernetes assigns it a stable ClusterIP and a DNS name of the form <service-name>.<namespace>.svc.cluster.local (cluster.local is the default cluster domain). kube-proxy maintains iptables or IPVS rules that route traffic sent to the ClusterIP to ready pods behind the Service.

Registration is automatic: Kubernetes tracks the pods that match the Service’s label selector and removes them when they terminate. Readiness probes gate registration, so a pod isn’t added to the Service’s endpoints until its readiness probe passes, and it is removed again if the probe starts failing. This prevents traffic from being sent to pods that haven’t finished initializing.
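On the application side, a readiness endpoint can be very small. The sketch below uses only the standard library; the /ready path, the ready flag, and the function names are assumptions for illustration, and the probe path in a real deployment is whatever the pod spec configures.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()    # set once warm-up (cache loading, etc.) has finished


def probe_status(path: str, is_ready: bool) -> int:
    """Map a probe request to an HTTP status code."""
    if path != "/ready":     # hypothetical probe path
        return 404
    return 200 if is_ready else 503


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, ready.is_set()))
        self.end_headers()

    def log_message(self, format, *args) -> None:
        pass    # keep probe traffic out of stderr


def start_probe_server(port: int = 8081) -> HTTPServer:
    """Serve probes on a background thread, listening on all interfaces."""
    server = HTTPServer(("", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call start_probe_server() early and ready.set() once initialization completes; until then the probe returns 503 and the pod receives no Service traffic.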

For more advanced needs (traffic splitting, mTLS, circuit breaking), a service mesh (Istio, Linkerd) adds a sidecar proxy to each pod. The sidecar handles service discovery, load balancing, retries, and observability transparently — without changing application code.

Health Checks and Deregistration

A registry is only useful if it reflects reality. Stale entries — instances that are registered but unhealthy — cause failed requests. Health checks keep the registry accurate:

Active health checks (pull): The registry or load balancer periodically calls each instance’s health endpoint (GET /health) and removes instances that fail. Consul, ALB, and most load balancers use this model.
Heartbeats (push): Each service instance periodically sends a heartbeat to the registry. If the registry doesn’t receive a heartbeat within a TTL, it marks the instance as dead and deregisters it. Eureka uses this model.
Graceful deregistration: On shutdown, a service deregisters itself before it stops, then finishes its in-flight requests. Deregistering first closes the health check gap (the window between shutdown and the next failed health check), during which the registry would otherwise keep routing requests to a dead instance.
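The heartbeat model can be sketched as a registry that remembers when it last heard from each instance. The class and method names are hypothetical, the 30-second TTL is an arbitrary example value, and the injectable clock exists only to make the expiry logic easy to exercise.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Push model: instances renew a lease; silent instances expire."""

    def __init__(self, ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._last_seen: Dict[str, float] = {}    # address -> time of last heartbeat

    def heartbeat(self, address: str) -> None:
        # The first heartbeat registers the instance; later ones renew its lease.
        self._last_seen[address] = self._clock()

    def deregister(self, address: str) -> None:
        # Graceful path: the instance removes itself on shutdown.
        self._last_seen.pop(address, None)

    def live_instances(self) -> List[str]:
        now = self._clock()
        expired = [a for a, seen in self._last_seen.items() if now - seen > self._ttl]
        for address in expired:
            del self._last_seen[address]    # no heartbeat within the TTL: drop it
        return sorted(self._last_seen)
```

An instance that crashes without deregistering stays listed for up to one TTL, which is the stale window that graceful deregistration avoids.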

⚠️ The Registration-Health-Check Gap

When a new instance starts, it may register immediately even though it is not healthy yet (still warming up, loading caches). When an instance stops, it may already be deregistered while clients with cached instance lists keep sending it requests and earlier requests are still in flight. Use readiness probes to delay registration until the instance is ready. On shutdown, deregister first, keep serving for a few seconds so that connections drain, and only then exit.
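A shutdown sequence along these lines is sketched below. It assumes the order deregister, drain, stop, so that new traffic is cut off before draining begins. The three callables are hypothetical hooks the application would supply, and the timeout values are arbitrary.

```python
import time
from typing import Callable


def graceful_shutdown(deregister: Callable[[], None],
                      in_flight: Callable[[], int],
                      stop: Callable[[], None],
                      drain_timeout: float = 10.0,
                      poll_interval: float = 0.1,
                      clock: Callable[[], float] = time.monotonic,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Deregister, wait for in-flight requests, then stop.

    Returns True if every in-flight request finished before the timeout.
    """
    deregister()                              # 1. stop attracting new traffic
    deadline = clock() + drain_timeout
    while in_flight() > 0 and clock() < deadline:
        sleep(poll_interval)                  # 2. let in-flight requests finish
    drained = in_flight() == 0
    stop()                                    # 3. shut the server down
    return drained
```

In a container, this would typically run from a SIGTERM handler, with drain_timeout kept below the orchestrator's termination grace period.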

Design Considerations

Cache registry responses locally. Querying the registry on every request adds latency and load. Cache the list of instances with a short TTL (5–30 seconds). Refresh in the background; use the cached value for serving requests. A slightly stale list is better than a blocking registry call on the hot path.
Handle registry unavailability. If the registry is unreachable, clients should serve requests using their cached instance list rather than failing immediately. A stale-but-working list is better than no list.
Use Kubernetes-native discovery when on Kubernetes. Kubernetes Service objects plus DNS are sufficient for most workloads. Add a service mesh (Istio, Linkerd) if you need mTLS, traffic splitting, or detailed per-service metrics. Don’t deploy a separate Consul cluster when Kubernetes already solves the problem.
Zone-aware routing. In multi-availability-zone deployments, prefer routing to instances in the same zone to minimize cross-zone latency and data transfer costs. Most service meshes and cloud load balancers support zone-aware routing.
Service naming conventions. Establish a consistent naming scheme for services (e.g., <team>-<service>-<env>) and enforce it. Inconsistent names make it hard to find what you’re looking for in the registry and correlate metrics, logs, and traces across services.
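To make the zone-aware routing item concrete, here is one simple policy: prefer instances in the caller's zone and fall back to any instance when the local zone has none. The ZonedInstance record, the zone labels, and the function name are illustrative; real meshes and load balancers apply more nuanced weighting.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ZonedInstance:
    """An instance address tagged with its availability zone (hypothetical shape)."""
    address: str
    zone: str


def pick_zone_aware(instances: Sequence[ZonedInstance],
                    local_zone: str,
                    choose: Callable[[List[ZonedInstance]], ZonedInstance] = random.choice
                    ) -> ZonedInstance:
    """Prefer an instance in `local_zone`; otherwise use any available instance."""
    if not instances:
        raise LookupError("no instances available")
    same_zone = [i for i in instances if i.zone == local_zone]
    # Fall back to the full list so a zone without instances does not cause an outage.
    return choose(same_zone or list(instances))
```

A strict same-zone preference can overload a zone that has few instances, so some systems spill a share of traffic to other zones once local capacity is exceeded.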