SLA, SLO & SLI

SLA, SLO, and SLI are the vocabulary of reliability engineering. They give teams a shared, measurable language for discussing how well a system is performing, what targets to aim for, and what the consequences of missing those targets are. Without them, “the system is reliable” is an opinion. With them, reliability is a number you can measure, track, and act on.

SLI, SLO, and SLA Defined

| Term | Full name | What it is | Example |
| --- | --- | --- | --- |
| SLI | Service Level Indicator | A metric that measures service behavior | 99.2% of requests succeeded in the past 7 days |
| SLO | Service Level Objective | The target value for an SLI | 99.9% of requests should succeed per 30-day window |
| SLA | Service Level Agreement | A contract with external consequences for missing an SLO | If availability drops below 99.5%, customers receive service credits |

The relationship: SLIs are what you measure. SLOs are what you aim for. SLAs are what you’re held to. An SLO should be stricter than the corresponding SLA, so that you catch problems before they breach the contract.
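The relationship can be sketched in a few lines of Python. This is a minimal illustration, not a standard API: the function names `availability_sli` and `classify` are hypothetical, and the 99.9% and 99.5% thresholds are taken from the example column of the table above.

```python
def availability_sli(successes: int, total: int) -> float:
    """Fraction of requests that succeeded: this is the SLI."""
    if total == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successes / total


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA intact"
    return "SLA breached"


sli = availability_sli(successes=998_500, total=1_000_000)
print(classify(sli, slo=0.999, sla=0.995))  # SLO missed, SLA intact
```

The gap between the two thresholds is what gives the team room to react: a result in the middle band is an internal signal, not yet a contractual problem.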

Defining Good SLIs

Not every metric is a useful SLI. Good SLIs directly measure what users experience. The four golden signals (latency, traffic, errors, and saturation) are the canonical starting point.

SLIs should be measured from the user’s perspective, not the infrastructure’s. Server CPU at 80% isn’t an SLI — it’s a symptom. Request error rate is an SLI.

Measure at the Right Point

Measure SLIs as close to the user as possible: at the load balancer or API gateway, not inside the service. A request that the service never saw (because the load balancer returned 502) still counts as a failed request from the user’s perspective. Metrics collected inside the service miss these cases.
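The difference between the two measurement points can be shown with a small simulation. The sketch below uses made-up data and a hypothetical `EdgeLogEntry` record; it also assumes that only 5xx responses count as failures, which is one common convention rather than a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeLogEntry:
    status: int            # status code returned to the client
    reached_service: bool  # False if the load balancer answered on its own


def success_ratio(statuses: list[int]) -> float:
    """Share of responses that are not 5xx (assumed definition of 'good')."""
    if not statuses:
        return 1.0
    good = sum(1 for s in statuses if s < 500)
    return good / len(statuses)


# Made-up traffic: 97 successes, 1 service error, 2 load-balancer 502s.
edge_log = (
    [EdgeLogEntry(200, True)] * 97
    + [EdgeLogEntry(500, True)] * 1
    + [EdgeLogEntry(502, False)] * 2
)

edge_view = success_ratio([e.status for e in edge_log])
service_view = success_ratio([e.status for e in edge_log if e.reached_service])

print(f"measured at the edge:   {edge_view:.4f}")     # 0.9700
print(f"measured in the service: {service_view:.4f}")  # 0.9898
```

The service-side number looks better only because the two requests that never arrived are invisible to it.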

Setting SLOs

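An SLO is a target for an SLI over a measurement window. A well-formed SLO specifies the SLI being measured, the target value, and the measurement window.

An example of a well-formed SLO: 99.9% of requests should succeed per 30-day window.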
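One way to keep SLOs well-formed is to represent them as data, so that an SLI, a target, and a window are always stated together. The `SLO` class below is a hypothetical sketch, not part of any library, and the field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    sli: str          # what is measured, e.g. "successful requests"
    target: float     # required fraction of good events, e.g. 0.999
    window_days: int  # length of the measurement window

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")

    def is_met(self, good: int, total: int) -> bool:
        """True if the observed ratio reaches the target (no traffic counts as met)."""
        if total == 0:
            return True
        return good / total >= self.target

    def describe(self) -> str:
        return f"{self.sli}: {self.target:.2%} over {self.window_days} days"


availability = SLO(sli="successful requests", target=0.999, window_days=30)
print(availability.describe())             # successful requests: 99.90% over 30 days
print(availability.is_met(9_989, 10_000))  # False
```

Rejecting a target of exactly 1.0 is a deliberate choice in this sketch: a 100% target leaves no room for any failure at all.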

Setting the right target: Start with current performance, not aspirational numbers. If you’re currently at 99.5% availability, set the SLO at 99.5% (or just below it). An SLO you’re already meeting is a baseline; you can raise it as you improve reliability. An aspirational SLO you never meet creates alert fatigue and ends up ignored.

Not everything needs a 99.99% SLO. A developer dashboard doesn’t need the same reliability target as a payment API. Higher SLOs cost more to build and operate. Match the reliability target to the business criticality of the service.

SLAs and Consequences

An SLA is a legal or business agreement with defined consequences for missing the target. Consequences are typically financial: service credits, refunds, or contract penalties. SLAs are external-facing — they apply to customers or partners, not internal teams.

SLAs should be set conservatively — looser than your internal SLOs. If your internal SLO is 99.9% availability and you hit 99.85% one month, you want that to be an internal miss that drives improvement, not an SLA breach that triggers customer credits.

A common structure:
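One widely used structure is a tiered credit schedule, where the credit grows as measured availability falls further below the committed level. The sketch below is illustrative only: the thresholds and credit percentages are hypothetical and not taken from any real contract, and `service_credit` is an invented helper name.

```python
# Hypothetical schedule: (availability floor, credit as a fraction of the monthly bill).
# Ordered from "no breach" down to the worst tier.
CREDIT_TIERS = [
    (0.995, 0.00),  # at or above the SLA: no credit owed
    (0.990, 0.10),
    (0.950, 0.25),
]
WORST_CASE_CREDIT = 0.50  # hypothetical credit below the lowest floor


def service_credit(availability: float) -> float:
    """Return the credit fraction owed for a month's measured availability."""
    for floor, credit in CREDIT_TIERS:
        if availability >= floor:
            return credit
    return WORST_CASE_CREDIT


for measured in (0.9985, 0.993, 0.97, 0.90):
    print(f"{measured:.2%} availability -> {service_credit(measured):.0%} credit")
```

With a 99.9% internal SLO, the first case (99.85%) is an internal miss but owes no credit, which is the buffer a looser SLA is meant to provide.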

Error Budgets

An error budget is the amount of unreliability permitted before breaching an SLO. It reframes reliability as a resource to be allocated, not just a floor to stay above.

If your SLO is 99.9% availability over 30 days:

Total requests in 30 days: 10,000,000
Allowed failures: 10,000,000 × (1 - 0.999) = 10,000 requests
Error budget: 10,000 failed requests (or about 43.2 minutes of full downtime, which is 0.1% of 30 days)

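The error budget creates a shared incentive between reliability and velocity: while budget remains, the team can spend it on releases and experiments; once it is exhausted, the focus shifts to reliability work until the budget recovers.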

Error budgets resolve the classic tension between dev (wants to ship fast) and ops (wants stability). Both teams agree to the SLO upfront; the error budget makes the trade-off explicit and data-driven.
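The arithmetic and the resulting policy can be sketched as follows. The function names and the 25% "caution" threshold are assumptions made for illustration; real error budget policies are agreed per team.

```python
def error_budget(total_requests: int, slo_target: float) -> int:
    """Number of requests allowed to fail in the window."""
    return round(total_requests * (1 - slo_target))


def downtime_budget_minutes(window_days: int, slo_target: float) -> float:
    """The same budget expressed as minutes of full downtime."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the budget still unspent; negative once the SLO is breached."""
    budget = error_budget(total_requests, slo_target)
    if budget == 0:
        return 0.0
    return 1 - failed_requests / budget


def release_decision(remaining: float, caution_below: float = 0.25) -> str:
    """Hypothetical policy: freeze when the budget is gone, slow down when it is low."""
    if remaining <= 0:
        return "freeze"
    if remaining < caution_below:
        return "caution"
    return "ship"


print(error_budget(10_000_000, 0.999))                # 10000
print(round(downtime_budget_minutes(30, 0.999), 1))   # 43.2
remaining = budget_remaining(10_000_000, failed_requests=4_000, slo_target=0.999)
print(f"{remaining:.0%} left -> {release_decision(remaining)}")  # 60% left -> ship
```

Because the decision is a function of measured data, neither side has to argue from opinion about whether a release is safe.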

Availability Math

Availability is typically expressed in “nines.” The number of nines determines how much downtime is tolerated per year:

| Availability | Downtime per year | Downtime per month | Downtime per week |
| --- | --- | --- | --- |
| 99% (two nines) | 87.6 hours | 7.3 hours | 1.68 hours |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1 minute |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6 seconds |
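The figures in the table follow from multiplying the length of the period by the tolerated failure fraction. A minimal sketch, assuming a 365-day year and an average month of one twelfth of a year (730 hours); the helper name `allowed_downtime_minutes` is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

PERIOD_HOURS = {
    "year": HOURS_PER_YEAR,
    "month": HOURS_PER_YEAR / 12,  # average month, 730 hours
    "week": 7 * 24,
}


def allowed_downtime_minutes(availability: float, period: str) -> float:
    """Minutes of downtime tolerated per period at the given availability."""
    return PERIOD_HOURS[period] * 60 * (1 - availability)


for availability in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    row = ", ".join(
        f"{period}: {allowed_downtime_minutes(availability, period):.2f} min"
        for period in PERIOD_HOURS
    )
    print(f"{availability:.3%} -> {row}")
```

Each additional nine divides the tolerated downtime by ten, which is why the cost of reliability rises so steeply at the high end.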

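Availability of dependent systems multiplies: If Service A depends on Services B and C, and failures are independent, A’s end-to-end availability is at most the product of its own availability and the availabilities of B and C. Two 99.9% dependencies alone give a combined availability of 99.9% × 99.9% ≈ 99.8%; factoring in A’s own 99.9% lowers the ceiling to about 99.7%. Systems with many dependencies need each component to be significantly more reliable than the end-to-end target.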
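The multiplication, and its inverse (how reliable each component must be to reach a given end-to-end target), can be sketched as follows. This assumes every dependency is required for a request to succeed and that failures are independent; both function names are hypothetical.

```python
import math


def serial_availability(*components: float) -> float:
    """Upper bound on end-to-end availability when every component is required."""
    return math.prod(components)


def required_component_availability(target: float, n_components: int) -> float:
    """Availability each of n equally reliable components needs to hit the target."""
    return target ** (1 / n_components)


two_deps = serial_availability(0.999, 0.999)
with_service = serial_availability(0.999, 0.999, 0.999)
per_component = required_component_availability(0.999, 3)

print(f"two dependencies:       {two_deps:.4%}")      # 99.8001%
print(f"service + dependencies: {with_service:.4%}")  # 99.7003%
print(f"needed per component:   {per_component:.4%}")
```

To reach 99.9% end to end with three required components, each one has to run at roughly 99.97%, noticeably stricter than the target itself.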

Design Considerations