SLA, SLO & SLI

SLA, SLO, and SLI are the vocabulary of reliability engineering. They give teams a shared, measurable language for discussing how well a system is performing, what targets to aim for, and what the consequences of missing those targets are. Without them, “the system is reliable” is an opinion. With them, reliability is a number you can measure, track, and act on.

SLI, SLO, and SLA Defined

Term	Full name	What it is	Example
SLI	Service Level Indicator	A metric that measures service behavior	99.2% of requests succeeded in the past 7 days
SLO	Service Level Objective	The target value for an SLI	99.9% of requests should succeed per 30-day window
SLA	Service Level Agreement	A contract with external consequences for missing an SLO	If availability drops below 99.5%, customers receive service credits

The relationship: SLIs are what you measure. SLOs are what you aim for. SLAs are what you’re held to. The internal SLO should be stricter than the SLA, so that you catch problems before they breach the contract.

Defining Good SLIs

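Not every metric is a useful SLI. Good SLIs directly measure what users experience. The four golden signals of monitoring (latency, traffic, errors, and saturation) are the canonical starting point; the categories below adapt them to request-serving systems: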

Availability: The proportion of requests that succeeded. Usually measured as successful_requests / total_requests. What counts as “successful” is domain-specific (HTTP 2xx, no application-level error code, etc.).
Latency: How long requests take. Typically expressed as percentiles: p50 (median), p95, p99. Tail percentiles such as p99 matter more than the average, because averages hide the worst experiences. A system where 99% of requests take 50ms but 1% take 10 seconds has a mean latency of about 150ms, which looks healthy even though one request in a hundred takes 10 seconds.
Error rate: The proportion of requests returning errors. Overlaps with availability but can be measured separately for different error types (5xx vs 4xx vs timeout).
Throughput / saturation: How much traffic the system handles and how much of its capacity is in use (CPU, memory, queue depth, connection pool utilization). Saturation is a leading indicator: it predicts future failures before they cause user impact.

SLIs should be measured from the user’s perspective, not the infrastructure’s. Server CPU at 80% isn’t an SLI by itself: it’s an internal cause that may or may not affect users. Request error rate is an SLI.
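As a minimal sketch of how the availability and latency SLIs above could be computed from raw request records: the `Request` type, the rule that any non-5xx status counts as success, and the sample data are all assumptions made for illustration, and the percentile uses the simple nearest-rank method rather than whatever a particular monitoring system implements.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int         # HTTP status code seen by the client
    latency_ms: float   # time to complete the request, in milliseconds


def availability(requests: list[Request]) -> float:
    """Fraction of requests that succeeded (assumption: any non-5xx status)."""
    if not requests:
        return 1.0  # assumption: no traffic means no failures
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests plus 2 slow ones, one of which failed.
sample = [Request(200, 50.0)] * 98 + [Request(200, 10_000.0), Request(503, 10_000.0)]
latencies = [r.latency_ms for r in sample]

print(f"availability: {availability(sample):.2%}")
print(f"mean latency: {sum(latencies) / len(latencies):.0f} ms")
print(f"p50: {percentile(latencies, 50):.0f} ms, p99: {percentile(latencies, 99):.0f} ms")
```

In this sample the median stays at 50 ms while the p99 jumps to 10 seconds, which is the gap between typical and worst-case experience that percentile-based SLIs are meant to expose.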

Measure at the Right Point

Measure SLIs as close to the user as possible — at the load balancer or API gateway, not inside the service. A request that the service never saw (because the load balancer returned 502) still counts as a failed request from the user’s perspective. Infrastructure-level metrics measured inside the service miss these cases.

Setting SLOs

An SLO is a target for an SLI over a measurement window. A well-formed SLO specifies:

The SLI being measured
The target value (percentage, threshold, or ratio)
The measurement window (a rolling 30-day window is a common choice)

Examples of well-formed SLOs:

99.9% of HTTP requests to /api/orders return a 2xx response, measured over a rolling 30-day window.
95% of requests to /api/search complete within 200ms, measured over a rolling 7-day window.
99% of batch jobs complete within 4 hours of their scheduled start time, measured over a rolling 30-day window.
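The three parts of a well-formed SLO (indicator, target, window) map naturally onto a small data structure. The sketch below is one possible shape, not a standard API: the `SLO` class, its field names, and the event counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str           # which SLI this objective applies to
    target: float       # required fraction of good events, e.g. 0.999
    window_days: int    # length of the rolling measurement window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """True if the observed good/total ratio meets the target."""
        if total_events == 0:
            return True  # assumption: an idle window counts as compliant
        return good_events / total_events >= self.target


orders_availability = SLO("orders-availability", target=0.999, window_days=30)
search_latency = SLO("search-under-200ms", target=0.95, window_days=7)

# Hypothetical counts for one window. "Good" means a 2xx response for the
# first SLO and "completed within 200 ms" for the second.
print(orders_availability.is_met(good_events=999_200, total_events=1_000_000))
print(search_latency.is_met(good_events=940_000, total_events=1_000_000))
```

Expressing both availability and latency objectives as "good events over total events" keeps every SLO in the same form, which simplifies error budget calculations later.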

Setting the right target: Start with current performance, not aspirational numbers. If you’re currently at 99.5% availability, set the SLO at 99.5% (or slightly below). An SLO you’re already meeting is a baseline; you can raise it as you improve reliability. An aspirational SLO that you never meet creates alert fatigue and ends up ignored.

Not everything needs a 99.99% SLO. A developer dashboard doesn’t need the same reliability target as a payment API. Higher SLOs cost more to build and operate. Match the reliability target to the business criticality of the service.

SLAs and Consequences

An SLA is a legal or business agreement with defined consequences for missing the target. Consequences are typically financial: service credits, refunds, or contract penalties. SLAs are external-facing — they apply to customers or partners, not internal teams.

SLAs should be set conservatively — looser than your internal SLOs. If your internal SLO is 99.9% availability and you hit 99.85% one month, you want that to be an internal miss that drives improvement, not an SLA breach that triggers customer credits.

A common structure:

Internal SLO: 99.9% availability
External SLA: 99.5% availability (with service credits if breached)
Gap: 0.4 percentage points, your “safety margin” between internal target and external commitment
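To make the two thresholds concrete, here is a small illustrative helper that classifies a month's measured availability against the example SLO and SLA above. The function name and the returned labels are hypothetical.

```python
def classify_month(availability: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Classify measured availability against an internal SLO and an external SLA."""
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the internal SLO")
    if availability >= slo:
        return "ok"            # within the internal target
    if availability >= sla:
        return "slo_miss"      # internal miss, no customer-facing consequence
    return "sla_breach"        # contractual consequences apply


for measured in (0.9995, 0.9985, 0.9940):
    print(f"{measured:.2%}: {classify_month(measured)}")
```

The 99.85% month lands in the middle band: it drives internal improvement work without triggering service credits.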

Error Budgets

An error budget is the amount of unreliability permitted before breaching an SLO. It reframes reliability as a resource to be allocated, not just a floor to stay above.

If your SLO is 99.9% availability over 30 days:

Total requests in 30 days: 10,000,000
Allowed failures: 10,000,000 × (1 - 0.999) = 10,000 requests
Error budget: 10,000 failed requests (or about 43.2 minutes of full downtime in a 30-day window)
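The arithmetic above can be sketched in a few lines. The function names are illustrative, and the 2,500 failed requests used at the end are a made-up figure to show how remaining budget could be tracked.

```python
def error_budget_requests(total_requests: int, slo_target: float) -> int:
    """Number of requests allowed to fail in the window without breaching the SLO."""
    return round(total_requests * (1 - slo_target))


def error_budget_minutes(window_days: int, slo_target: float) -> float:
    """Minutes of complete downtime allowed in the window."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_remaining(total_requests: int, failed_requests: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = total_requests * (1 - slo_target)
    return 1 - failed_requests / budget


print(error_budget_requests(10_000_000, 0.999))           # allowed failed requests
print(f"{error_budget_minutes(30, 0.999):.1f} minutes")   # allowed full downtime
print(f"{budget_remaining(10_000_000, 2_500, 0.999):.0%} of budget left")
```

A negative result from `budget_remaining` means the budget is overspent, which is the signal for shifting effort from features to reliability work.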

The error budget creates a shared incentive between reliability and velocity:

Budget remaining: The team can deploy features, take risks, run experiments. The SLO says they can afford some failures.
Budget exhausted: No new features until reliability improves. Engineering effort shifts to reliability work. This is a policy, not a punishment — it’s enforced by the SLO, not by management.

Error budgets resolve the classic tension between dev (wants to ship fast) and ops (wants stability). Both teams agree to the SLO upfront; the error budget makes the trade-off explicit and data-driven.

Availability Math

Availability is typically expressed in “nines.” The number of nines determines how much downtime is tolerated per year:

Availability	Downtime per year	Downtime per month	Downtime per week
99% (two nines)	87.6 hours	7.3 hours	1.68 hours
99.9% (three nines)	8.76 hours	43.8 minutes	10.1 minutes
99.95%	4.38 hours	21.9 minutes	5 minutes
99.99% (four nines)	52.6 minutes	4.38 minutes	1 minute
99.999% (five nines)	5.26 minutes	26.3 seconds	6 seconds

Availability of dependent systems multiplies: If Service A depends on Services B and C, and failures are independent, A’s end-to-end availability is at most its own availability multiplied by the availabilities of B and C. Two 99.9% dependencies alone cap the combined availability at 99.9% × 99.9% ≈ 99.8%. Systems with many dependencies need each component to be significantly more reliable than the end-to-end target.
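Both the downtime table and the dependency rule follow from simple arithmetic, sketched below. The helper names are illustrative, and the model assumes hard (serial) dependencies with independent failures, which real systems only approximate.

```python
import math

HOURS_PER_YEAR = 365 * 24


def allowed_downtime_minutes(availability: float, period_hours: float) -> float:
    """Downtime permitted in a period at a given availability level."""
    return (1 - availability) * period_hours * 60


def serial_availability(*components: float) -> float:
    """Upper bound on availability when every component must be up at once."""
    return math.prod(components)


def required_component_availability(target: float, n: int) -> float:
    """Availability each of n equal serial components needs to reach the target."""
    return target ** (1 / n)


print(f"{allowed_downtime_minutes(0.999, HOURS_PER_YEAR) / 60:.2f} hours/year at 99.9%")
print(f"{serial_availability(0.999, 0.999):.4%} with two 99.9% dependencies")
print(f"{required_component_availability(0.999, 5):.4%} needed from each of 5 components")
```

The last line shows why deep dependency chains are expensive: to hold an end-to-end 99.9% across five serial components, each one has to run at roughly 99.98%.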

Design Considerations

Start with fewer, better SLOs. Three well-defined SLOs that the team actually looks at are worth more than twenty SLOs that generate alert fatigue. Start with availability and latency for your most critical user-facing endpoints.
Review SLOs quarterly. As your system evolves and user expectations change, SLO targets need to be revisited. A 99.5% SLO that made sense when you had 100 users may be inadequate with 1 million users paying for a premium product.
Exclude planned maintenance. Planned maintenance windows (database migrations, major deployments) should be excluded from SLI calculations. Include only unplanned downtime. Document the exclusion policy clearly to avoid disputes.
Use SLOs to drive alerting thresholds. Don’t alert on internal causes (CPU at 90%); alert on SLI degradation (error rate above 0.5%). Set alert thresholds at a fraction of the error budget: for example, alert when you’ve consumed 10% of your monthly error budget in a 1-hour window.
Make error budgets visible to leadership. Error budget burn rate dashboards make the reliability/velocity trade-off tangible for non-engineers. When leadership understands that “shipping this feature now costs us 20% of our error budget,” reliability discussions become data-driven rather than opinion-driven.
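The budget-based alerting rule mentioned above (alert when 10% of the monthly budget is consumed in one hour) can be expressed through a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes roughly constant traffic across the SLO window; the function names and thresholds are illustrative, not taken from any specific monitoring tool.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accumulating."""
    return error_rate / (1 - slo_target)


def budget_consumed(error_rate: float, slo_target: float,
                    window_hours: float, slo_window_days: int = 30) -> float:
    """Fraction of the whole error budget used during the alerting window.

    Assumption: traffic volume is roughly constant over the SLO window.
    """
    return burn_rate(error_rate, slo_target) * window_hours / (slo_window_days * 24)


def should_alert(error_rate: float, slo_target: float,
                 window_hours: float = 1.0, threshold: float = 0.10) -> bool:
    """Alert when the window has consumed at least `threshold` of the budget."""
    return budget_consumed(error_rate, slo_target, window_hours) >= threshold


# With a 99.9% SLO, an 8% error rate for one hour burns about 11% of the
# 30-day budget, while a 0.5% error rate burns well under 1%.
print(should_alert(error_rate=0.08, slo_target=0.999))
print(should_alert(error_rate=0.005, slo_target=0.999))
```

Pairing a short window and high threshold with a longer window and lower threshold is a common way to catch both fast outages and slow budget leaks with the same function.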