Site Reliability Engineering (SRE) brings a software engineering mindset to operations work—using automation, measurement, and systematic thinking to build and operate reliable systems. Pioneered at Google and now widely adopted, SRE provides principles and practices for managing systems at scale.
This guide provides a framework for SRE practices, addressing error budgets, SLOs, incident management, and organizational considerations.
Understanding SRE
What SRE Is
SRE is a discipline focused on reliability:
Engineering approach: Applying software engineering to operations.
Reliability focus: Explicit focus on system reliability as primary objective.
Automation mindset: Automating away operational toil.
Measurement culture: Quantifying reliability and making data-driven decisions.
Shared responsibility: Developers and SREs sharing production responsibility.
SRE vs. Traditional Operations
Traditional Ops:
- Separate from development
- Manual processes common
- Reactive orientation
- Capacity-based staffing
SRE:
- Integrated with development
- Automation expected
- Proactive and preventive
- Error budget-based staffing
SRE vs. DevOps
DevOps: Cultural movement emphasizing collaboration between development and operations.
SRE: Specific implementation of DevOps principles with defined practices.
Google characterizes SRE as "what happens when you ask a software engineer to design an operations function."
SRE Framework
Pillar 1: Service Level Objectives
Defining and measuring reliability:
Service Level Indicators (SLIs):
- Quantitative measure of service behavior
- Examples: latency, error rate, availability
- Measured from user perspective
Service Level Objectives (SLOs):
- Target value for SLIs
- Example: "99.9% of requests complete within 200ms"
- Reflects user expectations
Service Level Agreements (SLAs):
- Contractual commitments (often less aggressive than SLOs)
- Business consequences for breach
SLO design principles:
- User-centric measurement
- Realistic targets
- Actionable when breached
- Regularly reviewed
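The SLI and SLO concepts above can be made concrete with a small sketch. This is a minimal illustration, assuming a latency SLI like the "99.9% of requests complete within 200ms" example; the function names and sample latencies are illustrative, not a prescribed implementation.

```python
# Sketch: compute a latency SLI from observed request latencies and
# check it against an SLO target. Names and numbers are illustrative.

def latency_sli(latencies_ms, threshold_ms=200):
    """Fraction of requests completing within the latency threshold."""
    if not latencies_ms:
        return 1.0  # no traffic: treat the objective as met
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)

def meets_slo(sli, target=0.999):
    """True when the measured SLI meets or exceeds the SLO target."""
    return sli >= target

requests = [120, 95, 340, 180, 60, 210, 150, 90, 130, 170]
sli = latency_sli(requests)
print(f"SLI: {sli:.3f}, meets 99.9% SLO: {meets_slo(sli)}")
```

Note the user-centric framing: the SLI counts individual requests as "good" or "bad" rather than averaging latencies, which matches the design principle of measuring from the user's perspective.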
Pillar 2: Error Budgets
Balancing reliability and velocity:
Error budget concept:
- If SLO is 99.9%, error budget is 0.1%
- Budget available for failures, releases, experimentation
- Connects reliability to release velocity
Error budget policy:
- When budget remains: release freely
- When budget depleted: focus on reliability improvements
- Creates alignment between development and reliability goals
Benefits:
- Objective reliability decisions
- Balances speed and stability
- Creates shared incentive
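The error budget arithmetic is simple enough to sketch directly. This example assumes a 30-day rolling window and an availability-style SLO; the window length and figures are illustrative.

```python
# Sketch: translate an SLO into an error budget (allowed downtime)
# and track how much of that budget remains. A 99.9% SLO over a
# 30-day window leaves a 0.1% budget, about 43.2 minutes.

def error_budget_minutes(slo, window_minutes=30 * 24 * 60):
    """Allowed downtime in the window for a given SLO."""
    return (1 - slo) * window_minutes

def budget_remaining(slo, downtime_minutes, window_minutes=30 * 24 * 60):
    """Fraction of the error budget still available (negative if blown)."""
    budget = error_budget_minutes(slo, window_minutes)
    return (budget - downtime_minutes) / budget

print(f"Budget at 99.9%: {error_budget_minutes(0.999):.1f} min")
print(f"Remaining after 20 min down: {budget_remaining(0.999, 20):.1%}")
```

An error budget policy can then key off `budget_remaining`: release freely while it is positive, and shift effort to reliability work when it approaches zero.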
Pillar 3: Toil Reduction
Automating operational work:
Toil definition:
- Manual, repetitive work
- No enduring value
- Tends to grow with service
Toil management:
- Measure and track toil
- Prioritize automation
- Engineering time for toil reduction
Target: Keep toil below 50% of SRE time.
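Tracking toil against the 50% target only requires categorized time data. A minimal sketch, assuming a simple weekly time log; the categories and hours are illustrative.

```python
# Sketch: measure toil as a share of team time and compare it to the
# 50% target. Log entries are (category, hours) pairs; all names are
# illustrative.

TOIL_TARGET = 0.5

def toil_fraction(time_log):
    """Share of total logged hours spent on toil."""
    total = sum(hours for _, hours in time_log)
    toil = sum(hours for cat, hours in time_log if cat == "toil")
    return toil / total if total else 0.0

week = [("toil", 14), ("engineering", 18), ("oncall", 8)]
frac = toil_fraction(week)
print(f"Toil: {frac:.0%} (target: below {TOIL_TARGET:.0%})")
```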
Pillar 4: Incident Management
Responding to and learning from failures:
Incident response:
- Clear roles (incident commander, communications, operations)
- Defined process
- Severity levels
- Escalation paths
Blameless postmortems:
- Learning from incidents
- Focus on systems, not individuals
- Action items to prevent recurrence
- Shared widely
Incident metrics:
- Mean time to detect (MTTD)
- Mean time to respond (MTTR)
- Incident frequency and severity
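The incident metrics above can be computed from incident timestamps. A sketch under the assumption that each incident records start, detection, and resolution times; definitions of MTTR vary between teams (time to respond vs. time to resolve), so the measurement boundaries here are a choice, and the sample incidents are invented for illustration.

```python
# Sketch: compute MTTD and MTTR from incident records.
# Records and timestamps are illustrative.
from datetime import datetime

incidents = [
    {"start": datetime(2024, 1, 5, 10, 0),
     "detected": datetime(2024, 1, 5, 10, 8),
     "resolved": datetime(2024, 1, 5, 11, 0)},
    {"start": datetime(2024, 1, 12, 2, 30),
     "detected": datetime(2024, 1, 12, 2, 34),
     "resolved": datetime(2024, 1, 12, 3, 10)},
]

def mean_minutes(incidents, frm, to):
    """Mean elapsed minutes between two recorded timestamps."""
    deltas = [(i[to] - i[frm]).total_seconds() / 60 for i in incidents]
    return sum(deltas) / len(deltas)

# MTTD: start -> detected; MTTR here: start -> resolved
print(f"MTTD: {mean_minutes(incidents, 'start', 'detected'):.1f} min")
print(f"MTTR: {mean_minutes(incidents, 'start', 'resolved'):.1f} min")
```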
Pillar 5: Production Excellence
Operating systems well:
Monitoring and alerting:
- Comprehensive observability
- Actionable alerts (avoid alert fatigue)
- SLO-based alerting
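SLO-based alerting is often implemented as burn-rate alerting: alert on how fast the error budget is being consumed, not on raw error counts. A sketch of the idea; the multi-window pattern and the 14.4 threshold follow a commonly cited fast-burn configuration, but the exact values are assumptions to tune per service.

```python
# Sketch: SLO burn-rate alerting. A burn rate of 1 consumes exactly
# the error budget over the full SLO window; higher values exhaust
# it early. Requiring both a short and a long window to burn fast
# reduces alert flapping. Thresholds are illustrative.

def burn_rate(error_ratio, slo=0.999):
    """How fast the budget burns relative to steady, exact spend."""
    return error_ratio / (1 - slo)

def should_page(short_window_errors, long_window_errors, slo=0.999,
                threshold=14.4):
    """Page only when both windows show a fast burn."""
    return (burn_rate(short_window_errors, slo) >= threshold and
            burn_rate(long_window_errors, slo) >= threshold)

# 2% errors against a 99.9% SLO burns the budget 20x too fast
print(f"burn rate: {burn_rate(0.02):.1f}")
print(f"page? {should_page(0.02, 0.018)}")
```

Because alerts fire on budget consumption, a brief error blip that barely dents the budget stays quiet, which directly supports the "actionable alerts" principle above.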
Capacity planning:
- Demand forecasting
- Capacity provisioning
- Headroom management
Change management:
- Safe deployment practices
- Canary deployments
- Feature flags
- Rollback capability
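Canary deployments and feature flags both rest on the same mechanism: deterministically routing a percentage of users to the new behavior. A minimal sketch using hash-based bucketing; the flag name, percentages, and user ids are illustrative.

```python
# Sketch: deterministic percentage rollout for a canary or feature
# flag. Hashing the (flag, user) pair gives every user a stable
# bucket, so the same user consistently sees the same behavior,
# and rollback is just lowering the percentage.
import hashlib

def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Stable bucketing: same inputs always give the same decision."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # uniform in [0, 1)
    return bucket < percent / 100

# Ramp up gradually (1% -> 10% -> 50% -> 100%), dropping the
# percentage back if SLO-based alerts fire during the rollout.
for pct in (1, 10, 50, 100):
    served = sum(in_canary(f"user-{i}", "new-checkout", pct)
                 for i in range(10_000))
    print(f"{pct:>3}% rollout -> {served} of 10000 users")
```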
Implementation Approach
Getting Started
Adopting SRE practices:
Start with SLOs: Define what reliability means for your services.
Measure what matters: Implement SLI measurement.
Adopt incrementally: Don't try everything at once.
Start with critical services: Focus SRE practices on high-value services.
Organizational Considerations
SRE team structure:
Embedded model: SREs embedded in product teams.
Centralized model: Dedicated SRE team serving multiple products.
Hybrid model: Core SRE team with embedded members.
Staffing: Typically 1 SRE per 5-10 developers (varies widely).
Cultural Considerations
SRE success requires culture:
Blamelessness: Learning from failure without punishment.
Collaboration: Development and SRE working together.
Engineering mindset: Operations as engineering problem.
Continuous improvement: Always improving reliability.
Key Takeaways
- Reliability is measurable: SLOs and error budgets make reliability concrete.
- Error budgets balance speed and stability: They create a shared incentive around reliability.
- Automate toil: Take an engineering approach to operational work.
- Learn from incidents: Blameless postmortems drive improvement.
- SRE is cultural: Practices require cultural support to succeed.
Frequently Asked Questions
How do we set SLOs? Start with user expectations. Measure current performance. Set achievable targets. Iterate based on experience.
What if we don't have error budget left? Pause non-essential releases. Focus on reliability improvements. Investigate root causes. Rebuild budget before continuing.
How big should our SRE team be? It varies. A common ratio is 1 SRE per 5-10 developers; more critical or complex services warrant a higher ratio.
Do we need dedicated SREs or can developers do SRE? Either can work. Dedicated SREs provide focus; developer SRE ("you build it, you run it") provides ownership. Many organizations use hybrid.
What about on-call? SRE typically shares on-call with development. Keep rotations sustainable, compensate for on-call burden, and invest in alert quality.
Where should we start with SRE? Start with SLOs for critical services. Implement measurement. Introduce error budgets. Build from there.