Why Site Reliability Engineering (SRE)?
SaaS/PaaS companies in a growth phase need to (a) increase velocity and (b) consistently improve reliability, while (c) staying lean. Meeting these objectives calls for an SRE practice, if one is not already in place. Below is my recap; if you need a deeper dive into SRE, refer to books and blogs focused on SRE practices.
What’s the Role of SRE?
To facilitate: (a) rapid development, (b) regression-free release cycles, (c) solving complex problems through automation, and (d) sharing production knowledge across engineering.
‘SRE is all about minimizing the risk.’
Where Should the Focus Be for Implementing SRE?
How you drive SRE changes depends heavily on your engineering team structure, culture, and the maturity of your SDLC process. At a high level, the following areas are a good framework for getting started with SRE.
1. Collaboration and Communication
SRE is an engineering culture and is not specific to any vertical function. It will only succeed if all engineering leaders come together and form a shared vision.
(a) Set shared goals and objectives across engineering.
(b) On-board leaders to embrace and execute.
(c) Embrace a culture of setting Service Level Objectives (SLOs) for the various service offerings. How far do we want to go in terms of reliability and stability, given that every extra step comes at a cost?
(d) Define product/service ownership clearly, which includes setting SLOs as part of the production readiness requirements.
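To make the "how far do we want to go" question concrete, here is a minimal sketch of the arithmetic behind an availability SLO. The function name and the 30-day window are assumptions made for illustration, not something prescribed by this recap.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime an availability SLO permits over the window.

    Hypothetical helper: the 30-day default window is an assumption.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    # The part of the window that is NOT covered by the objective.
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(slo)
        print(f"{slo}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Each additional "nine" shrinks the allowed downtime by a factor of ten, which is one way to frame the cost discussion when leaders agree on a target.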
- Communication: Consistently communicate to all members, across engineering teams. Each leader should take ownership to drive this in whatever style/manner suits them.
(a) Why are we doing it, and what are the pros and cons? (Do mention the business drivers.)
(b) What is expected from each team/member, and what will the new structure look like?
(c) How is leadership helping to close knowledge gaps and mitigate the challenges/risks ahead?
(d) High-level milestones with timeline roadmap — how do we get there?
‘Target objectives that are actually achievable.’
2. People and Team Structure
(Mix the talent, but optimize the best.)
- Culture: Seek help, collaborate, teach and be accountable
- Hire the right talent (talent for the future as well as the breadth of experience).
- Structure the team to spread knowledge evenly.
- Hire to solve complexity and not to scale (scale through automation).
- The team should be excited about solving complex problems.
- >50% of SRE time should be spent on quality engineering work.
‘SRE culture should be a force multiplier.’
3. Tooling / Platform
(Start with a POC (start small) and build upon it.)
- Use the STAR (Situation, Task, Action, Result) approach to implement any change. Do not introduce a change, however small, without a clear objective and expected result.
- Implementation success: Listen to users (engineers), adapt, and drive the adoption.
- Templates > (a) Dashboards with key performance metrics (Keep it Simple, Stupid), (b) Playbooks for best practices, (c) Common plugins/templates for monitoring and security.
- Automated tools/playbooks are version controlled and follow the self-service model.
‘Automation focus is towards solving complex problems.’
4. Release Engineering (RE)
Release Engineering should not be an afterthought. Instead, it is one of the primary pillars to build, so that releases are tested and validated consistently. This is the core function of SRE:
- Principles: (a) Self-Service, (b) High-Velocity, (c) Consistency, (d) Enforce ACLs and Policies
- Quality and security should be shifted to the left of SDLC and should be part of release engineering automation.
- Every release should be validated through the lens of KPIs and compared against the set SLOs.
- Load and capacity are to be tested and certified pre-release.
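- Build a testing environment that catches defects before release, so that bugs are fixed before they reach production and production MTTR stays as close to zero as possible.
- Certify production readiness before onboarding or releasing any service.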
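The idea of validating a release's KPIs against the set SLOs could be sketched as a simple gate like the one below. The `Slo` class, the metric names, and the thresholds are all hypothetical; a real pipeline would pull the measurements from its own monitoring or load-test results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical objective for one KPI."""
    name: str
    threshold: float
    higher_is_better: bool  # True for availability, False for latency


def violations(slos, measured):
    """Return a human-readable list of SLOs the candidate release misses."""
    failed = []
    for slo in slos:
        value = measured.get(slo.name)
        if value is None:
            # A KPI that was never measured cannot be certified.
            failed.append(f"{slo.name}: no measurement")
            continue
        if slo.higher_is_better:
            ok = value >= slo.threshold
        else:
            ok = value <= slo.threshold
        if not ok:
            failed.append(f"{slo.name}: measured {value}, objective {slo.threshold}")
    return failed


if __name__ == "__main__":
    slos = [
        Slo("availability_percent", 99.9, higher_is_better=True),
        Slo("p99_latency_ms", 300.0, higher_is_better=False),
    ]
    candidate = {"availability_percent": 99.95, "p99_latency_ms": 420.0}
    problems = violations(slos, candidate)
    print("release blocked:" if problems else "release certified", problems)
```

Treating a missing measurement as a failure is a deliberate choice in this sketch: it keeps unmonitored KPIs from slipping through the gate.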
‘Service reliability is all about knowing a user’s tolerance level.’
5. Monitoring
- Instrument both the client side and the server side to collect time-series data.
- Make it easy for developers to add various metrics to monitoring.
- Build a set of reusable SLI templates for each common metric; this also makes it simpler for everyone to understand what a specific SLI means.
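One possible shape for a reusable SLI template is a "good events over total events" ratio with a name and a plain-language description attached. This is only a sketch under that assumption: the `RatioSli` class and the two example templates are hypothetical and not tied to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioSli:
    """Hypothetical SLI template: percentage of good events among all events."""
    name: str
    description: str

    def value(self, good_events: int, total_events: int) -> float:
        if total_events < 0 or not 0 <= good_events <= max(total_events, 0):
            raise ValueError("expected 0 <= good_events <= total_events")
        if total_events == 0:
            # Assumption: a window with no traffic counts as meeting the SLI.
            return 100.0
        return 100.0 * good_events / total_events


# Shared templates that every team can reuse with its own counters.
AVAILABILITY = RatioSli(
    "availability",
    "Percentage of requests that did not fail with a server error",
)
LATENCY = RatioSli(
    "latency",
    "Percentage of requests answered faster than the agreed threshold",
)

if __name__ == "__main__":
    print(AVAILABILITY.name, AVAILABILITY.value(good_events=999, total_events=1000))
    print(LATENCY.name, LATENCY.value(good_events=950, total_events=1000))
```

Keeping the description next to the formula is what lets everyone read a given SLI the same way, whichever service reports it.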
‘If you cannot monitor something, then it has no business being in production.’
6. Debrief/RCA (PostMortem)
- Fix once (tactical and strategic).
- Blame none and focus on the solution, process, and technology.
- Have a process to keep track of issues and actions taken to fix, through the proper release cycle.
- Conduct periodic architecture and production review meetings.
‘You cannot find the right solution, without experiencing a problem.’
Note: Fearlessness and collaboration are key characteristics of an effective SRE team.