By: Martin Smith (@martinb3)
Edited by: Jennifer Davis (@sigje)
Abstract
SRE was born out of thinking about reliability as a product feature. However, the industry’s focus in the last few years on SLOs, Error Budgets, Production Engineering teams, and the other things that constitute "doing SRE" sometimes means teams don’t take a product-centric approach to reliability, and they lose some of its advantages as a result. This post covers some project maturity levels, some suggestions for thinking about reliability as an SRE engaged in those kinds of projects, and the kinds of collaboration that might be most successful in driving reliability-as-product-feature in each phase.
A brief history
Site Reliability Engineering, or SRE for short, was born in 2003 out of a need to improve service reliability at Google. Often described as “an implementation of DevOps,” the practice of SRE aims to treat operations as a software problem that can be addressed through software engineering techniques.
And according to a survey by the DevOps Institute, SRE has truly taken off. This approach has been widely adopted, with 22% of organizations in 2021 saying they have an SRE team. This shift can also be seen in the rise of conferences like USENIX’s SREcon, which began in 2014, and in the release of the popular “Google SRE book” two years later in 2016.
Whether or not your organization has an SRE team that plans work using SLOs and Error Budgets, regularly reduces toil through automation, or has adopted one of the many SRE rules of thumb, the basic premise of what impact SRE can have sometimes gets lost -- that operations is a software problem. Or, shifting the focus back to the customer perspective, that reliability is a product feature that we build.
Having held DevOps Engineer and Site Reliability Engineer roles in the past, and having been a technical lead for SRE teams, I’ve had many opportunities to define the role, activities, and most importantly, the impact of an SRE team. In each case, I’ve found that focusing back on our customers’ experience of reliability has been the most useful framing when speaking to company leaders about an SRE team’s “why,” instead of reciting a long, confusing list of things SREs might do in a quarter. I’ve also found that it’s an easy litmus test for myself to ensure I’m working on the right things at the right time. If I can’t explain how my work affects customer reliability, keeping in mind that reliability for operators usually leads to reliability for customers, it might be a sign that I need to work on something else.
Shifting focus back to product reliability
Shifting the focus from operations and software engineering to talking about reliability as a product feature has some major benefits. First, it helps our organizations better understand what reliability might mean for them and their product(s) -- whether that’s resilience (tolerant of failure), scalability (can function with large volumes of work), observability (understanding internal state from outputs), or security (trust of the system). These are all product capabilities that often aren’t well understood, but fundamentally all matter to customers.
Reliability benefits from product management support (communication with stakeholders, building roadmaps, helping with prioritization and decisions, etc). For example, do you know who your internal stakeholders are for the scalability of your product? What’s on the roadmap for observability over the next 6 months? 2 years? And importantly, what metrics will you collect to be sure you’ve accomplished those goals and delivered on that roadmap? How does it align with other features’ roadmaps? As a friend and former colleague of mine says, “reliability is a product feature whether you devote engineering time to it or not.” If you don’t explicitly plan for that, your customers will implicitly make their own assumptions about your reliability.
Reliability may start to sound like any other product feature, with both internal and external stakeholders, and that’s by design. Making reliability an explicit part of your organizational planning also has many benefits. Thoughtworks’ Technology Radar (Volume 25) from October of this year recommends adoption of this kind of thinking -- that even internal teams should think of themselves as product teams. They also recommend using concepts from the popular Team Topologies book to figure out how to organize these internal teams. In reviewing examples of team structures from the book, I found that many organizations have adopted Simon Wardley’s Pioneer-Settler-Town Planner (or “PST”) framework, too.
Let’s take a look at how one might apply these two ideas (reliability as a product feature, having a specific team profile) to improve the effectiveness of an SRE team.
- First, there’s no one-size-fits-all approach to improving reliability; different stages of a project will benefit from different kinds of SRE involvement. In this post, I’ll divide products/services into three levels of maturity: beginning, growing, and established.
- Then, I’ll describe what kinds of SRE work could be most effective at that maturity level, using the PST framework.
Here’s a graphic that explains the PST framework’s three kinds of roles/activities in more detail.
Figure: Team Profiles, from the blog post “Pioneers, Settlers and Town Planners” by Simon Wardley
Beginning phase (with Pioneer SREs)
In new projects, there’s often uncertainty and unanswered questions. Small changes in direction could have large future benefits, but experimental work may be completely discarded, too. SREs can drive reliability at this stage by helping teams build prototypes, fail faster, and make agile decisions, all with reliability as a top-of-mind concern.
Have you ever had a project get close to production/release without thinking about reliability or operational burdens? “Pioneer SREs” can help. They should be part of the team that’s working to deliver a new product, evaluate vendors, build out proofs of concept, or make major architectural changes. At this stage of a project, any work needed to “cover” reliability gaps should be identified, and entire directions may change because of reliability concerns raised by the team.
Embedding in a team building the new product or feature is a great way for SREs to drive reliability early on in these kinds of projects. When teams only consult briefly on reliability or operational concerns, often the final output doesn’t adequately reflect customer or engineering expectations of reliability of the product or operability of the internals.
The success of Pioneer SREs can be measured by looking at how quickly new products or features show up on the roadmap, how quickly vendor implementations happen, or how quickly a project moves from “exploration” to “concrete proposal.”
The largest risk in this phase is having your SRE team end up as owners of the system’s reliability, since they helped design it. Hiding the overall reliability of your system from the other developers, behind an SRE team, will typically turn into a situation where the SRE team ends up being treated as an operational team for any product/service problems. Well-scoped embedding engagements can help avoid this problem by emphasizing that embedded SREs are a training resource for the rest of the team to learn from, not coverage for the team once the embedding is over.
Growing phase (with Settler SREs)
In this phase, projects are often working to build production-quality infrastructure, launch to customers, or scale to the required audience. SREs can help actually build mature and scalable components from the initial prototypes. They could also level up the engineering organization on how to prepare for any new operational burdens by emphasizing best practices like automating away toil or choosing good SLOs.
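As one way to ground a conversation about “good SLOs,” here is a minimal sketch of the arithmetic behind an error budget, assuming an availability SLO expressed as a fraction and measured in minutes over a rolling window. The function names and the 30-day default window are illustrative assumptions, not part of any standard tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(
    slo_target: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 21.6 minutes of downtime, about half of that budget is left.
    print(round(budget_remaining(0.999, 21.6), 2))
```

Seeing the budget as a number of minutes can make the trade-off easier to discuss with a team: a stricter target shrinks the room for launches and experiments, and a looser one may not match what customers expect.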
Continuing to embed with teams is a great way for SREs to have a hand in the reliability of a nearly-launched product or feature, especially if SREs influence the team to build observability, scalability, and security into the product. Consulting with teams on production readiness, especially for brand new teams or brand new services, is another way that SREs can ensure that everything reaching production will meet the original reliability requirements of the product, as well as operational best practices (e.g. automation instead of manual database migrations).
In this phase, it is especially important for SREs to build and maintain a shared definition of Production Readiness as a product or organization scales. This ensures a consistent approach to reliability across products or services, and it creates a minimum bar for reliability that must be satisfied. SREs at this stage may even build automation into a pipeline to guarantee minimum scale or ensure resilience to specific failures.
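A pipeline check for a minimum reliability bar might look something like the sketch below. Everything here is hypothetical: the `ServiceManifest` fields, the `MIN_LOAD_TESTED_RPS` threshold, and the function names are assumptions made for illustration, and a real checklist would come from your own organization’s definition of Production Readiness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ServiceManifest:
    """Hypothetical description of a service that is about to launch."""
    name: str
    has_slo: bool
    has_runbook: bool
    has_oncall_rotation: bool
    load_tested_rps: int  # highest request rate the service passed in a load test


# Assumed minimum bar, for illustration only; yours will differ.
MIN_LOAD_TESTED_RPS = 100


def readiness_gaps(service: ServiceManifest) -> List[str]:
    """Return every unmet readiness requirement (an empty list means ready)."""
    gaps = []
    if not service.has_slo:
        gaps.append("no SLO defined")
    if not service.has_runbook:
        gaps.append("no runbook")
    if not service.has_oncall_rotation:
        gaps.append("no on-call rotation")
    if service.load_tested_rps < MIN_LOAD_TESTED_RPS:
        gaps.append(
            f"load tested to {service.load_tested_rps} rps, "
            f"below the minimum of {MIN_LOAD_TESTED_RPS}"
        )
    return gaps


def gate_exit_code(service: ServiceManifest) -> int:
    """Print any gaps and return an exit code a pipeline step could use."""
    gaps = readiness_gaps(service)
    for gap in gaps:
        print(f"{service.name}: {gap}")
    return 1 if gaps else 0
```

A pipeline step could call `gate_exit_code` and pass the result to `sys.exit`, so a launch is blocked until the listed gaps are closed. Reporting every gap at once, rather than stopping at the first, tends to make the check feel more like guidance and less like a wall.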
The success of Settler SREs can be measured by looking at how many new services and features are safely being launched into production, as well as examining things like ease of observability (e.g. effective logging, metrics, or monitoring). Success in this phase is also about establishing patterns that make projects successful (e.g. proposal templates). Project retrospectives are a great way to find those patterns as well as improve SRE engagement with the project.
Established phase (with Town Planner SREs)
In this most mature phase, products or services are usually already generally available, and systemic issues like overall architecture or developer tooling are the most likely to impact reliability.
SREs can influence reliability here by identifying and working to resolve systemic reliability issues (e.g. repeated incidents, poor SLO choices, or lack of on-call process). Driving continuous improvement is a very common way that SREs influence reliability at this phase.
In addition, SREs can often identify ways to reduce operational burdens or eliminate large scale toil during this phase, whether through technical automation or architecture changes, or through helping teams build the processes, knowledge, skills, tools, and techniques they need for large scale projects to be repeatedly successful and reliable.
This can be a phase where some SREs will feel there’s a stigma associated with doing less technical work, but the impact of this work cannot be overstated -- it’s where SRE can act as a true multiplier as more and more teams and products/services are launched. Examples include running an incident management program, SLA program, on-call program, Disaster Recovery/Business Continuity planning, or even a Chaos Engineering program. A strategy to address this concern is to pair SREs with a technical program management function (TPM) so that SREs can focus most on the technical aspects of improvement while TPMs can help with the organizational changes needed to improve a process or execute a program.
Measuring the success of Town Planner SREs can be especially tricky. You might look for simple metric improvements like fewer incidents, reduced incident duration, reduced pages, improved SLO targets, or number of DR tests -- but isolating the SRE team’s impact on these kinds of metrics can be difficult. Qualitative feedback from an SRE team’s internal customers is also frequently used to measure success at this stage. The most impactful SREs at this stage tend to cause paradigm shifts for the other development teams, and often even for their own SRE teammates.
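For the simple metrics mentioned above, a comparison between two periods can be sketched in a few lines. This is a minimal illustration that assumes incidents are recorded with a start and a resolution time; the `Incident` type and the function names are hypothetical, and the output says nothing about who caused a change.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with start and resolution timestamps."""
    started: datetime
    resolved: datetime


def summarize(incidents: List[Incident]) -> Dict[str, float]:
    """Count incidents and compute their mean duration in minutes."""
    durations = [
        (incident.resolved - incident.started).total_seconds() / 60
        for incident in incidents
    ]
    return {
        "count": float(len(durations)),
        "mean_minutes": mean(durations) if durations else 0.0,
    }


def change(before: List[Incident], after: List[Incident]) -> Dict[str, float]:
    """Return after minus before for each metric; negative means a reduction."""
    earlier, later = summarize(before), summarize(after)
    return {key: later[key] - earlier[key] for key in earlier}
```

Numbers like these are best read alongside the qualitative feedback, since a quieter quarter can have many causes besides SRE work.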
Wrapping up
“[PST is] how you take a highly effective company and push it [...] towards a continuously adaptive system.” (@swardley, May 8th, 2020)
I hope that the grouping above is useful to readers for structuring work to drive reliability at various levels of product maturity. Reliability-as-a-product-feature isn’t a magic bullet for an organization that doesn’t understand where it fits in the market or what kind of value it delivers, nor will it make a large difference with an unhealthy product management practice that might not know how to develop and drive delivery of a product and its features over time.
As mentioned earlier, there usually isn’t a “one-size-fits-all” approach to driving reliability. You may still need to establish some best practices for your organization such as “Limit toil to 50% of our work” or “Every product feature that goes live must have a reliability review.” Combined with these kinds of rules of thumb, the proposed divisions and strategies above should help focus your team(s) to make the biggest improvement to reliability for your products and services.
In researching this post, it was helpful to review how various organizations and companies “do SRE.” Continuous improvement was a clear shared trait among them. It’s also worth reviewing the huge amount of content out there about how SRE can effectively collaborate with other teams (e.g. embedding SREs); a poor relationship or failed collaboration with another team can jeopardize all of your efforts.
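A rule of thumb like “Limit toil to 50% of our work” is easier to hold yourself to if you can check it. Here is a minimal sketch, assuming the team tracks hours by category and agrees on which categories count as toil; the category names, the `TOIL_LIMIT` constant, and the function names are illustrative assumptions.

```python
from typing import Dict, Set

# The 50% figure comes from the rule of thumb quoted above.
TOIL_LIMIT = 0.5


def toil_fraction(hours_by_category: Dict[str, float], toil_categories: Set[str]) -> float:
    """Share of all tracked hours that fall into categories labeled as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours recorded")
    toil = sum(
        hours
        for category, hours in hours_by_category.items()
        if category in toil_categories
    )
    return toil / total


def over_toil_limit(hours_by_category: Dict[str, float], toil_categories: Set[str]) -> bool:
    """True when toil exceeds the agreed limit for the period."""
    return toil_fraction(hours_by_category, toil_categories) > TOIL_LIMIT
```

For example, a week with 10 hours of interrupts, 6 hours of manual deploys, and 24 hours of project work, where the first two count as toil, comes out to a toil fraction of 0.4, which is under the limit.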
I invite and encourage you to write about and share your own experiences, both good and bad, focusing on reliability as a first class product feature at your organization. Special thanks to my own SRE team for the many discussions and ideation sessions on how we can best work to drive reliability. And special thanks to Jennifer Davis, Michael Lumsden, David Nolan, Jordan Rinke, and Kerim Satirli for feedback and editing on this post.