December 10, 2021

Day 10 - Assembling Your Year In Review

By: Paige Bernier (@alpacatron3000)
Edited by: Jennifer Davis (@sigje) and Scott Murphy (@ovsage)

Intro

There are a few moments in my career when I have been struck by a story told with data. When I set out into the big wide world as a Site Reliability Engineer, I wanted to capture that data storytelling magic, so I adapted a presentation I call the “Year in Review”.

My first company had a tradition of taking a moment to pause and review the year by the numbers. The showstopper was the chart showing the amount of data ingested year over year since the founding.

In a single glance that chart conveyed a story that would take hours to tell!

It communicated the incredible efforts employees made to scale the system to handle ingesting, processing, publishing, and storing an ever-increasing mountain of data. It illustrated how far the company had come, and we were confronted head-on with the realization that “what got you here, won’t get you there”.

The biggest impact I have seen comes after the presentation. Discussions from Year in Reviews have sparked sweeping oncall management changes as well as minor, but important, changes in the way developers engage with the SRE team.

Before diving into implementation details, let’s look at why this type of data storytelling is such a powerful tool by examining the core purpose of SRE.

The Mission of SRE

The mission of an SRE team is to improve system reliability by facilitating change.

System reliability is the sum of hundreds of decisions humans make when developing, deploying, and maintaining software systems; it is not an intrinsic property [1] of the systems (O’Connor, 1998). SRE job descriptions tout phrases like “evangelize a DevOps culture” and “influence without authority”, acknowledging our roles as change agents.

And as often heard, “change is hard”. As change agents, we are often faced with conflicting priorities, multiple internal and external stakeholders, and fear of the new and unknown.

However, just as often we hear “change is the only constant”. Whether it’s hardware improvements, operating system upgrades, security vulnerability announcements, software dependencies, or the software that we manage as a service, we are constantly monitoring and implementing change.

Combine these two axioms for extra difficulty:

Ask any engineer who has been forced into a major operating system upgrade while the version of the software they’re running still requires the previous OS.

As an SRE I often want to make changes across the entire engineering organization, such as developing oncall onboarding, ensuring that we are monitoring the customer’s experience, clarifying the lines of responsibility between developers and operators, and more!

These types of changes, which affect everyone, are difficult to implement effectively until two things are true:

  • There is a shared understanding of the current state.
  • There is agreement that the current state needs to change.

This does not mean there needs to be consensus on what changes need to be made!

Is there a shared understanding of the current state?

The answer to this can be a resounding “Yes!” after your Year in Review presentation. Here’s why:

Humans learn best from stories, feelings, senses, and opinions, commonly known as qualitative data. Focus on these exclusively and you risk coming to broad conclusions without nuance or context.

Businesses claim to operate on data, facts, and figures, or quantitative data. Focus purely on the numbers and you risk drowning in details that lead down irrelevant rabbit holes.

In fact, the two seemingly disparate viewpoints aren’t at odds at all. You can even validate findings from one category of data by using the other:

Feel: “Our monitoring sucks, none of the last 5 pages I got were actionable”

Fact: The primary oncall was paged 5 times outside business hours last week

Finding: Team X is getting paged frequently for non-actionable reasons

Hosting a “Year in Review” means weaving together the quantitative data about what occurred in your systems and the qualitative “anec-data” from the human perspective to build a foundation for introducing change.

Is there agreement that the current state needs to change?

This is a more complex endeavor: identifying and implementing change is the hard work of collaborating across teams, roles, and competing incentives, motives, and needs. Think of the “Year in Review” as a springboard for driving the discussion and debate needed to align on “do we agree something needs to change?”

What does this look like in practice?

At a previous company I heard from engineers and managers alike that the oncall rotations were in need of a shake-up. This was an excellent starting place: everyone agreed that there was a problem, but we were having trouble implementing the necessary changes.

With the goal of identifying exactly what the oncall issues were, my team tailored a “Year in Review” focused mainly on oncall metrics such as alert noise, hours oncall per engineer, and pages received per engineer. Slides illustrated the deluge of alert storms that no human could possibly investigate in a given shift and that were largely unactionable noise. The impact of not addressing this problem was clear: we were likely missing important signals in the noise, and oncalls weren’t able to effectively prioritize their time.
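
For illustration, here is a minimal sketch (not the actual queries my team used) of how a few of these oncall metrics could be tabulated from a hypothetical pages.csv export of your paging tool, using made-up column names:

  # A sketch only: assumes a hypothetical pages.csv export with columns
  # alert_name, engineer, triggered_at, urgency (one row per page).
  import pandas as pd

  pages = pd.read_csv("pages.csv", parse_dates=["triggered_at"])

  # Noisiest alerts: which alert names paged the most this year?
  noisiest = pages["alert_name"].value_counts().head(10)

  # Pages received per engineer
  pages_per_engineer = pages["engineer"].value_counts()

  # After-hours pages: anything outside 09:00-18:00
  hour = pages["triggered_at"].dt.hour
  after_hours = pages[(hour < 9) | (hour >= 18)]

  print("Top 10 noisiest alerts:", noisiest, sep="\n")
  print("Pages per engineer:", pages_per_engineer, sep="\n")
  print(f"After-hours pages this year: {len(after_hours)}")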

After reviewing the data as a group, my team facilitated a brainstorm to address the barriers to changing the rotations:

  • How to handle ownership when multiple teams contribute code?
  • What are the “hot potato” services no one feels comfortable owning?
  • What services are unofficially owned by a single engineer and need documentation?
  • What is the goal of a low urgency or warning alert?

Based on the main discussion and others in standups and sidebars, my team proposed new team-service ownership and rotations. Several weeks and a few rounds of revisions later, we merged the PR with our new Terraformed oncall rotations!

DIY “Year in Review”

So, how do you create a “Year in Review” for an SRE team? To start, I typically have a few things in mind about what I think happened and what the data will show. It is fascinating to see where your perception of the system and reality diverge. You can kick off your process by asking a couple of questions:

  • What story are you expecting the data to tell?
  • What changes do you think need to be made in the next year to improve reliability?

With those answers in hand, work through the following steps:

  1. Book a meeting with all parties (including engineers, managers, SRE, QA, ops, and product managers). If there is an existing meeting like an All-Hands or Demo Hour, sign up for a presentation slot.
  2. Kick off a brainstorming session and have participants list out changes to include, such as new features launched, infrastructure expansions to new regions, or even a doubling of the organization’s size.
  3. Ask teams (including managers)
    1. What data they would find interesting
    2. What data they could contribute from their domain
  4. List the company-specific tooling for data sources like:
    1. Version Control
    2. CI/CD
    3. Monitoring
    4. Incident Management
    5. Ticket tracking system
    6. Documentation store
    7. Support ticket system
  5. Enlist the help of others to gather the interesting metrics over the past year or year over year. Some suggestions are:
    1. Noisiest alerts
    2. Number of environments
    3. Oncall engineers
    4. Number of services
    5. Ratio of oncall engineers to the number of services they are oncall for
    6. Age of dependencies/libraries
    7. # of hours oncall per person
    8. Number of features launched
    9. # of after hour pages
    10. Ratio of warning alerts to pages
    11. Number of production deploys rolled up by day
    12. Number of open incident action items (AIs)
    13. Ingress traffic or other indicator of system load
    14. Most viewed documentation pages
    15. Most searched documentation terms
    16. Time to first PR
    17. ….and so much more!
  6. Slice and dice the data by trying out top-10 lists, totals, or segments based on whatever constructs your company has (a short sketch of this kind of roll-up follows this list), such as:
    1. Department
    2. Service
    3. Team
    4. Product Feature
  7. Group the data into themed areas: “oncall”, “production”, “onboarding”, etc. If you have convinced folks to co-present with you, each person can be responsible for presenting a different theme.
  8. Assemble the results into a slide deck with one chart per slide to maximize impact.
  9. Hold the meeting and present your findings.
  10. Discuss! In the meeting, after the meeting, and before the next Year in Review, compare how you interpreted the data with how others did.
  11. Publish the data and your queries so everyone can explore and answer their own questions.
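
As a concrete example of step 6, here is a minimal sketch, assuming a hypothetical deploys.csv export with made-up columns (service, team, deployed_at), that rolls production deploys up by day and segments them by team:

  # A sketch only: assumes a hypothetical deploys.csv export with columns
  # service, team, deployed_at (one row per production deploy).
  import pandas as pd

  deploys = pd.read_csv("deploys.csv", parse_dates=["deployed_at"])

  # Number of production deploys rolled up by day
  deploys_per_day = deploys.set_index("deployed_at").resample("D").size()

  # Segment the same data by a company construct, e.g. team
  deploys_per_team = deploys.groupby("team").size().sort_values(ascending=False)

  # Top 10 most frequently deployed services
  top_services = deploys["service"].value_counts().head(10)

  print(deploys_per_day.tail(7))    # daily deploy counts for the last week
  print(deploys_per_team.head(10))
  print(top_services)

The same load-group-count-sort pattern covers most of the metrics suggested in step 5 as well.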

Parting Thoughts

SREs are uniquely suited to facilitate a Year in Review, bringing both a system-wide perspective on the people, processes, and technology and a mission to improve reliability. Keep in mind that, much like effecting change, hosting a Year in Review is not a solo effort!

Going solo means you will only capture YOUR thoughts, which would almost certainly benefit from being tempered by the unique vantage points of others. The more perspectives you invite, the fuller the story of your system will be.

Please share your favorite data storytelling moments or Year in Review stats with me on Twitter at @alpacatron3000.

Citation

O’Connor, P. (1998). Standards in reliability and safety engineering. Elsevier Science Limited. Retrieved 9 Dec. 2021, from https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X

Notes


  1. Since the SRE field is still getting established outside of Google, I started to read perspectives from Reliability Engineering in other disciplines. A nugget from Patrick O’Connor’s “Standards in reliability and safety engineering” paper sparked a spicy but important revelation about reliability.

    “Those reliability standards which apply mathematical/quantitative methods are also based on the inappropriate application of “scientific” thinking. An engineered system or a component has no intrinsic property of reliability, expressible for example as a failure rate. Truly scientifically based properties of systems and components include mass, power output, etc., and these can therefore be predicted and measured with credibility. However, whether a missile or a microcircuit fails depends upon the quality of the design, production, maintenance and use applied to it. These are human contributions, not “scientific”.”
