December 6, 2016

Day 6 - No More On-Call Martyrs

Written By: Alice Goldfuss (@alicegoldfuss)
Edited By: Justin Garrison (@rothgar)

Ops and on-call go together like peanut butter and jelly. It folds into the batter of our personalities and gives it that signature crunch. It’s the gallows from which our humor hangs.

Taking up the pager is an ops rite-of-passage, a sign that you are needed and competent. Being on-call is important because it entrusts you with critical systems. Being on-call is the best way to ensure your infrastructure maintains its integrity and stability.

Except that’s bullshit.

The best way to ensure the stability and safety of your systems is to make them self-healing. Machines are best cared for by other machines, and humans are only human. Why waste time with a late night page and the fumblings of a sleep-deprived person when a failure could be corrected automatically? Why make a human push a button when a machine could do it instead?

If a company were truly invested in the integrity of its systems, it would build simple, scalable ones that could be shepherded by such automation. Simple systems are key, because they reduce the possible failure vectors you need to automate against. You can’t slap self-healing scripts onto a spaghetti architecture and expect them to work. The more complex your systems become, the more you need a human to look at the bigger picture and make decisions. Hooking up restart and failover scripts might save you some sleepless nights, but it wouldn’t guard against them entirely.

That being said, I’m not aware of any company with such an ideal architecture. So, if not self-healing systems, why not shorter on-call rotations? Or more people on-call at once? After all, going 17 hours without sleep can be equivalent to a blood alcohol concentration of 0.05%, and interrupted sleep causes a marked decline in positive mood. Why trust a single impaired person with the integrity of your system? And why make them responsible for it a week at a time?

Because system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.

And from this void, a decades-old culture has arisen.

There is a cult of masochism around on-call, a pride in the pain and in conquering the rotating gauntlet. These martyrs are mostly found in ops teams, who spend sleepless nights patching deploys and rebuilding arrays. It’s expected and almost heralded. Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride.

This is the language of the disenfranchised. This is the reaction of the unappreciated.

On-call is glorified when it’s all you’re allowed to have. And, historically, ops folk are allowed to have very little. Developers are empowered to create and build, while ops engineers are only allowed to maintain and patch. Developers are expected to be smart; ops engineers are expected to be strong.

No wonder so many ops organizations identify with military institutions and use phrases such as “firefighting” to describe their daily grind. No wonder they craft coats of arms and patches and nod to each other with tales of horrendous outages. We redefine what it means to be a hero and we revel in our brave deeds.

But, at what cost? Not only do we miss out on life events and much-needed sleep, but we also miss out on career progression. Classic sysadmin tasks are swiftly being automated away, and if you’re only allowed to fix what’s broken, you’ll never get out of that hole. Furthermore, you’ll burn out by bashing yourself against rocks that will never move. No job is worth that.

There is only one real benefit to being on-call: you learn a lot about your systems by watching them break. But if you’re only learning, never building, you and your systems will stagnate.

Consider the following:

  1. When you get paged, is it a new problem? Do you learn something, or is it the same issue with the same solution you’ve seen half a dozen times?
  2. When you tell coworkers you were paged last night, how do they react? Do they laugh and share stories, or are they concerned?
  3. When you tell your manager your on-call shift has been rough, do they try to rotate someone else in? Do they make you feel guilty?
  4. Is your manager on-call too? Do they cover shifts over holidays or offer to take an override? Do they understand your burden?

It’s possible you’re working in a toxic on-call culture, one that you glorify because it’s what you know. But it doesn’t have to be this way. Gilded self-healing systems aside, there are healthier ways to approach on-call rotations:

  1. Improve your monitoring and alerting. Only get paged for actionable things in an intelligent way. The Art of Monitoring is a good place to start.
  2. Have rules in place regarding alert fatigue. The Google SRE book considers more than two pages per 12-hour shift too many.
  3. Make sure you’re compensated for on-call work, either financially or with time-off, and make sure that’s publicly supported by management.
  4. Put your developers on-call. You’ll be surprised what stops breaking.
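
As a rough illustration of the alert-fatigue rule in step 2, here is a minimal Python sketch that flags shifts exceeding two pages. It assumes fixed 12-hour shifts counted from a known start time and a plain list of page timestamps; the function name `noisy_shifts` and the constants are hypothetical, not taken from any particular paging tool.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumption: shifts are fixed 12-hour windows counted from first_shift_start.
SHIFT_LENGTH = timedelta(hours=12)
# Threshold from the rule of thumb cited above: more than two pages is too many.
MAX_PAGES_PER_SHIFT = 2


def noisy_shifts(page_times, first_shift_start):
    """Return {shift_start: page_count} for shifts over the page threshold."""
    counts = Counter()
    for t in page_times:
        # Integer number of whole shifts elapsed since the first shift began.
        index = (t - first_shift_start) // SHIFT_LENGTH
        counts[first_shift_start + index * SHIFT_LENGTH] += 1
    return {
        start: n
        for start, n in sorted(counts.items())
        if n > MAX_PAGES_PER_SHIFT
    }


if __name__ == "__main__":
    start = datetime(2016, 12, 6, 8, 0)
    # Hypothetical pages at 1, 2, 3, 13 and 30 hours after the first shift began.
    pages = [start + timedelta(hours=h) for h in (1, 2, 3, 13, 30)]
    for shift_start, n in noisy_shifts(pages, start).items():
        print(f"{shift_start:%Y-%m-%d %H:%M}: {n} pages")
```

Feeding something like this from your paging history gives you a concrete list of shifts to bring to your manager, rather than a general feeling that the rotation is rough.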

For those of you who read these steps and think, “that’s impossible,” I have one piece of advice: get another job. You are not appreciated where you are and you can do much better.

On-call may be a necessary evil, but it shouldn’t be your whole life. In the age of cloud platforms and infrastructure as code, your worth should be much more than editing Puppet manifests. Ops engineers are intelligent, scrappy, and capable of building great things. You should be allowed to take pride in making yourself and your systems better, and not just stomping out yet another fire.

Work on introducing healthier on-call processes in your company, so you can focus on developing your career and enjoying your life.

In the meantime, there is support for weathering rough rotations. I started a hashtag called #oncallselfie to share the ridiculous circumstances I’m in when paged. There’s also The On-Call Handbook as a primer for on-call shifts and a way to share what you’ve learned along the way. And if you’re burned out, I suggest this article as a good first step toward getting back on your feet.

You’re not alone and you don’t have to be a martyr. Be a real hero and let the pager rest.


Mark McCullough said...

Giving your manager an account on the servers should be a disciplinable offense. Their job is not to do your job, but to empower you to do yours, obtaining budget, additional project time, training, resources, etc. So of course your manager should not be on-call in the rotation. They get paged out when the on-call fails to call in, or when that outage turns into a management issue and they are needed to handle the reporting to upper management so you can resolve the technical problem.

Brian Bilbrey said...

That all rings true, Alice. The good news at my firm is that we're pretty proactive about dealing with the root causes of pages, and trying to ensure that we automate away whatever we can. That's true both for technical incidents and for automating and tooling as many of our other tasks as possible. That last is important for us because the automation removes many sources of mistakes, which often seemed to manifest in pages in the wee hours.

Thanks for taking the time to write the sixth day of sysadvent!


Chris Short said...

Thank you for writing this. I have worked at places where the whole development team was too good to be on-call. They've all been outsourced along with the Ops team that worked with them because of their demands.

Matt Siegel said...


First, thank you for an excellent article. Your ongoing tweets and this unvarnished assessment provide a clear view of the reality of ops on-call, as currently practiced.

I'll admit to having very limited knowledge of ops, but great interest since it's vitally important to any service.

To continue discussion about possible improvements, may I ask your advice about an unconventional direction that might fulfill some of your (I believe reasonable) requirements:

1. No distinction between engineers and operations personnel; each person *is* an engineer.
2. During a shift, an engineer is programming or doing other ordinary non-emergency work, unless paged.
3. Shifts are no longer than 9 hours.
4. Shifts are scheduled for an engineer's normal work hours.
5. Engineers are *physically* located in different, appropriate timezones to enable 24/7 responsiveness while maintaining normal work hours.

I realize this is unusual, and I'd love to hear any problems you see; or obviously, ideas for better overall strategy.

Indebted for your outstanding efforts to document and analyze,

Unknown said...

My manager is on a primary on-call rotation and it's fantastic. It ensures our runbooks are solid, helps us escalate important issues (especially outside our own group's control because our manager has connections to other managers), and forces an understanding of exactly what we're working on. That said, my manager has a technical background. Don't knock it if you haven't tried it; you may find it has hidden benefits.