Ops and on-call go together like peanut butter and jelly. On-call folds into the batter of our personalities and gives it that signature crunch. It’s the gallows from which our humor hangs.
Taking up the pager is an ops rite of passage, a sign that you are needed and competent. Being on-call is important because it entrusts you with critical systems. Being on-call is the best way to ensure your infrastructure maintains its integrity and stability.
Except that’s bullshit.
The best way to ensure the stability and safety of your systems is to make them self-healing. Machines are best cared for by other machines, and humans are only human. Why waste time with a late night page and the fumblings of a sleep-deprived person when a failure could be corrected automatically? Why make a human push a button when a machine could do it instead?
If a company were truly invested in the integrity of its systems, it would build simple, scalable ones that could be shepherded by such automation. Simple systems are key, because they reduce the possible failure vectors you need to automate against. You can’t slap self-healing scripts onto a spaghetti architecture and expect them to work. The more complex your systems become, the more you need a human to look at the bigger picture and make decisions. Hooking up restart and failover scripts might save you some sleepless nights, but it wouldn’t guard against them entirely.
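To make the "restart script" idea concrete, here is a minimal sketch in Python of the pattern: let the machine try first, and wake a human only when automation runs out of ideas. Everything in it is an assumption for illustration: the function name `try_self_heal`, the three callables it takes, and the default of three restart attempts are all hypothetical, not taken from any particular tool.

```python
from typing import Callable


def try_self_heal(
    check: Callable[[], bool],          # hypothetical health check: True means healthy
    restart: Callable[[], None],        # hypothetical restart or failover action
    page_human: Callable[[str], None],  # hypothetical hook into your paging system
    max_restarts: int = 3,              # assumed limit, tune for your own service
) -> str:
    """Attempt automated recovery first; page a person only if it fails."""
    if check():
        return "healthy"

    for attempt in range(1, max_restarts + 1):
        restart()
        if check():
            # The machine fixed it, so nobody gets woken up.
            return f"recovered after {attempt} restart(s)"

    # Automation is out of options: this is the case that still needs a human.
    page_human(f"still unhealthy after {max_restarts} restarts")
    return "escalated"
```

Note the limit of the sketch, which is the same limit described above: it only covers failures that a restart can fix. The more failure modes your architecture has, the more often you land in that last branch.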
That being said, I’m not aware of any company with such an ideal architecture. So, if not self-healing systems, why not shorter on-call rotations? Or more people on-call at once? After all, going 17 hours without sleep can be equivalent to a blood alcohol concentration of 0.05%, and interrupted sleep causes a marked decline in positive mood. Why trust a single impaired person with the integrity of your system? And why make them responsible for it a week at a time?
Because system integrity is only important when it impacts the bottom line. If a single engineer works herself half-to-death but keeps the lights on, everything is fine.
And from this void, a decades-old culture has arisen.
There is a cult of masochism around on-call, a pride in the pain and in conquering the rotating gauntlet. These martyrs are mostly found in ops teams, who spend sleepless nights patching deploys and rebuilding arrays. It’s expected and almost heralded. Every on-call sysadmin has war stories to share. Calling them war stories is part of the pride.
This is the language of the disenfranchised. This is the reaction of the unappreciated.
On-call is glorified when it’s all you’re allowed to have. And, historically, ops folk are allowed to have very little. Developers are empowered to create and build, while ops engineers are only allowed to maintain and patch. Developers are expected to be smart; ops engineers are expected to be strong.
No wonder so many ops organizations identify with military institutions and use phrases such as “firefighting” to describe their daily grind. No wonder they craft coats of arms and patches and nod to each other with tales of horrendous outages. We redefine what it means to be a hero and we revel in our brave deeds.
But at what cost? Not only do we miss out on life events and much-needed sleep, but we also miss out on career progression. Classic sysadmin tasks are swiftly being automated away, and if you’re only allowed to fix what’s broken, you’ll never get out of that hole. Furthermore, you’ll burn out by bashing yourself against rocks that will never move. No job is worth that.
There is only one real benefit to being on-call: you learn a lot about your systems by watching them break. But if you’re only learning, never building, you and your systems will stagnate.
Consider the following:
- When you get paged, is it a new problem? Do you learn something, or is it the same issue with the same solution you’ve seen half a dozen times?
- When you tell coworkers you were paged last night, how do they react? Do they laugh and share stories, or are they concerned?
- When you tell your manager your on-call shift has been rough, do they try to rotate someone else in? Do they make you feel guilty?
- Is your manager on-call too? Do they cover shifts over holidays or offer to take an override? Do they understand your burden?
It’s possible you’re working in a toxic on-call culture, one that you glorify because it’s what you know. But it doesn’t have to be this way. Gilded self-healing systems aside, there are healthier ways to approach on-call rotations:
- Improve your monitoring and alerting. Only get paged for actionable things in an intelligent way. The Art of Monitoring is a good place to start.
- Have rules in place regarding alert fatigue. The Google SRE book considers more than two pages per 12-hour shift too many.
- Make sure you’re compensated for on-call work, either financially or with time off, and make sure that’s publicly supported by management.
- Put your developers on-call. You’ll be surprised what stops breaking.
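An alert-fatigue rule like the one in the list above is easy to check mechanically once you can export page timestamps. Here is a rough sketch in Python, assuming back-to-back 12-hour shifts counted from a known start time. The function name `shifts_over_budget` and its parameters are hypothetical, and the default of two pages per shift is simply the figure quoted above.

```python
from datetime import datetime, timedelta

MAX_PAGES_PER_SHIFT = 2  # threshold quoted in the list above


def shifts_over_budget(
    pages: list[datetime],        # timestamps of pages, exported from your paging tool
    first_shift_start: datetime,  # assumed: shifts run back-to-back from this moment
    shift_hours: int = 12,
    max_pages: int = MAX_PAGES_PER_SHIFT,
) -> dict[datetime, int]:
    """Return {shift_start: page_count} for every shift that exceeded max_pages."""
    shift = timedelta(hours=shift_hours)
    counts: dict[datetime, int] = {}

    for ts in pages:
        if ts < first_shift_start:
            continue  # ignore pages from before the window we are measuring
        # Floor-dividing two timedeltas gives the index of the shift the page fell in.
        index = (ts - first_shift_start) // shift
        start = first_shift_start + index * shift
        counts[start] = counts.get(start, 0) + 1

    return {start: n for start, n in counts.items() if n > max_pages}
```

Anything this returns is a shift worth raising with your manager, with numbers attached instead of war stories.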
For those of you who read these steps and think, “that’s impossible,” I have one piece of advice: get another job. You are not appreciated where you are and you can do much better.
On-call may be a necessary evil, but it shouldn’t be your whole life. In the age of cloud platforms and infrastructure as code, your worth should be much more than editing Puppet manifests. Ops engineers are intelligent, scrappy, and capable of building great things. You should be allowed to take pride in making yourself and your systems better, and not just stomping out yet another fire.
Work on introducing healthier on-call processes in your company, so you can focus on developing your career and enjoying your life.
In the meantime, there is support for weathering rough rotations. I started a hashtag called #oncallselfie to share the ridiculous circumstances I’m in when paged. There’s also The On-Call Handbook as a primer for on-call shifts and a way to share what you’ve learned along the way. And if you’re burned out, I suggest this article as a good first step toward getting back on your feet.
You’re not alone and you don’t have to be a martyr. Be a real hero and let the pager rest.