By: Matt Stratton (@mattstratton)
Edited By: Rhommel Lamas (@rhoml)
On-call is a fact of life for many of us. If you’re responsible for production systems, carrying a pager (although most likely a virtual one, these days) is part of what we do. Being available to help with production issues might be unavoidable, but that doesn’t mean we can’t find ways to make it more enjoyable. Here are 15 different tips and tricks (some more helpful than others) that can take the sting out of your on-call rotation.
15. Have a fun alert sound
Don’t stick with the boring “alarm” sounds that your phone or on-call application provides. Mix it up! Perhaps “I Will Survive” or “Boulevard of Broken Dreams”. (All kidding aside, it’s really good for your self-care to rotate your on-call notification tone regularly. This helps prevent a negative physiological response to hearing the same tone every time you get paged.)
14. Save up your binge watching
Want a reason to look forward to your next on-call shift? Save up all those episodes of The Mandalorian to watch during your rotation, and you’ll be happy to go on-call. Just don’t let Baby Yoda distract you from troubleshooting that troublesome Kubernetes pod.
13. Replace your pager with a confetti cannon
This great idea comes from @table_delete. If every time you get alerted, your room is covered with confetti, how much more fun could you be having? You really can’t have more fun than a confetti cannon. This is proven by science.
12. Practice makes permanent
It’s quite possible to go a long time between incidents, depending upon your application and the size of your on-call rotation. You don’t want to waste time during an incident trying to remember how to log into PagerDuty, or which SSH key to use to log into production. This is why practicing incident response is important - game days are great! They can be associated with your team’s chaos engineering practices or just regular drills. The key is to practice doing incident response in a calm and safe place.
11. Pipe your alerts through ponysay
Nobody likes getting an alert, but if it can come from an adorable pony, that makes it a lot more bearable.
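If you want to try this at home, here is a minimal sketch in Python. It assumes the `ponysay` command is installed and on your PATH, and that it reads the message from standard input. The `render_alert` helper and the sample alert text are made up for illustration; how you get the alert text out of your own monitoring system is up to you.

```python
import shutil
import subprocess


def render_alert(message: str, binary: str = "ponysay") -> str:
    """Return the alert text, wrapped in a pony when ponysay is available."""
    path = shutil.which(binary)
    if path is None:
        # No pony installed: fall back to the plain alert text.
        return message
    result = subprocess.run(
        [path], input=message, capture_output=True, text=True, check=False
    )
    # If ponysay fails for any reason, the alert still has to get through.
    if result.returncode == 0 and result.stdout:
        return result.stdout
    return message


if __name__ == "__main__":
    # Hypothetical alert text, purely for illustration.
    print(render_alert("CRITICAL: disk usage at 95% on db-01"))
```

The fallback matters more than the pony: an alert that silently disappears because a cosmetic tool is missing is a much worse incident than a boring one.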
10. Add more single points of failure
This tip comes courtesy of @TheJewberwocky on Twitter. There’s nothing more disappointing than a simple on-call shift. Add some more challenges by designing your infrastructure to have multiple failure possibilities, just to spice things up. It will keep you on your toes and add some zest to the experience.
9. Dedicate an on-call hoodie
This might sound silly, but it can be fun to pick out a hoodie that you only wear during your on-call shift. This helps put you in the right frame of mind when you put it on, but the best part is that when your rotation is over, you get to take it off. Seriously, give it a try.
8. Treat incidents as a gift
In many ways, incidents really are a gift - they are a way for our systems to tell us something we didn’t already know. You can take this to the next level by connecting your incident response system to your Amazon wishlist, and your wishlist to your boss’s corporate credit card - for every resolved incident, something on your wishlist is shipped to you!
(Note - You are responsible for determining the impacts of credit card fraud in your local jurisdiction. I am not a lawyer.)
7. Add the element of surprise
On-call overrides are a fact of life. Make them more fun by writing a simple script that uses a random number generator to pick someone from a list of every employee on your team, and re-assigns the override in PagerDuty to that person. It’s chaos engineering for on-call!
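A rough sketch of the random part, in Python, might look like the following. The team names, the four-hour window, and the `pick_override` and `build_override` helpers are all invented for illustration, and the sketch deliberately stops short of actually sending anything to PagerDuty. That last step is left as an exercise for the reader (and their manager).

```python
import random
from datetime import datetime, timedelta, timezone


def pick_override(team, current_on_call, rng=random):
    """Pick a random teammate other than the person currently on call."""
    candidates = [person for person in team if person != current_on_call]
    if not candidates:
        raise ValueError("need at least one other person to surprise")
    return rng.choice(candidates)


def build_override(person, start, hours=4):
    """Describe the override window; sending it to PagerDuty is left out."""
    end = start + timedelta(hours=hours)
    return {"user": person, "start": start.isoformat(), "end": end.isoformat()}


if __name__ == "__main__":
    # Hypothetical team roster, purely for illustration.
    team = ["alice", "bob", "carol"]
    victim = pick_override(team, current_on_call="alice")
    print(build_override(victim, datetime.now(timezone.utc)))
```

Passing in a seeded `random.Random` as `rng` makes the surprise reproducible, which somewhat defeats the purpose but is handy for testing.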
6. Schedule a fun dinner
Do you have a favorite restaurant that you never get to go to? Make a reservation during your on-call shift! Yes, you run the risk of your awesome meal being interrupted due to MySQL errors, but it gives you a reason to enjoy your rotation. Just make sure the restaurant has good wifi and to-go boxes.
5. Stop caring
This might be career-limiting advice, but according to @101010Lund on Twitter, if you decide to not care about doing a good job, or stop worrying about the well-being of your systems, you might have more “fun” during on-call.
You might not have so much fun in your next 1:1 meeting with your boss, but that’s a different article.
4. Have a good incident response process
And by this, I don’t mean a process that is “Step 1. Page someone. Step 2. Fix the broken thing”. That’s not a process.
Everyone’s process is different, and that’s okay, because everyone’s organization is different. But having a clear process for incident response makes it a LOT less stressful when you actually do get paged. I’m a big fan of PagerDuty’s Incident Response Guide (disclaimer: I am a PagerDuty employee, but that guide is product-agnostic). Whatever process you follow, make sure you practice it, and keep everyone up to date on any changes.
3. Assign someone to run interference on executives
There’s nothing more distracting or annoying during an incident than when your CIO is constantly jumping onto the call to ask for status updates or to offer “helpful” motivation like “I want this fixed in five minutes!” One tip is to have a buddy whose job is to distract executives while everyone else is working the incident. This takes skill, and probably some knowledge of your executives’ hobbies. But having someone chatting with your boss about their most recent pinball tournament win, or getting tips on how to find the best dog boarding kennels, can really help the others on the team focus on resolving the incident.
(Note: your mileage with distracting executives may vary, but keeping stakeholders up to date outside of the response call is actually a helpful technique!)
2. Take an override after a tough incident
This really needs to be a team policy/approach, but it’s a fine idea to have the pattern that if someone has to deal with an incident, someone else on the team will take over on-call as an override for the rest of that person’s shift. Granted, this requires having an on-call rotation of more than one person. And you may want to define within your team what threshold triggers this (for example, incidents lasting more than 30 minutes).
1. Have a venting solution
Finally, it’s important to have a way to blow off steam while dealing with a stressful incident. Maybe your company has a #YELLING channel in Slack (if you don’t, you should!). You can also feel free to make judicious use of social media - it’s a little-known fact, but Twitter was invented to provide a channel for grumpy sysadmins to complain during incidents. Just be cautious about providing specifics - I recommend replacing your company name with “The Bluth Company” and the names of your systems with Firefly characters. An example tweet might be: “The Bluth Company is having a major issue because Malcolm Reynolds has exhausted all resources on the River Tam cluster”.
People will just think you’re writing some terrible fanfiction. It’ll work out great.
Whatever techniques you use to help your mental state during on-call, I hope that your outages are few, your incidents are manageable, and that your postmortems are blameless. Happy holidays!