December 21, 2019

Day 21 - Being kind to 3am you

By: Katie McLaughlin (@glasnt)
Edited by: Cody Wilbourn (@codywilbourn)

I like sleep.

Sleep is just the best. Waking up when my body is done, my batteries are 100% charged, and I can get going with the best possible start to my day. I can get up, do a bit of exercise, get my coffee, and start my day. I can operate to the best of my ability. I can be productive and happy and get things done.

But when I'm tired. Oh, when I'm tired...

I don't operate well when I'm tired. Being unable to focus or see properly or think straight really inhibits my ability to be productive. My brain doesn't work right, and I just can't even.

So when I get paged, especially at night, I'm tired. I'm not operating at my best. I'm not going to be 100% there. I'm not going to be as quick thinking.

But if I'm paged, stuff be broken, yo; so I have to get in and fix it.

So what I need to do is to set myself up to be the best I can be when -- not if, when -- the pager goes off at 3am.

By "being paged", I mean that my phone has decided to make loud noises in the middle of the night to tell me that something is wrong, because I have previously setup a monitoring system for my servers that tell me if they aren't responding to ping and are immediately offline. Or if they are at high disk space and are at risk of becoming offline. Or if my inbound traffic is triggering autoscaling that isn't able to handle the load.

Alerts that immediately require human intervention.

Even at 3am.

Oh, you have an on call rotation that's not "me, myself, and I"? Or you "follow the sun"? Good for you. Keep doing that. Having the people who are already awake being paged? That's great. A lot of places don't have that luxury.

And it might not be an actual "3am pager". It could be the proverbial 3am page: something will come up when you're not at your best -- when you're sick, tired, or just not with it. These tips and tricks can help you when you're not 100%, and you can use the time when you are 100% to feed improvements back into this system.

These are all based on my personal experiences of being the gal with the pager for years, across many environments, many roles, and many companies. From co-location, to web hosting, to machine learning pipelines, to platforms as a service. I have just a bit of experience at being awake at 3am in front of a laptop because an evil little app on my mobile has woken me up.

The Essentials.

Basically? Documentation. You should really write something down.

Documentation scales. Documentation is there when none of your coworkers are. Documentation is there when your senior database administrator is on leave. Documentation is there after your contractors' engagements have ended.

I mean, it's useful to have documentation, sure. But where to have that documentation is also important to consider.

You could go whole hog and do a full knowledge base and documentation management system, but that all requires buy-in and resources. Sometimes, you can't get that.

And what you really want is a good night's sleep. You just want to throw some notes down somewhere.

The tool you use to write things down could be anything: a Google Doc, OneNote, Emacs, Vim, VSCode... But consider where these docs live. Make sure that everyone on your on-call rotation can access your Google Drive or OneDrive. Or consider putting the docs closer to your working space: the wiki on your GitHub repo, or, say, under your username on your company's Confluence. Or even a text file in a private project repo, in a pinch.

But, where you put these notes needs to have some important features.

Editable. It needs to be editable. Yes, sure, but editable means something important here. Wherever you're storing this stuff, it needs to be easy to edit. You need to be able to readily add new information and remove out-of-date information. This might seem obvious, but it's such an important feature. Without it, you could be stuck with, say, a "documentation repo" where content needs approval before it can be merged, and that is a huge blocker.

Searchable. Paper doesn't work here. You need something you can Control-F at 3am, wherever you are in the world. This is especially useful for the road warriors, the sysadmins who travel while on call. What I've also found super useful is creating a custom search engine. Having a keyword I can throw into a new tab of my browser to search my notes wiki is so helpful. I personally have custom search keywords for a number of services like Wikipedia and Twitter, as well as a keyword for my company wiki and another for the code store. If I come across a problem that's not documented in either the wiki or the code, then it's probably something third party and I have to search the public internet, or worse: the knowledge is trapped inside someone's head. At 3am, information in someone else's head is useless.
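
Most browsers support this natively: in Firefox it's a bookmark with a keyword, in Chrome it's a custom search engine, and in both the %s in the URL is replaced with whatever you type after the keyword. A minimal sketch, assuming a hypothetical wiki search URL:

    Keyword:  wiki
    URL:      https://wiki.example.com/search?q=%s
    Usage:    type "wiki swap alert debugging" into the address bar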

Discoverable. This is where having a wiki excels. A web-based system means that your coworkers can see it too. They can also use your custom search engines to find your notes, and perhaps collaborate and help improve them!

Access control. Consider that you probably don't want your internal Terraform docs in your public GitHub repo. Your sysadmins who have the credentials to provision should probably be able to see those docs, especially the docs on where to find that magic SSH key that lets Ansible deploy. This is going to be highly context specific, but it's probably a good idea to keep this within your company's authentication barrier (or "firewall", if you still have one of those).

But the question is, what do you write down? What do you want to be able to discover at 3am?

Well, this is really going to depend on your environment.

Are you working in a Docker shop? Kubernetes? Lots of networking? What is super useful at 3am is sharpened tools: commands with all those strange flags, the more esoteric actions, inspection scripts. Things that aren't aliased (though listing your aliases in your docs, in case you forget what you have around, is super useful too). Having a copy-pasteable command that does something like: show me all the load balancers with high memory usage, display the Docker containers with high CPU, show me the pod balance across the region.
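
As a sketch, a couple of entries from a hypothetical Docker and Kubernetes cache might look like this (illustrative only; adjust for your own environment):

    # Docker containers by CPU usage, hungriest first
    docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}' | sort -t$'\t' -k2 -rn

    # Pod count per node, across all namespaces
    kubectl get pods --all-namespaces -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn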

Ensure these stored tools are easily usable. There's nothing worse than leading dollar signs or non-obvious environment variables in your stored commands that mean you have to stop and edit them before they can be used.

Make sure these commands are safe. Don't put any destructive commands in these caches unless they're clearly marked as such and have big red warning signs.

Especially avoid destructive chains of commands that start with basic search commands. For instance, show me all the docker containers... then delete them all.
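
If a chain like that genuinely earns a place in your cache, label it loudly. A sketch of that Docker example, marked the way it should be:

    ### DANGER: DESTRUCTIVE. Force-removes ALL containers, running or stopped. ###
    ### Do not run unless a full container wipe is the agreed fix.             ###
    docker ps -aq | xargs docker rm -f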

You want to make sure that 3am you isn't blindly using a tool that's going to make things worse.


Stepping up.

So now that you have the basics, you need to think about how they evolve. As you step up this repository of useful hints, how can you make it work for you?

Again, your mileage may vary, but I can offer advice for what I've seen work. I'm a sysadmin, I'm not your sysadmin.

Integration. Integration is a big step. And so, so useful. Having your personal cache graduate into an "SRE Tips" page that appears on the home page for your on-call rotation information. Having it linked in the channel topic of your firehose chat channel. Making it readily available as well as useful.

Templates. Templates are great. When you have large repetitive tasks that also need custom care and attention to detail (be it new physical server deployment, or new client on-boarding), turn them into a template that you can copy each time. Even something as simple as making sure you link to the AWS EC2 search for the name of the server, and then any custom notes. Especially if one of these widgets has custom configuration outside of your provisioning automation that could be overwritten if you aren't careful (ask me how I know!).
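
As a sketch, a new-server template (names and URL pattern hypothetical) can be as plain as a checklist you copy for each deployment:

    New server: <hostname>
    [ ] EC2 console: https://console.aws.amazon.com/ec2/v2/home#Instances:search=<hostname>
    [ ] Added to monitoring
    [ ] Added to the backup schedule
    [ ] Custom notes (anything living outside the provisioning automation goes HERE)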

Contextual Integration. Another big bonus is not just linking to the cache, but having it contextual. One fleet I maintained had a lot of different machines across different operating systems and virtualisation types. Physical machines, VMware, KVM, Xen; Linux, Windows, different versions of those in between. And depending on the service or the server that was having the issue there would be a link on the Nagios alert to the documentation for that particular service or server. This meant having a swap alert on a Linux box would immediately show the sysadmin on call a link to the basic debugging for that service. If there was a listing for the service specifically for that server, that would be shown instead. This was incredibly useful for those pesky machines that were notorious for having memory leaks or other bugbear issues.
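
Nagios supports this out of the box: host and service definitions take a notes_url, which appears as a link next to the alert in the web UI and can include macros. A minimal sketch, assuming a check_swap command definition along the lines of the sample configs and a hypothetical wiki layout:

    define service {
        use                  generic-service
        host_name            web01
        service_description  SWAP
        check_command        check_swap!20%!10%
        notes_url            https://wiki.example.com/runbooks/$HOSTNAME$/$SERVICEDESC$
    }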

Post mortems. When there is an issue, fold some of the debugging steps that were used into the useful tips doc. This could be something as simple as saving a copy of a sanitized bash history somewhere, but it is so very, very useful when a senior SRE can show a junior which of those sharpened tools they used.
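
Grabbing that history can be a one-liner: strip the history line numbers so it pastes clean, then hand-edit anything sensitive before you save it (the filename is just an example):

    # Last 50 commands, without the leading history numbers
    history | tail -n 50 | sed 's/^ *[0-9]* *//' > incident-notes.txt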

Which, in itself, brings me to the third major point.


The Feedback Loop.

These caches don't just appear overnight. They evolve over time as they are used -- and are useful -- for those on call. Having a cache of the flags for an esoteric CLI is one thing, but having a well-oiled bag of tricks is another.

That goes double for recurring issues.

Now, this is different from one-off things, and I want to focus on this for a moment.

In an ideal world, no issue should happen more than once, because, hey, all problems are immediately fixed by the on-call engineer and will never happen again, right?

For anyone who has ever worked in operations for any period of time (or dev, for that matter), you know there's always a compromise between workarounds and root-cause fixes. That server that keeps alerting due to critical disk space usage? Is it easier to occasionally clean up the old web server logs, or to set up a scheduled task that archives logs older than a week?
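
The workaround half of that trade-off can be as small as a single cron entry. A sketch, with hypothetical paths, that simply deletes rotated logs rather than properly archiving them:

    # crontab: every day at 04:00, remove rotated web server logs older than a week
    0 4 * * * find /var/log/nginx -name '*.log.*.gz' -mtime +7 -delete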

The engineering time to create such a script that's appropriate for the environment is non-trivial if it doesn't already exist, especially when considerations like data retention or GDPR come into play and affect the implementation.

So, sometimes it's easier to, say, change monitoring to soft-alert at 10% disk space free so an engineer can clean things up during the day, as opposed to waking the engineer at 3am with a critical 5%-free alert that would result in the same action.
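
With the standard Nagios check_disk plugin, that tuning is just the warning and critical thresholds (mount point hypothetical):

    # Warn (daytime follow-up) at 10% free, go critical (page) at 5% free
    check_disk -w 10% -c 5% -p /var/www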

Tuning alerts and actions for recurring events is absolutely valid, even in cases where "Yes, we'll fix that Soon™️".

You can even start applying automation to these manual functions. Something simple like adding a for-loop to the start of a command to apply it to many servers. Or making that for-loop smarter by turning it into an Ansible playbook that can check for properties on the server before applying commands. Taking the commands in your bag of tricks and turning them into cron jobs, or somesuch.
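
As a sketch, that evolution can start by wrapping the cleanup from earlier in a loop over your fleet (hostnames hypothetical), before graduating to a playbook that checks facts about each host first:

    # The same log cleanup, applied across several servers
    for host in web01 web02 web03; do
        ssh "$host" "find /var/log/nginx -name '*.log.*.gz' -mtime +7 -delete"
    done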

Again, it's going to depend on your environment, both machine and people.

But in all of this, the biggest thing that I can suggest: turn this into a learning opportunity for the members of your team. This feeds back into the discoverability and feedback loop steps, but also makes sure that juniors or other team members "learn the weather".

If you have a junior who's starting to shadow your on-call rotation, show them the iffy machines; give them a chance to debug things themselves, but work with them to solve issues in a timely manner. Make sure any of those "we'll fix that soon" items are noted, or even better: the alerts are modified, for your junior's sake. This is so, so important so that people know what to expect.

And when you finally get to fixing things, please make sure you communicate it. Having one sysadmin, or heaven forbid a BOFH, be the only one who knows the temperament of your system doesn't scale. There's a certain joy when your entire on-call rotation are pseudo-meteorologists who can just sorta *tell* what might be going on.

While it might be great that things finally get fixed, that those alerts go away, I've been here long enough to know it's not always that simple. Infrastructure changes almost always cause other issues down the line. Make sure you communicate these with your team, and in your documentation cache.

Bonus points: if you're going to be personally making big changes that have even the slightest chance of raising alerts, take the pager. Please. Especially if it's during your normal working hours and you're not already on call.

Remember that you should be reviewing this data and deprecating it when required. Having a cache of information that is out of date isn't just annoying, it can be catastrophic. The on-call engineer who finds the old fix-everything command -- the one that used to be the silver bullet and is now the WORST THING TO RUN -- should not be finding it in their search results. Deprecating content could be as simple as moving it into a cache that's *not* searchable via your main search keyword, but still keeping it around in a secondary system.

Or, once considered in the light of day, deleting it entirely.

Having a lack of information at 3am is bad. Having actively harmful information is worse.


Empathy.

Because at the end of the day, empathy is critical.

Showing empathy for your fellow engineer, who is going to be thankful for that full night's sleep, is paid back in kind.

Showing self-care by giving yourself the tools to help you get your job done so you can go back to counting sheep.

Making sure your junior or new on-call engineers don't freak out in the middle of the night because you left them a note about that upgrade, so those new errors they're seeing are totally okay (well, not okay, but not unexpected).

Thinking more about how fewer pages make everyone sleep easier, and what can be done to achieve that.

Working in a team is hard, but as soon as you start expecting work out of hours, especially when on-call is involved, practicing explicit empathy makes things so much easier for everyone involved.

Get started now.

You're not your best when you're tired, but you'll do your future self a favour by starting your bag of tricks today. A sanitised bash history, an odd command here and there, just start somewhere. Evolve it, and it'll help you on those early morning calls so you can get back to sleep.
