December 15, 2015

Day 15 - Fear and Loathing in Systems Administration

Written by H. “Waldo” Grunenwald (@gwaldo)
Edited by Shaun Mouton (@sdmouton)

“DevOps Doesn’t Work”

The number of times that I’ve heard this is amazing. The best thing about this phrase is that the people who say it are often completely right, even if for very wrong reasons.

Who Says This?

Well, let’s talk about the people who most commonly have this reaction: SysAdmins. I’m going to use the term “SysAdmins” as a shorthand for a broad group. The people in this group have widely varying titles, but it is most commonly “Systems”, “Network”, or “Operations” followed by “Administrator”, “Engineer”, “Technician”, or “Analyst”.

In some companies, these folks have the best computers in the place. In others, they have to live with the worst. Their workspace probably isn’t very nice, and almost certainly has no natural light. If there is a pager rotation, they are almost certainly on it. If there isn’t a rotation, they’re basically on-call all of the time.

During the course of a normal day they might have to switch contexts between disaster planning, calculating power and HVAC needs for a new datacenter, scrambling to complete an outage-driven SAN migration, rushing to address urgent requests to help people with their email, to troubleshoot a printing problem, or suss out why someone can’t get to their electric company’s bill pay website. They may be the sole person with database expertise in the company, or they may work on a team of dozens.

The work is largely invisible except when something fails, in which case it’s highly visible and widely impacting.

These are typically cynical people, because there are only so many times that you can miss the team/department/company party for ostensibly “celebrating our successes” because something’s broken, and you’re left to clean up after the “success”. There are only so many times that one can watch a new project get announced and the company begin to hire more people for it. When asked who’s going to support the new project, the response is a blank look and “you are”. The “…of course.” may not be vocalized, but it’s probably there. When asked how many people they get to hire to help with the workload, the response is a combination of “sorry, but there wasn’t anything left in the budget”, “it won’t be that much more work”, or a variation of the “team player / good soldier” speech. There are only so many times one can take having requests for training or conference budget rejected out of hand, or laughed out of the room.

They probably have basic working knowledge of a half-dozen programming languages, but most likely they often think in Shell. They probably know at least three ways of testing if a port is open, and probably have a soft spot in their heart for a couple of shell commands.

They may have seen or participated in a DevOps initiative that consisted of a team or position rename, or helpfully suggested that they install some Config Management and Monitoring software so that “we can DevOps now…” or “so we can do Agile”. When they hear “DevOps” or “Agile”, what they are hearing is: “Let’s take the same people who can’t handle a planned release schedule or make whatever effort they need to squeak by the Change Board and Release Management requirements, and give them unfettered access to Production. Clearly, I’m not paged often enough.”

So what is one to do? How is one to maintain their sanity in the face of increasing job scope, increasing demand for access and velocity, and little hope for an effective new-hire count? Not to mention continuing to juggle the existing volume of requests, and continuing to grease the existing gears to keep the machine running.

Get Help

Please note that I’m not saying “just”. There’s nothing just about this situation; there is nothing simple about any of this, and Justice hasn’t been seen in a long time in an environment where this is the norm. Most of these changes are difficult. They will take work, and will require convincing other teams to join in your cause.

Admitting you have a Problem

The problem (probably) isn’t technical. It’s almost entirely social.

Because SysAdmins are typically responsible for the environment, the easiest way to assure that the state is stable is to lock everyone else out of it. While this helps with the goal of “keeping out unexpected changes”, it has a number of side effects.

First, a kind of learned helplessness has set in. Your customers and teammates have become so used to being “hands-off” that they don’t have the wherewithal to meet reasonable expectations. Since they’re uncomfortable making any changes, all changes must be made by the SysAdmins. This leads to your time being taken up by lots of low-value tasks.

Some teams settle on the pattern of “hands off Production, but you have access to Staging”, but this is fraught with peril. The most common problem that stems from this is “Configuration Drift.” Config Drift is when you have different settings in one environment (or server) than the others. When the cost to discover what Production looks like is high, it’s more likely that people will either use defaults, make assumptions, or use the same configs that they use in their IDEs. “Works on my machine”, indeed.

This is a problem well-solved by Configuration Management tools, but you still need to be willing to trust your peers and give them access. If you want to be part of the process of validating changes, you could put in place the structures that allow a pull-request and code-review workflow, something that your Software Engineering peers should be very accustomed to! Granting access to see the existing configs and the ability to propose changes also shares responsibility for your team’s environments and contributes to feelings of ownership. Denying colleagues the ability to effect necessary configuration changes contributes to the root problems of configuration drift and learned helplessness.
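To make drift concrete, here is a minimal sketch in Python that compares the settings of two environments and reports what differs. The setting names and values are made up for illustration; in practice your Configuration Management tool would be the source of this data, and it would likely do this comparison for you.

```python
ABSENT = "<absent>"  # marker for a setting that one environment lacks entirely


def find_drift(reference, candidate):
    """Return {setting: (reference_value, candidate_value)} for every difference."""
    drift = {}
    for key in sorted(set(reference) | set(candidate)):
        ref_value = reference.get(key, ABSENT)
        cand_value = candidate.get(key, ABSENT)
        if ref_value != cand_value:
            drift[key] = (ref_value, cand_value)
    return drift


# Hypothetical settings, invented for this example.
production = {"max_connections": 500, "log_level": "warn", "tls": True}
staging = {"max_connections": 100, "log_level": "warn", "debug": True}

for key, (prod, stage) in find_drift(production, staging).items():
    print(f"{key}: production={prod!r} staging={stage!r}")
```

If people can see this kind of report themselves, the cost of discovering what Production looks like drops, and so does the temptation to guess.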

Stop Feeding the Machine

Your value is not in doing the work, but rather being able to make the decision to do the work.

I’ll be the first to say that “Automating ALL THE THINGS” is a flawed goal. At work, it’s usually said in the context of a Project, rather than part of a philosophy of Continual Improvement (Think Toyota Kata). You shouldn’t have to engage in an “Automation Project” to improve your environment. Build into your schedule time to solve one problem. Pick something that is rough, manual, and repeatable. Remove a small piece of friction. Move on to the next one. Hint: Logging into a server to make a configuration change should be a cue to implement configuration management!

While I agree that everything should be automated, not everything should be automatic. Decision-making is complex, and attempting to codify all of the possible decision-making points is a fantastic way to make yourself insane. Not to mention that documenting your decision-making processes may be an unwanted look inside your brain. Caveat Implementor. (Or perhaps that’s just me…)

All of the units of work should be automated. But the decision to run the now-automated tasks can be left to a human. When you find that there is a correlation between steps, those pieces should be wrapped together. Automation isn’t a project unto itself. It should be iterative. Pick something that’s painful. Make it a little smoother. Repeat. Ideally, you have time blocked out for Continuous Improvement. If not, create a meeting, or create a weekly project to do so. Review the issues that you’ve experienced lately, and pick something to make better. It might be worth making into a project, but it won’t be an ALL-THE-THINGS project. Create a scope of effort. Take the time to plan goals and requirements.
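One way to picture “automated, but not automatic” is a small catalog of scripted tasks where a human still makes the call to run them. This is only a sketch: the task names and the `approved` flag are hypothetical stand-ins for whatever prompt, button, or chat command your team would actually use.

```python
from typing import Callable, Dict, List


# Each unit of work is automated as a plain function (hypothetical examples).
def drain_node() -> str:
    return "node drained"


def apply_patches() -> str:
    return "patches applied"


def return_node() -> str:
    return "node back in rotation"


TASKS: Dict[str, Callable[[], str]] = {
    "drain-node": drain_node,
    "apply-patches": apply_patches,
    "return-node": return_node,
}

# Steps that always travel together get wrapped into one named sequence.
SEQUENCES: Dict[str, List[str]] = {
    "patch-node": ["drain-node", "apply-patches", "return-node"],
}


def run(name: str, approved: bool) -> List[str]:
    """Run a task or a sequence, but only if a human approved it."""
    if not approved:
        return []  # the decision stays with a person; nothing runs by itself
    steps = SEQUENCES.get(name, [name])
    return [TASKS[step]() for step in steps]


print(run("patch-node", approved=True))
```

The work is scripted end to end, yet nothing happens until someone decides it should. Adding the next painful task is one more entry in the catalog, not a new project.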

Whatever you don’t automate must be documented. Beyond the typical benefits of documentation, it also serves as “Functional Requirements” for someone else to pick up when they can help you with providing a solution. Try to recognize whether documenting or automating takes longer. Perhaps this piece of documentation will be better served by “Executable Documentation” (i.e. code).

Clarify Your Role

You should attempt to pick apart the parts of your work, and attempt to describe them. One way to make this a fun exercise is to use other job titles to describe the work.

Are you an “Internet Plumber”? How much of your job could be described as “Spelunking” into the deep dark caverns of Legacy systems?

If you want, you could ascribe Superhero names to these parts of your work. The added bonus is that it not only describes a role, but also a demeanor associated with them. When ‘bad code’ makes it to Production, do you go “Wolverine” on that dev team?

Could you describe part of your role as “Production Customs Official”? Are you the gateway to Production? If so, are you actually equipped to do that? Here’s a quick test: When you say “no, that can’t go”, do you get overridden?

More importantly, is this what you want to do?

Prepwork

You will need to prepare for this. Most SysAdmin teams do not have a healthy relationship with the rest of the business. You will need to initiate the healing.

Take someone to lunch. Preferably someone who you don’t know well. Ask questions, and listen to the answers. It is not time to defend yourself or your team. It’s time to find out what the business needs from someone else’s perspective. Ask what they think that your team’s role is in achieving that success. Ask what they think your team does well, and where there are gaps between what you have now and excellence.

Speak their language

You probably recognize their words, but you need to go out of your way to speak them. To communicate your message, you will need to meet them on their turf. This may seem terribly unfair - “Why can’t they meet me on my terms?!” - but I’m guessing that has not been working out well for you so far.

Not only do you need to use their language, but you need to communicate over their medium. And identifying who they are is step one in learning to speak it. It’s probably not IRC, and only writing it in email is a good way for it to be ignored.

If you’re speaking to management, be prepared to write a presentation. Executives especially like to see a slide-deck. It doesn’t have to be slick. It probably shouldn’t have sounds or much in the way of transitions, but a presentation can help to lay the groundwork for a conversation.

Discuss Scope, Staffing, and Priorities

Now that you have described your role, we also need to describe everything that you support.
What Products do you support? It’s entirely possible (likely, even) that the people and teams that you support don’t actually know what you’re responsible for. It could be argued that most of them shouldn’t need to know. But if you have been saying “no” to protect yourself, it’s a sign that you are significantly overextended. You need to have a real discussion with your leadership about your role, scope, and staffing.

In order to have this discussion, you need to prepare. You need to come up with a fairly comprehensive list of the products and teams that you support. This is a list of every team and their products, with the components and tasks that belong to you for each. Don’t forget all of the components that “nobody owns” but somehow people come to you to fix or implement (CI, SCM, Ticketing, Project, and Wiki tools seem to be common examples). Are you also responsible for Directory Services? Virtualization platform? Mail/Chat/Phones? Workstation Purchasing and provisioning? Printers? Do you manage the Storage, Networking, etc? Don’t be afraid of getting into details. It can help to provide a clearly written description of the potential impact to the company if some of these “hidden” services stop working. Your leadership might not know what LDAP or Directory Services are, but they’ll understand if nobody can log into their machines, they can’t pull information to build reports, and by-the-way nobody can deploy code because it relies on validating credentials…

What is most important to the company? What do you need to succeed? How much more staff do you need? What tooling or equipment would help you work more efficiently? Does code deploy even when it fails testing? How many outages have arisen due to this happening?

Demonstrate Cost and Value and Revisit Priorities


In order to have meaningful discussions with people in your company who aren’t necessarily technical, you need to be able to relate to a language that they speak. Regardless of team duties, the lingua franca of most teams is money. As Engineers, most of us prefer to think in terms of the tech itself, but in order to describe an impact, a unit of monetary value is a proxy for impact that most non-technical people can understand, even if they don’t grasp the details.

It is a helpful (if difficult and uncomfortable) habit to get into, but I encourage you to consider the components of cost that go into every incident or task.

What is the cost of a main-site outage? How much revenue does this feature bring in? Why are you spending so much on infrastructure and effort to make that component Highly-Available? Why does it matter that you do that piece of maintenance? Show the negative value of doing things the way they are (Opportunity Cost), versus investing time to improve the automation around it. Describe how doing this maintenance work reduces your context switching, unplanned outages, and lost reputation of your company. Describe the benefit in increased visibility to the business, and Agency to be gained by your peers on other teams.
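As a rough illustration of putting a number on these questions, here is a back-of-the-envelope sketch in Python. Every figure in it (revenue per hour, loaded hourly rate, how often a task runs) is an invented assumption; substitute numbers from your own business. It deliberately leaves out reputation cost, which is real but much harder to estimate.

```python
def incident_cost(duration_hours, revenue_per_hour, responders, loaded_hourly_rate):
    """Lost revenue plus the time of everyone pulled into the incident."""
    lost_revenue = duration_hours * revenue_per_hour
    labor = duration_hours * responders * loaded_hourly_rate
    return lost_revenue + labor


def yearly_toil_cost(minutes_per_run, runs_per_year, loaded_hourly_rate):
    """What a recurring manual task costs per year in people-time alone."""
    hours = minutes_per_run * runs_per_year / 60
    return hours * loaded_hourly_rate


# Hypothetical inputs: a 2-hour outage on a site earning 5000 per hour,
# with 4 people responding at a loaded rate of 75 per hour.
print(incident_cost(2, 5000, 4, 75))

# Hypothetical inputs: a 30-minute manual task performed 200 times a year.
print(yearly_toil_cost(30, 200, 75))
```

The second number is the Opportunity Cost argument in miniature: if automating the task takes less people-time than a year of doing it by hand, the investment pays for itself.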

Why put in place these tools to let product teams self-serve? Explain that the features that the company’s teams spend so much time and effort (read: “money”) creating mean nothing if those features aren’t available for customers to use, and that having those features unavailable costs money, both in lost feature billing and in reputation. If they claim that they’re doing Agile, but can’t do Continuous Delivery, they’re not really Agile, and the whole point of that framework is to improve delivery of value to the customer and the business!

Further, show how systems relate. It doesn’t have to be terribly detailed. Describe that the features that the customers use are reliant on x, y, and z components of infrastructure. Draw the lines from LDAP to storage to your CI tool to testing code to artifacts delivered to Production. Then show some of the other systems that have similar dependencies.
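If a whiteboard isn’t handy, the same picture can be sketched as a small dependency map. The following Python is a minimal illustration; the feature and component names are hypothetical, loosely following the LDAP-to-Production chain described above.

```python
# "X depends on Y" edges. All names are made up for this example.
DEPENDS_ON = {
    "checkout feature": ["web app"],
    "web app": ["production artifacts"],
    "production artifacts": ["ci tool"],
    "ci tool": ["storage", "ldap"],
    "reporting": ["storage", "ldap"],
    "storage": [],
    "ldap": [],
}


def everything_behind(item, graph):
    """Return every component that `item` relies on, directly or indirectly."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dependency in graph.get(current, []):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


def impacted_by(component, graph):
    """Return everything that breaks, sooner or later, if `component` fails."""
    return {item for item in graph if component in everything_behind(item, graph)}


print(sorted(everything_behind("checkout feature", DEPENDS_ON)))
print(sorted(impacted_by("ldap", DEPENDS_ON)))
```

Asking what is impacted by the boring component is usually the more persuasive direction to walk the graph.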

Once the picture emerges showing how everything is reliant on unexciting things like LDAP, your Storage cluster, and that janky collection of angry shell and perl scripts that keep everything working, realization will begin to dawn.

Congratulations, you’ve just effectively communicated Value.

Align Responsibility with Authority

Are you held responsible for apps written by other people? Who gets paged when “the app” goes down? How does that make sense?

Get Devs on-call for their apps. SysAdmins should be escalated to. Devs can triage and troubleshoot their own apps more readily than you can. They get to call in the cavalry when they get stuck. They don’t need to know everything about the systems, and they don’t need to resolve everything. When a fault occurs and they need help, they stay on the call, pairing with you as you diagnose, troubleshoot, and resolve. That way, they don’t need to escalate to you for that thing the next time it occurs, and can collaborate on automating a permanent fix.

When teams aren’t responsible for their products - when they aren’t paged when it fails - they are numb to the pain that they inflict. They’re not trying to cause pain; they just don’t feel it. It’s especially easy to argue this for teams that proclaim that they use Agile development methods: if they claim to want “continuous feedback”, there is nothing more visceral for providing feedback than the feeling of being awoken by a pager in the middle of the night. When the inevitable exclamation comes that “we can’t interrupt our developers”, ask if it makes sense to interrupt someone else.

Being aware of the pain (say, hearing how many times you were paged last night) can elicit sympathy, but that’s a far cry from the experience of being paged yourself.

Further, this is what that list of responsibilities is for. Even after asking each team to take responsibility for their own products, you will still likely have a hefty list of services that you provide and are on-call for. As these changes set in, point out the staffing numbers. This may be a matter of the places that I have worked, but I have never seen a Developer-to-SysAdmin ratio of less than 5-1. In most places it is much higher. By adding these teams to pager rotations, they drastically reduce the load on you. By not adding them to pager rotations, they are complicit in your burnout.

Stop saying “No”


SysAdmins have a reputation for saying “No”. The people who are asking are probably not trying to make your life worse; they’re probably just trying to get their work done. They might not know what their “simple request” involves, or that it probably isn’t necessary.

But by not having Responsibility aligned with Authority, you may have been stuck with the pain of other people’s wishes. You know that fulfilling their request will cause you pain, so understandably, you say “no”. What often happens next is that they escalate until they hit someone sufficiently important to override you.

This is the basis for why SysAdmins feel steamrolled by everyone else, and everyone else feels held hostage by SysAdmins.

But all hope is not lost.

Stop saying “No”.

“Yes, but …” is a very powerful thing.

“Yes, but …” can be used to get you help.

“Yes, I can set that up for you, but we don’t have capacity to run it for you.” What happened there? You agreed that the request is reasonable. You set expectations of the level of support that you can give. You left the requestor with several options to continue the conversation.
  • They might have hiring reqs that they can’t fill. You can negotiate for some of them to go to your team, as you’re clearly understaffed.
  • Some of their engineers may join your team as a lateral move. They’ll need mentorship and training, but this kind of cross-training is invaluable. It’s a force multiplier. It also sets precedent.
  • They might take the responsibility for the Thing. They run it. They get paged for it. Of course you will probably have to be an escalation point to assist when it fails, but it’s their product. This again sets precedent.

Delegate

Most SysAdmins are stuck doing tasks that provide very little value because they restrict their peers’ access. To my mind, there is one perfect example: “Playing Telephone”.

When I say “Playing Telephone”, I’m talking about the situation where someone (let’s say a Developer) wants logs from the application, but they don’t have access to get them. They request the logs from you. You fetch the log requested and provide it to them. “No, not that log, this log…” You fetch. “Hmm, I’m not seeing what I’m looking for, could you check in here for something that says something like this …?” And so on, and so on…

I don’t know what you’re hoping to prevent by restricting access, but if this scenario ever happens, you should know that you’re providing Negative Value. Again, let’s try to remember that your peers are not out to get you, and can probably be trusted to be reasonable humans if you meet them mid-way.

With that framework in mind, it’s time that you demonstrate some trust, and Delegate to them. Give them access. Your value is not in the logon credentials that you have, otherwise you’re just a poorly-implemented “Terminal-as-a-Service”.

Even better than giving access, is giving Tooling. Logging into a server should be an antipattern for most work! You need some better tooling. So, with the example of logging, let’s talk tooling.

Logging

First, logging into boxes to get logs is just dumb. Sure, you could wrap a tail command in a Rundeck job, but let’s Centralize those logs while we’re at it.

SysLog is better than nothing, but not by much. Shipping logs is easy, but consuming them as something useful is not. Batteries not included.
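One thing that tends to make shipped logs easier to consume is emitting them as structured records rather than free-form text. The sketch below uses only Python’s standard `logging` and `json` modules to write one JSON object per line. The logger name and message are hypothetical, and the in-memory buffer is a stand-in for whatever file or socket your log shipper actually reads.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep the example's output in one place
    return logger


buffer = io.StringIO()  # stand-in for a file or socket that a log shipper reads
log = build_logger("billing-app", buffer)
log.info("payment accepted for order %s", 1234)
print(buffer.getvalue().strip())
```

With fields like `level` and `logger` already separated, whoever searches the logs later can filter on them instead of writing regular expressions against prose.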

If your company wants to spend the money on Splunk, then encourage that. Splunk is a fantastic suite of tools, but I might wave you away from it if you’re not going to use it for everything. It’s going to be expensive, and if you’re not going to spend enough to use it for everything, there will be confusion as to what’s in there, and what’s stored elsewhere.

ELK (ElasticSearch + Logstash + Kibana, sometimes mistakenly simplified to “Logstash”), or “Cloudy Elk” / “ELK-as-a-Service” is a good middle-ground. ELK is Free (as-in-beer), and very featureful.

Take your Centralized logging of choice, and provide your customers with the url to the web interface. Send them links to the “How to use” docs, and get out of their way!

Terminal-as-a-Service

If someone asks you to “run this command for me”, you need to put a button on it.

You don’t need to RUN-ALL-THE-THINGS!

Rundeck is a fantastic tool to “Put a button on it”. Other people use their CI tools (like Jenkins or Bamboo) for this. My friend Jeremy Price gave an Ignite Talk at DevOpsDays NYC 2015 that describes this.

Personally I like Rundeck, because it’s pretty easy to make HA, tie it into LDAP for credentials, manage permissions, and by shipping its logs (see what I did there?), you get Auditing of who ran what and when!
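To show the shape of “a button with permissions and an audit trail”, here is a toy sketch in Python. To be clear, this is not Rundeck’s API or configuration format; the job names, groups, and functions are all hypothetical, and a real tool gives you this (plus LDAP integration) without writing it yourself.

```python
from datetime import datetime, timezone

# Named jobs, who may run them, and what they do (all hypothetical).
JOBS = {
    "restart-app": {"groups": {"dev", "ops"}, "action": lambda: "app restarted"},
    "flush-cache": {"groups": {"ops"}, "action": lambda: "cache flushed"},
}

AUDIT_LOG = []  # in a real setup these entries would be shipped to central logging


def press_button(user, user_groups, job_name):
    """Run a named job if the user's groups allow it, recording every attempt."""
    job = JOBS[job_name]
    allowed = bool(job["groups"] & set(user_groups))
    AUDIT_LOG.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "job": job_name,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{user} may not run {job_name}")
    return job["action"]()


print(press_button("alice", ["dev"], "restart-app"))
print(AUDIT_LOG[-1]["user"], AUDIT_LOG[-1]["job"], AUDIT_LOG[-1]["allowed"])
```

The point of the design is that nobody needs a shell on the server: the command is fixed, the permission check is explicit, and denied attempts are recorded alongside successful ones.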

If you have some data that Must be restricted, try to isolate those cases from the rest of your environment. You shouldn’t have to restrict Everything just because Something does need isolation.

Deploying Code. Yes, to Production

Why would you want to have to deploy other people’s code?! Do you really provide any value in that activity? If the deployment doesn’t go well, you’re launching another game of “Telephone”.
What if you make it easy for them to do it? Empower them with trust and tooling, making it easy to do the right thing! Give them tooling to see that the deploy succeeded! Logs are a start, but Metrics Dashboards that show changes in performance conditions and error rates will make it plain to see if a deployment was successful!
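As a small illustration of what “plain to see” could mean, here is a sketch in Python that compares the error rate before and after a deploy. The request counts and the one-percentage-point tolerance are assumptions chosen for the example; a real dashboard would pull these numbers from your metrics system and your team would pick its own threshold.

```python
def error_rate(errors, requests):
    """Fraction of requests that failed."""
    if requests <= 0:
        raise ValueError("need at least one request to compute an error rate")
    return errors / requests


def deploy_looks_healthy(errors_before, requests_before,
                         errors_after, requests_after, tolerance=0.01):
    """True if the post-deploy error rate has not risen by more than `tolerance`."""
    before = error_rate(errors_before, requests_before)
    after = error_rate(errors_after, requests_after)
    return after <= before + tolerance


# Hypothetical numbers: 1% errors before the deploy, then 1.2% or 8% after.
print(deploy_looks_healthy(10, 1000, 12, 1000))
print(deploy_looks_healthy(10, 1000, 80, 1000))
```

A check like this does not replace a person watching the dashboard; it just makes the first question after a deploy quick to answer.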

This Freedom doesn’t come free. Providing tooling doesn’t absolve the development teams of the need to communicate; in fact, it’s likely that they’ll have to communicate more. They will need to be watching those dashboards and logs to see for themselves the success of every deploy. They will also be more readily on-hand to help triage the inevitable instances when it doesn’t go swimmingly.

Us

I say “They” in this article a lot. And that is because, by default, most organizations that I have been a part of or heard stories of have had a strong component of “Us-Versus-Them.” It’s only natural for there to be an “Us” and a “Them”, but thinking in those terms should be a very short-term use of the language. Strive for the goal of a “We” in your interactions at work, and reinforce that language wherever possible. While it may not be My job to do “foo”, it is Our job to ensure the team and company is successful.

While that may sound like some happy-go-lucky, tree-hugging, pop-psychological nonsense (and it is…:), the goal here is to get you, the beleaguered SysAdmin, the help that you need, in order to improve the capabilities of the business.

Coda

There is so much more to this topic, particularly the shift away from a Systems team supporting a bunch of Project teams to a series of largely self-sustaining Product teams, but that will have to wait for another day.

The psychological damage done to SysAdmins by their peers can make us bitter and cynical. I encourage my people to try to see that “They” aren’t trying to make life difficult for you, but it’s very likely that Authority and Responsibility are misaligned. I likewise encourage my people to take steps to make their lives better. A ship’s course is changed in small degrees over time.

When someone says “DevOps Doesn’t Work”, they’re absolutely correct. DevOps is a concept, a philosophy, a professional movement based in trust and collaboration among teams, to align them to business needs. A concept doesn’t do work, and a philosophy does not meet goals - people do. I encourage you to seek out ways of working better with your fellow people.

Gratitude

I’d like to thank my friends for listening to me rant, and my editor Shaun Mouton for their help bringing this article together. I’d also like to thank the SysAdvent team for putting in the effort that keeps this fun tradition going.

Contact Me

If you wish to discuss with me further, please feel free to reach out to me. I am gwaldo on Twitter and Gmail/Hangouts, and seldom refuse hugs (or offers of beverage and company) at conferences. Death Threats and unpleasantness beyond the realm of constructive Criticism may be sent to:
Waldo
c/o FBI Headquarters 
935 Pennsylvania Avenue, NW
Washington, D.C.
20535-0001

4 comments :

ron said...

Fascinating stuff, thanks

Ben C said...

This should be required reading for every sysadmin. I've been writing and editing for SysAdvent since 2009 and this may be the best article that's ever been published here.

Waldo said...

Thank you, folks! I appreciate it!

skinp said...

Great content, thank you. Deserves a place in my *read once a year* list.