December 8, 2014

Day 8 - From Operator to Guide: Lessons Learned Moving from Engineering into Management

Written by: Mathias Meyer (@roidrage)
Edited by: Ben Cotton (@funnelfiasco)

As a company grows from a tiny startup to a small business, curious things happen. The small startup struggles to figure out how to scale a product and the infrastructure supporting it, with as few people as possible in as many roles as possible.

The small business struggles with other pains like growing a team, creating a set of shared values, growing the customer base, and focusing on longer-term strategies. It’s not so much about survival anymore as it is about building something sustainable.

Our little company stepped from one phase into the other this year, and the challenges have been interesting, especially for someone like me, who went from being an infrastructure engineer to being a manager.

Our little company, an open source project at heart, has a history of trying to be as open as we can about our production issues and nurturing a blameless culture.

When you’re a small team, that can be an easy seed to plant and to get everyone on board. After all, everyone’s already responsible for everything, so switching hats is easy.

But as a manager, you’re now confronted with making sure your team and your customers are happy and that the business is growing (hopefully because your people and customers are happy).

That’s where push comes to shove with a blameless and just culture: balancing production pressure, people’s happiness, and productivity into something that fosters critical thinking, a culture of always looking out for and learning from unexpected events, and a product whose stability improves based on experience in production.

Letting go and delegating responsibility

In our early days and years, as an infrastructure engineer, I was involved in responding to a lot of our outages, writing the postmortems, and giving public talks about our culture.

I still rotate through on-call and respond to production incidents, but as a team, our primary focus has become making sure that the knowledge of handling outages spreads to as many people as possible. That meant doing the one thing I’ve had trouble with over the past couple of months: letting go.

When you want other people to take on the responsibility to handle production incidents, you need to step back a little. You turn from being an operator to being a guide. You get out of the way.

The only way to build trust in your team is to let people learn from mistakes, no matter whose mistakes they are. As Richard Cook said in “How Complex Systems Fail”:

Failure free operations require experience with failure.

When you want your team to be able to respond to outages, gain their own experience, and share it with the team, it’s important to let go. You can still lend a helping hand, of course: you can ask (guiding) questions during outages to gather as much information as possible or to nudge people in different directions.

After all, everyone is prone to their own biases, and not everyone on your team will have experience with all aspects of your systems.

What have I done over the last couple of months? I’ve been a bystander for maintenances and outages, giving encouragement and getting stuff out of the way where I can.

People need to feel a sense of ownership and responsibility. If you as a manager continue to be the leading hand during outages, you’ll always be the first line of response and you’ll take away those experiences from your team.

Just a little bit of process

Handling a production incident is one part of the story. The other is what comes afterwards. You can choose to ignore what happened, but that’s leaving a great opportunity to learn on the table.

Depending on where your team members come from, this can be a challenge. Not every company fosters a culture of looking at what happens in production and making sure the appropriate learnings are taken away and used for future improvements. Some prefer to find a root cause (usually a human) and fire them. Problem solved.

Others have only a little experience conducting incident response and writing postmortems. Two things are important to get everyone up to speed:

Mentoring

Teach your team what you know: share the knowledge of what’s considered a good incident response and what the minimum requirements for a postmortem are.

Get those things into writing (an Ops Playbook is a good idea to have) and make sure that people read it, commit to it, and are actively involved in improving it.

Mentoring is one of your bigger tasks as a manager, and if you want people to take responsibility and to follow in your footsteps or make their own, it’s the best place to start. You can mentor them personally, or share knowledge internally by giving talks, giving feedback on previous incidents and postmortems, writing on your internal blog, and creating discussions around what people think.

Setting Expectations

Once you’ve shared knowledge, you can agree on expectations. Those can also be manifested in the Ops Playbook.

In our case, internal postmortems need to be filed within 24 hours after the incident.

The idea isn’t to put high pressure on people, but to make up for the fact that humans quickly forget important details, in particular after a stressful situation.

If there was extended impact for our customers, a public postmortem should go up on our status page within 72 hours after the outage.

You may not need a strict process around this, but having a few guidelines helps encourage everyone to stop and think for a moment about what they can take away from an incident and how they can improve operations, development, or the product in the future.
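
If you want guidelines like these to stick, it can help to write them down somewhere your tooling can see them. As a rough sketch (not our actual tooling; the deadlines simply mirror the examples above), a few lines of Python could flag overdue postmortems:

    from datetime import datetime, timedelta

    # Example deadlines, mirroring the guidelines above; adjust them to your own playbook.
    INTERNAL_POSTMORTEM_DEADLINE = timedelta(hours=24)
    PUBLIC_POSTMORTEM_DEADLINE = timedelta(hours=72)

    def overdue_postmortems(incident_end, internal_filed=False, public_filed=False,
                            customer_impact=False, now=None):
        """Return reminders for postmortems that are past their deadline."""
        now = now or datetime.utcnow()
        reminders = []
        if not internal_filed and now > incident_end + INTERNAL_POSTMORTEM_DEADLINE:
            reminders.append("internal postmortem overdue (24 hour guideline)")
        if customer_impact and not public_filed and now > incident_end + PUBLIC_POSTMORTEM_DEADLINE:
            reminders.append("public postmortem overdue (72 hour guideline)")
        return reminders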

Production pressure and continuous improvement

As a manager, it’s too easy to fall into the trap of production pressure, especially when you’re reporting to other managers. You may be more accountable to stakeholders or shareholders.

How can you combine this with the idea of continuous improvement, in particular if you have a lot of outages that require fixes and improvements?

The answer to that is surprisingly simple. When you don’t apply learnings from incidents, they will continue to interrupt your customers in whatever task your product is assisting them with, leading to increased unhappiness, churn and effectively reduced revenue.

One of our struggles recently has been to keep up with remediation items from previous outages. This is my fault, not my team’s. It’s a manager’s responsibility to give room for scheduling in the remediations their team deems necessary. It’s their responsibility to encourage everyone to be involved in the process.

When remediation items aren’t cleared up, it’s the manager’s responsibility to find out what’s blocking the team and to get that out of the way.

I encourage you to look at how Etsy conducts postmortems, how they store them, and how they make sure that remediation items are followed up on.

Nudging people to think critically

Traditional thinking, or the Old View of human error (see “The Field Guide to Understanding Human Error” by Sidney Dekker), suggests that there is a root cause for any incident. The Five Whys method suggests that it only takes a few questions to find it.

Root cause is appealing to engineers because it gives them one thing to fix. Root cause is appealing to management because it gives them one thing to blame (and to fire).

Root cause analysis avoids critical thinking. When your team stops at a single point and focuses only on that, you remove any opportunity to figure out the bigger picture and the motivations people had at the time of the incident or during the events leading up to it.

Thinking beyond the root cause requires critical thinking and asking questions, as uncomfortable as they may be. Just so we’re clear, they can and should be uncomfortable to you as a manager as well.

One of our challenges has been encouraging people to think beyond a simple resolution to an outage, beyond just this one cause.

Sometimes that means asking them a few more questions; sometimes you, as their manager, need to ask yourself how you can improve the organizational environment or the processes that people were basing their actions on.

From generalization to specialization

In the early days of most startups, you’ll find everyone doing everything. We were firm believers in generalization early on, as neither our team size nor our little revenue allowed for specialization.

Today we find ourselves working towards more specialized roles. We encourage people to grow into more specific roles or responsibilities (without stressing titles too much), both to increase the sense of ownership people feel for our product and to remove ambiguity about who’s responsible for what.

We started longing for specialization, not just in engineering, as our business grew to a stable customer base and revenue.

That includes defining responsibilities in incident response. Who’s the main point of communication? Who facilitates the team’s response? How many people should be involved in a scheduled maintenance? How and when are issues escalated to involve more people?

All those questions can add a little bit more process, but they help in reducing ambiguity. Because when everyone’s responsible for everything, is anyone really responsible?
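
To make that concrete, here’s a hypothetical sketch (the role names and the structure are my own illustration, not a fixed prescription) of what writing down incident responsibilities could look like:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class IncidentRoles:
        """Who does what during an incident; one name per role removes ambiguity."""
        communicator: str   # main point of contact for customers and the status page
        facilitator: str    # coordinates the team's response and decides on escalation
        responders: List[str] = field(default_factory=list)  # hands on the keyboard

        def escalate(self, engineer):
            """Pull in another responder when the facilitator decides more help is needed."""
            self.responders.append(engineer)

    # A scheduled maintenance might only need two or three people.
    maintenance = IncidentRoles(communicator="alice", facilitator="bob", responders=["carol"])
    maintenance.escalate("dave")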

Towards a culture of learning

Running a business with customers and a team distributed across the globe makes it hard to stop and reflect on what you’re doing and on the things that have happened in production, and to figure out what you can learn from them.

It’s easy to just keep going and leave things on the sidelines, thinking “We can always fix this later.”

But whatever you leave on the sidelines is going to come back and bite you later, causing more embarrassment when it affects your customers.

Continuous learning requires everyone to take a step back, and it requires you as a manager to give everyone room to do so, room to improve, room to learn and to gain more experience.

Be the manager you always wanted to have

Growing into a manager role is a unique chance to be the manager you always wanted to have. You’ll have to incorporate a bigger picture into this idea, but it’s up to you to shape how you treat your team and your production system. That doesn’t mean leaving aside everything your previous managers did, because some of it may have had good reasons.

You can build a culture based on the old view of human error (believe me, it’s all too easy to fall into this trap), or you can treat your team with the empathy that you always wanted from your previous managers (yes, those who always put the schedule first and people last).

The important thing to remember is that as a manager, you need to trust the people at the sharp end of the action to do the right thing. It’s very likely the very same thing you would’ve wanted your manager to do or that you expected as an operator.

As Richard Cook said:

Actions at the sharp end resolve all ambiguity.

Your role as a manager is now to reduce ambiguity over time. If you don’t, your engineers will do it for you, giving you plenty of opportunity to learn and improve their environment, possibly causing unhappiness all around.

It’s been an interesting change in role for me, and I’m bound to make mistakes and learn from them. But learning is what this is all about, for you, your team, and your company.

If you have any stories to share or want to talk about culture and management, I’d love to talk. Shoot me an email.
