This was written by Kate Matsudaira.
It’s the end of the year, time for contemplation ... and resolutions. As you think back on this year, were you a Fire Marshal or a Fire Fighter?
Being a fire fighter can seem rewarding; they swoop in, save the day, and play the hero. However, in a highly effective team there should be no heroes. The recognition that comes with saving the day often outweighs the silence that comes from a smooth, seamless deployment or issues resolved without customer impact.
The thing about operations and infrastructure teams is that a lot of the work isn’t noticeable unless it is done wrong. Executives pay attention and take notice when things go awry or problems, like outages, occur. Great work, streamlined processes, and reduced costs are sometimes harder to see.
Far too often, operations and devops teams enter into a death spiral of reactivity - a positive feedback loop without preventive planning. These teams fail to allocate time to proactive measures, which, in turn, leads to crisis after crisis. How does such a team, otherwise beaten down and demoralized by operational problems, find a way to reboot into a "proactive" state of being?
Whether you are a team member or the team manager, you can help pull your team of fighters into a more organized brigade, set up to put fires out before they become a problem.
When it comes to fire fighting, there are typically two things at play: the way we select problems to solve, and how we solve them.
Selecting Problems
- Reactive teams work on every problem that comes up.
- The project in progress is always the one with the most urgent deadline.
- The priority of the problem is often dictated by the originator of the request, instead of the business need.
Solving Problems
- Reactive teams give all problems to the fastest resolver, regardless of past or current assignments.
- There are often lots of inefficiencies – too many people in meetings or involved in resolving incidents.
- Since everything is urgent, there isn’t enough focus on the root causes, resulting in lots of patches and hacks.
In order to move from being a fire fighter to being a fire marshal, it is important to devise a strategy that addresses both how problems are selected and how they are solved. Here are some suggestions and recommendations to get you started:
- All equipment should be operable by multiple people. You wouldn’t want a fire department with only one person who could operate the truck. The same goes for software. Are there any systems or tools that only one person knows how to debug or diagnose? Inevitably, these single points of failure result in fire drills and cause inefficiencies. Taking the time to train multiple people on each part of the system creates redundancy and improves overall team knowledge of operations, and it often surfaces opportunities through previously unnoticed synergies.
- Plan to plan. It is easy to get caught up in what is urgent and tactical, but proactive planning won’t happen without explicitly creating time for it. Moreover, periodically moving players between the duties of ops tactics and strategy planning helps balance the load and can provide an often much-needed break from the line of fire. Draw a clear "line in the sand" between strategy and operations; build a firewall between them, but apply it fairly, so that it does not block progress.
- Put players on assignments that play to their strengths. It is true that learning new things and pushing people to challenge themselves is wise when it comes to long-term career development. However, in a team that is more reactive and constantly dealing with issues, it may be best to give each person an assignment they will be able to knock out of the park. When people get the work they are best at, they can complete it quickly and hopefully have extra cycles to help with more proactive, preventive assignments. Moreover, if there are players on your team with a higher tolerance for crisis, set up roles that align players with their tolerance level. Some people thrive on the reactivity and can play the role of your "trauma surgeons," so to speak.
- Focus on root cause. When there is a crisis, it is important to focus on what will stop the damage and get the systems back up and operating normally. In reactive situations, however, once things are stabilized, a complete diagnosis is often put on the back burner, and the temporary solution, put together on gut feeling, stays in place. The result is more brittle and unstable software. By helping create a culture of understanding and by training other problem solvers, you enable people to take on more responsibility and solve problems properly.
- Show signs of real change. If you are adopting a new paradigm or cultural shift, aim for shock-and-awe: do something remarkable to demonstrate that you are breaking the cycle of reactivity (e.g. take the most legacy behavior/process and either eradicate it or drastically modify it). It is usually pretty obvious when a team is underwater from incidents, but it can be hard to see what has been postponed in favor of all these urgent problems. What are the important items that need to be tackled? Make a list or add them to your backlog; simply cataloging this work provides visibility, which is the first step toward progress.
- Let go of the "unhappy's". Both figuratively and literally. Very seldom is change welcomed with open arms, especially among us engineers, who tend to like patterns and routine. Try to enlist help at the top from your manager or other senior team members. Expect to be met with some resistance, and when it happens, ask questions and try to understand the concerns behind it; you may be surprised by some good insights and ideas. Be bold; come with a plan and enact it quickly. Show confidence, but also flexibility and a willingness to listen.
- Create a positive feedback cycle. A key part of any change is to show progress and small victories. Signify progress by planning an off-site or organizing team-building exercises that otherwise would've been impossible due to constant fire-fighting. Use metrics, such as ongoing trend reports built on statistically useful "human" measures, to show progress and improvements. Just don't reward people for making more work for themselves. That's like rewarding professional fire fighters for acts of arson! Further, it sends the wrong message about the kind of work behavior that is valued.
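The first suggestion above, making sure all equipment is operable by multiple people, is one of the few that can be checked mechanically. What follows is only a minimal sketch, assuming you keep (or can assemble) a simple map of which people can debug or operate each system. The system names, people, and function names are all hypothetical and invented for illustration; the threshold of two trained people is likewise an assumption you may want to raise.

```python
from collections import defaultdict

# Hypothetical data: who can debug or operate each system.
# Every name here is invented purely for illustration.
OPERATORS = {
    "deploy-pipeline": {"alice", "bob"},
    "billing-db": {"carol"},
    "monitoring": {"alice", "carol", "dave"},
    "legacy-cron": {"bob"},
}


def single_points_of_failure(operators, minimum=2):
    """Return the systems with fewer than `minimum` trained people, sorted by name."""
    return sorted(
        system for system, people in operators.items() if len(people) < minimum
    )


def training_load(operators, minimum=2):
    """Map each person to the under-covered systems that depend on them."""
    load = defaultdict(list)
    for system in single_points_of_failure(operators, minimum):
        for person in sorted(operators[system]):
            load[person].append(system)
    return dict(load)


if __name__ == "__main__":
    for system in single_points_of_failure(OPERATORS):
        print(f"{system}: needs cross-training")
    for person, systems in training_load(OPERATORS).items():
        print(f"{person} is the only operator for: {', '.join(systems)}")
```

A list like this will not fix anything by itself, but it turns "we should cross-train" into a concrete backlog item: each system it reports is a candidate for a pairing session or a written runbook.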
Some of these suggestions are easier to execute in a leadership role, but it doesn’t matter what your level or job is. Taking time to step back and think about proactively addressing issues before they arise is a great way to improve your work for you and your teammates. As a bonus, in many cases it can also lead to advancement and promotions.
Leadership is about influence, and management is about authority; most of these suggestions can be achieved without any formal authority. I would still recommend getting your manager or team lead aligned with your mission, though, since having a strong ally makes everything easier.
And remember, you're training a team of Fire Marshals, not Fire Fighters. Firefighting is an ephemeral state. Instead, a constant state of vigilance and fire prevention is what you want to engender.
Further Reading
- Blameless Post-Mortems
- Resilience Engineering
- Introduction to Visible Ops
- (video) Leveling Up - Taking your engineering & operations role to the next level
2 comments:
The number of heroic firefighters on a team is most strongly influenced by the others on the team that don't give a shit about anything.
If you want to reduce firefighting, first make sure that no one is sleeping at the wheel two desks away.
Also, make sure that the fire marshals are not forbidden by policy to improve stuff.
In my past experience, the head sysadmins, and whether or not they're held on a tight leash, made all the difference: bad ones set broken standards and cause 1000s of hours of overtime within ops, while good ones ensure smooth operation for years.
My hypothesis would be that wherever you see a team of fire fighters that is completely reactive, someone is not doing their job right.
I find it very hard to shift my team to a more proactive viewpoint as our manager is reactive. He says it's because we're understaffed and don't have enough time to be proactive, and that we'll be better next year when we get more staff. I'd like to see that.