By: Joshua Timberman (@jtimberman)
You’re the SRE on call and, while working on a project, your phone buzzes with an alert: “Elevated 500s from API.”
You’re a software developer, and your team lead posts in Slack: “Hey, the library we use for our payment processing endpoint has a remote exploit.”
You work on the customer success team and, during a routine sync with a high-profile customer, they install the new version of your client CLI. Then, when they run any command, it exits with a non-zero return code.
An incident is any situation that disrupts the ability of customers to use a system, service, or software product in a safe and secure manner. And in each of the incidents above, the person who noticed the incident first will most likely become the incident commander. So, now what?
What does it mean to be an incident commander?
Once an individual identifies an incident, one or more people will respond to it. Their goal is to resolve the incident and return systems, services, or other software back to a functional state. While an incident may have a few or many responders—only one person is the incident commander. This role is not about experience, seniority, or position on an org chart; it is to ensure that progress is being made to resolve the incident. The incident commander must think about the inputs from the incident, make decisions about what to do next, and communicate with others about what is happening. The incident commander also determines when the incident is resolved based on the information they have. After the incident is over, the incident commander is also responsible for conducting a post-incident analysis and review to summarize what happened, what the team learned, and what will be done to mitigate the risk of a similar incident happening in the future.
Having a single person—the incident commander—be responsible for handling the incident, delegating responsibility to others, determining when the incident is resolved, and conducting the post-incident review is one of the most effective incident management strategies.
How do you become an incident commander?
Organizations vary on how a team member can become an incident commander. Some call upon the first responder to an incident. Others require specific training and have an on-call rotation of just incident commanders. However you find yourself in the role of incident commander, you should be trusted and empowered by your organization to lead the effort to resolve the incident.
Now that you’re incident commander, follow your organizations’ incident response procedure for the specifics about what to do. But for more general questions, we’ve got some guidance.
What are the best strategies for communication and coordination?
One of an incident commander’s primary tasks is to communicate with relevant teams and stakeholders about the status of the incident and to coordinate with other teams to ensure the right people are involved.
If your primary communication tool is Slack, use a separate channel for each incident. Prefix any time-sensitive notes with “timeline” or “TL” so they are easy to find later. If higher-bandwidth communication is required, use a video conference, and keep the channel updated with important information and interactions. When an incident affects external customers, be sure to update them as required by your support teams and agreements with customers.
In the case of a security incident, there may be additional communications requirements with your organization’s legal and/or marketing teams. Legal considerations to communicate may include:
- Statutory or regulatory reporting
- Contractual commitments and obligations to customers
- Insurance claims
Marketing considerations to communicate may include:
- Sensitive information from customer data exposure
- “Zero Day” exploits
- Special messaging requirements, e.g. for publicly traded companies
When should you hand off a long-running incident?
During an extended outage or other long-running incident, you will likely need a break. Whether you are feeling overwhelmed, or that you would contribute better by working on a solution for the incident itself, or that you need to eat, take care of your family, or sleep—all are good reasons to hand off the incident command to someone else.
Coordinate with your other responders in the appropriate channel, whether that’s in a Slack chat or in a Zoom meeting. If necessary, escalate by having someone paged out to get help. Once someone else can take over, communicate with them on the latest progress, the current steps being taken, and who else is involved with the incident. Remember, we’re all human and we need breaks.
How should you approach post-incident analysis and review?
One of an incident commander’s most important jobs is to conduct a post-incident analysis and review after the incident is resolved. This meeting must be blameless: That is, the goal of the meeting is to learn what happened, determine what contributing factors led to the incident, and take action to mitigate the risk of such an incident happening in the future. It’s also to establish a timeline of events, demonstrate an understanding of the problems, and set up the organization for future success in mitigating that problem.
The sooner the incident analysis and review meeting occurs after the incident is resolved, the better. You should ensure adequate rest time for yourself and other responders, but the review meeting should happen within 24 hours—and ideally not longer than two business days after the incident. The incident commander (or commanders) must attend, as they have the most context on what happened and what decisions were made. Any responders who performed significant remediation steps or investigation must also attend so they can share what they learned and what they did during the incident.
Because the systems that fail and cause incidents are complex, a good analysis and review process is complex. Let’s break it down:
Describe the incident
The incident commander will describe the incident. This description should detail the impact as well as its scope, i.e., whether the incident affected internal or external users, how long it took to discover, how long it took to recover, and what major steps were taken to resolve the incident.
“The platform was down” is not a good description.
“On its 5 minute check interval, our monitoring system alerted the on-call engineer that the API service was non-responsive, which meant external customers could not run their workflows for 15 minutes until we were able to restart the message queue” is a good description.
Successful incident analysis should identify the contributing factors and places where improvements can be made to systems and software. Our world is complex, and technology stacks have multiple moving parts and places where failures occur. Not only can a contributing factor be something technical like “a configuration change was made to an application,” it can be nontechnical like “the organization didn’t budget for new hardware to improve performance.” In reviewing the incident for contributing factors, incident commanders and responders are looking for areas for improvement in order to identify potential corrective actions.
Corrective action items
Finally, incident analysis should determine corrective action items. These must be specific work items that are assigned to a person or a team accountable for their completion, and they must be the primary work priority for that person or team. These aren’t “nice to have,” these are “must do to ensure the safe and reliable operation of the site or service.” Such tasks aren’t necessarily the actions taken during the incident to stabilize or remediate a problem, which are often temporary workarounds to restore service. A corrective action can be as simple as adding new monitoring alerts or system metrics that weren’t implemented before. It can also be as complex as rebuilding a database cluster with a different high availability strategy or migrating to a different database service entirely.
If you’ve recently been the incident commander for your first incident—congratulations. You’ve worked to solve a hard problem that had a lot of moving parts. You took on the role and communicated with the relevant teams and stakeholders. Then, you got some much needed rest and conducted a successful post-incident analysis. Your team identified corrective actions, and your site or service is going to be more reliable for your customers.
Incident management is one of the most stressful aspects of operations work for DevOps and SRE professionals. The first time you become an incident commander, it may be confusing or upsetting. Don’t panic. You’re doing just fine, and you’ll keep getting better.
If you’re new to post incident analysis and review, check out Howie: The Post-Incident Guide from Jeli.