Incident Management 101: Communicate, Solve, and Analyze
In software development, incidents happen all the time: occasionally major, but mostly minor. Some companies have well-defined checklists and rulebooks for handling incidents, perhaps because they move fast and break things often. Others have no rulebook because incidents are rare, and they rely on people's best judgment to decide how each one is handled.
Next time you find yourself in the eye of a storm, remember these three principles of good incident management.
- Communicate: The moment you become the owner of an incident, start communicating. Address your communication to everybody who should know. This could be via email, team chat, or whatever else your company uses to communicate broadly. Let people know the nature of the incident, your current understanding of the ongoing impact on the business, and what you plan to do next. Update this thread every time you have something important to say. If you get stuck and progress is taking longer than expected, communicate that too. The last thing you want is for stakeholders to start pinging you directly for updates, which will slow you down. The updates should stop only after the issue is resolved, and your last update should include a summary of the fix.
- Solve: Your goal during an incident is to find the best possible solution in the shortest amount of time, in order to minimize business impact. Seek help as soon as you need it, and reach out to the people you think are best placed to help. If the ongoing impact is severe and you are unable to make progress, escalate up your management chain and let them know that you need help from someone more knowledgeable in a certain area. Let them find you someone who can step in.
- Analyze: Once the storm is over, the next step is to dive deep: find out what caused the issue, identify what can be improved to prevent similar incidents in the future, and come up with a list of action items. The actions you and your team take after the incident will show the level of maturity of the team as well as the organization. Good teams always improve things in the aftermath of an incident. They focus on learning from and understanding past mistakes, and spend little or no time finger-pointing or blaming.
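For teams that post updates to a shared thread, the fields described under Communicate can be captured in a small template. The following is only a minimal sketch: the `IncidentUpdate` class, its field names, the status labels, and the sample incident are all hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentUpdate:
    """One message in the incident thread (hypothetical structure)."""
    nature: str                        # what is happening
    impact: str                        # current understanding of the business impact
    next_step: str = ""                # what the owner plans to do next
    blocked: bool = False              # True when progress has stalled
    fix_summary: Optional[str] = None  # set only on the final update

    def render(self) -> str:
        # The final update carries a fix summary instead of a next step.
        if self.fix_summary is not None:
            status, last = "RESOLVED", f"Fix: {self.fix_summary}"
        else:
            status = "BLOCKED" if self.blocked else "ONGOING"
            last = f"Next: {self.next_step}"
        return "\n".join([
            f"[{status}] {self.nature}",
            f"Impact: {self.impact}",
            last,
        ])


# Made-up example data, for illustration only.
first = IncidentUpdate(
    nature="Checkout API returning errors",
    impact="Some customers cannot place orders",
    next_step="Rolling back the latest deploy",
)
final = IncidentUpdate(
    nature="Checkout API returning errors",
    impact="Orders are flowing normally again",
    fix_summary="Rolled back the latest deploy",
)
print(first.render())
print(final.render())
```

Keeping every update in the same shape makes it easy for stakeholders to scan the thread without asking follow-up questions.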
You can follow these simple steps even if your team or organization is not in the habit of expecting any of this from you. Use this as an opportunity to show your maturity and raise the bar for your team. This approach will help reduce the stress of incident management by putting you in a problem-solving mode. You and your team will emerge from the crisis stronger, with a deeper understanding of your applications and business.