Incident Handling Knowledge Base and Observability
TLDR
We (#observability channel members together with backend-infra) provide the resources below. They are not limited to backend engineers; anyone who cares about engineering incidents can use them:
- Knowledge base at https://29022131.atlassian.net/wiki/spaces/IHP/overview. Feel free to contribute there or extend it for your team. It contains:
  - Baseline recommendations (may not be 100% compatible with your team) for incident handling and post-mortem processes
    - A written guide saves you time explaining things to new joiners (especially if you include it in the onboarding process)
    - Not having a guide and SOP can cause confusion during incident handling, which slows down incident mitigation
  - Recommendations for incident tracking and an incident document template
    - To measure incident metrics properly and become more data-driven
  - Incident runbook catalog, where each runbook contains conditional steps for a specific problem symptom
    - To help anyone investigate and mitigate the problem, because similar incidents have happened to different teams in the past, but not all teams could handle them properly
- Slack channel for discussing incident handling (or observability): #observability. Feel free to join and discuss there.
What you can do: Please read the section “What You Can Do” below.
Background
Incidents are bad for us because they have an impact both internally (for example, loss of engineering time) and externally (for example, loss of revenue). Therefore, we want to prevent incidents as much as we can. However, there are many ways for incidents to happen, and we may not be able to prevent them all. At a minimum, we should be able to stop similar incidents from happening again, or mitigate them more quickly next time if we can’t prevent them.
Holding an incident post-mortem is one way to help us prevent similar incidents. If we do it right, we can uncover the root causes of the incident and fix them.
Based on our survey in Q1 2020:
- Not all teams have incident-related guides
- Not all teams track their incidents
- Not all teams do post-mortems
If we want to improve the reliability of the Traveloka system, we will need to:
- Have incident-related guides and a proper SOP
- Measure incident-related metrics to understand the current state and how it compares to the past (become more data-driven)
Why?
Having incident-related guides and a proper SOP
- Guides help everyone (especially new joiners) mitigate incidents quickly -> reduce incident mitigation time
- With a proper post-mortem process, we can prevent similar incidents or at least mitigate them faster -> reduce number of incidents and incident mitigation time
- If we document our lessons learned properly, we can help our team (or even other teams) when they encounter similar incidents (example: memory leak) -> reduce incident mitigation time
Measuring incident-related metrics
- If we don’t measure, we can’t be sure whether we are improving or not
- Without data, we sometimes can’t properly justify doing a project or purchasing new tools. Examples:
  - Should we purchase expensive tools to help incident investigation? (We need to know the total impact value of recent incidents.)
  - Should we work on new features or improve our testing coverage/methodology? (We need to know whether there have been many bug-related incidents recently.)
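As a rough illustration of what measuring incident metrics could look like, the sketch below groups tracked incidents by quarter and computes the incident count, the mean time to mitigate, and the number of bug-related incidents. The record fields (`detected_at`, `mitigated_at`, `cause`), the function names, and the sample data are all hypothetical; adapt them to whatever your incident tracker actually stores.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import NamedTuple


class Incident(NamedTuple):
    # Hypothetical fields; adapt them to your own incident tracker.
    detected_at: datetime
    mitigated_at: datetime
    cause: str  # e.g. "bug", "capacity", "config"


def quarter_of(ts: datetime) -> str:
    """Map a timestamp to a label such as '2020-Q1'."""
    return f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"


def summarize(incidents):
    """Return per-quarter count, mean minutes to mitigate, and bug-related count."""
    by_quarter = defaultdict(list)
    for inc in incidents:
        by_quarter[quarter_of(inc.detected_at)].append(inc)

    summary = {}
    for quarter, items in sorted(by_quarter.items()):
        minutes = [
            (i.mitigated_at - i.detected_at).total_seconds() / 60 for i in items
        ]
        summary[quarter] = {
            "count": len(items),
            "mean_minutes_to_mitigate": mean(minutes),
            "bug_related": sum(1 for i in items if i.cause == "bug"),
        }
    return summary


# Made-up sample data, for illustration only.
SAMPLE = [
    Incident(datetime(2020, 1, 10, 9, 0), datetime(2020, 1, 10, 9, 30), "bug"),
    Incident(datetime(2020, 2, 3, 14, 0), datetime(2020, 2, 3, 15, 30), "capacity"),
    Incident(datetime(2020, 4, 21, 8, 0), datetime(2020, 4, 21, 8, 20), "bug"),
]

if __name__ == "__main__":
    for quarter, stats in summarize(SAMPLE).items():
        print(quarter, stats)
```

Comparing these numbers across quarters is one simple way to see whether mitigation time is improving and whether bug-related incidents are frequent enough to justify investing in testing.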
What We Want to Share
- Incident-handling knowledge base: https://29022131.atlassian.net/wiki/spaces/IHP/overview
  - Baseline recommendations (may not be 100% compatible with your team) for incident handling and post-mortem processes. You can build your team guides on top of parts of the knowledge base.
- Slack channel to discuss: #observability
  - Feel free to discuss anything related to incident handling or observability here. You can also ask for feedback from others.
- Recommendations for incident tracking and an incident document template
  - The goal is to make sure you document and track incidents properly. If you already have an incident tracker or incident template that works for you, you don’t need to follow our recommendation 100%.
- Incident runbook catalog, where each runbook contains conditional steps for a specific problem symptom
  - To help anyone investigate and mitigate the problem, because similar incidents have happened to different teams in the past, but not all teams could handle them properly
  - By documenting this, an incident responder’s knowledge won’t be completely lost when that person leaves the company
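To make the idea of "conditional steps for a specific symptom" more concrete, here is a minimal sketch of how a runbook could be represented as data. The `Step` and `Runbook` types, the `applicable_actions` helper, and the example steps are all hypothetical and are not copied from the actual catalog; the real runbooks are written as pages in the knowledge base.

```python
from typing import List, NamedTuple, Set


class Step(NamedTuple):
    condition: str  # when this step applies
    action: str     # what the responder should do


class Runbook(NamedTuple):
    symptom: str
    steps: List[Step]


def applicable_actions(runbook: Runbook, observations: Set[str]) -> List[str]:
    """Return the actions whose condition matches something the responder observed."""
    return [s.action for s in runbook.steps if s.condition in observations]


# Illustrative content only; not taken from the real runbook catalog.
MEMORY_RUNBOOK = Runbook(
    symptom="memory usage keeps growing",
    steps=[
        Step("started after a recent deployment", "roll back the deployment"),
        Step("no recent deployment", "capture a heap dump, then restart the instance"),
    ],
)

if __name__ == "__main__":
    print(applicable_actions(MEMORY_RUNBOOK, {"no recent deployment"}))
```

Keeping each step as a condition/action pair mirrors how a responder reads a runbook during an incident: check what applies, then follow the matching action.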
What You Can Do
As an engineering lead/manager
- Create a guideline on how to determine the incident severity level (related page: incident severity levels)
- Create and define an SOP for incident tracking in your team to ensure every incident is tracked, especially major incidents (related page: incident tracking)
- Create and define the incident handling roles in your team, to make it easier for incident responders to collaborate and communicate (related page: roles during incident handling)
- Create and define an SOP for contacting your team (related page: how to escalate). Please update that page afterwards.
As anyone who will be involved in an engineering incident
- Read the materials in the Confluence space, our recommendation by priority:
- Work alongside your team to define the proper guides for your team
- Contribute to the knowledge base
  - If your team has a knowledge base, please update it regularly so that the lessons learned from each incident are not lost
  - Please update or create a runbook (guide for incident runbook) when you encounter a new incident symptom or find a new way to mitigate certain symptoms. You can also recommend tools to use there. The runbook may help everyone who encounters similar incidents in the future.
  - If you have an idea that you think can benefit other teams, please propose it to the community, for example via #observability