We (#observability channel members x backend-infra) provide the resources below. They are not limited to backend engineers: anyone who cares about engineering incidents can use them.

A knowledge base at https://29022131.atlassian.net/wiki/spaces/IHP/overview. Feel free to contribute to it or extend it for your team. It contains:

a baseline recommendation (may not be 100% compatible with your team) for incident handling and post-mortem processes

a written guide saves you time explaining things to new joiners (especially if you include it in the onboarding process)
not having a guide and SOP can cause confusion during incident handling, which slows down incident mitigation

recommendations for incident tracking and an incident document template

to measure incident metrics properly and become more data-driven

an incident runbook catalog, where each runbook contains conditional steps for a specific problem symptom

to help anyone investigate and mitigate the problem, because similar incidents have happened to different teams in the past, and not all teams could handle them properly

A Slack channel for discussing incident handling (or observability): #observability. Feel free to join and discuss there.

Background

Incidents are bad for us because they have an impact both internally (for example, loss of engineering time) and externally (for example, loss of revenue). Therefore, we want to prevent incidents as much as we can. However, there are so many ways for incidents to happen that we may not be able to prevent them all. So we should at least be able to stop similar incidents from happening again (or mitigate them more quickly next time if we can’t prevent them).
Holding an incident post-mortem is one way to help us prevent similar incidents. If we do it right, we can uncover the root causes of the incident and fix them.
Based on our survey in Q1 2020:

Why?

Having incident-related guides and proper SOP

Having guides helps everyone (especially new joiners) mitigate incidents quickly -> reduced incident mitigation time
With a proper post-mortem process, we can prevent similar incidents or at least mitigate them faster -> reduced number of incidents and reduced incident mitigation time
If we document our lessons learned properly, we can help our team (or even other teams) when they encounter similar incidents (example: a memory leak) -> reduced incident mitigation time

Measuring the incident-related metrics

If we don’t measure, we can’t be sure whether we are improving or not
Without measurements, we sometimes can’t properly justify doing a project or purchasing new tools. Examples:

Should we purchase expensive tools to help with incident investigation? (We need to know the total impact value of recent incidents.)
Should we work on new features, or should we improve our testing coverage/methodology? (We need to know whether there have been many bug-related incidents recently.)
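As an illustration of the kind of measurement meant here, the sketch below computes three numbers from a list of tracked incidents: mean time to mitigate, incident count per cause category, and total impact. It is a minimal sketch only. The `Incident` fields, the category names, and the sample records are all made up for the example and are not taken from our tracker; map them to whatever your team actually records.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical fields; rename to match your own incident tracker.
    title: str
    detected_at: datetime
    mitigated_at: datetime
    cause_category: str      # e.g. "bug", "capacity" (assumed categories)
    estimated_impact: float  # in whatever unit your team uses


def mean_time_to_mitigate(incidents):
    """Average of (mitigated_at - detected_at) across all incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.mitigated_at - i.detected_at for i in incidents), timedelta(0))
    return total / len(incidents)


def count_by_cause(incidents):
    """Number of incidents per cause category."""
    return Counter(i.cause_category for i in incidents)


def total_impact(incidents):
    """Sum of the estimated impact of all incidents."""
    return sum(i.estimated_impact for i in incidents)


# Made-up sample data, for illustration only.
incidents = [
    Incident("checkout errors", datetime(2020, 1, 5, 10, 0),
             datetime(2020, 1, 5, 10, 30), "bug", 1000.0),
    Incident("memory leak", datetime(2020, 2, 1, 9, 0),
             datetime(2020, 2, 1, 10, 30), "bug", 500.0),
    Incident("disk full", datetime(2020, 3, 1, 0, 0),
             datetime(2020, 3, 1, 1, 0), "capacity", 250.0),
]

print(mean_time_to_mitigate(incidents))
print(count_by_cause(incidents))
print(total_impact(incidents))
```

The cause breakdown speaks to the "features vs. testing" question, and the impact total speaks to the "expensive tools" question.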

What We Want to Share

Incident-handling knowledge base: https://29022131.atlassian.net/wiki/spaces/IHP/overview

A baseline recommendation (may not be 100% compatible with your team) for incident handling and post-mortem processes. You can build your team's guides on top of parts of the knowledge base.

Slack channel to discuss: #observability

Feel free to discuss things that are related to incident handling or observability here. You can also ask for feedback from others.

Recommendations for incident tracking and an incident document template

The goal is to make sure you document and track incidents properly. If you already have an incident tracker or incident template that works for you, you may not need to follow our recommendation 100%.

Incident runbook catalog, where each runbook contains conditional steps for a specific problem symptom

To help anyone investigate and mitigate the problem, because similar incidents have happened to different teams in the past, and not all teams could handle them properly
By documenting this, the knowledge from any incident responder won't be completely lost when that person leaves the company
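To make "conditional steps for a specific symptom" more concrete, here is one possible way to model a runbook as data. This is only a sketch: the `Step` and `Runbook` classes, the `find_runbooks` helper, and the memory-leak steps are hypothetical placeholders, not the format or content used in the actual catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    # Hypothetical structure: one check plus what to do for each outcome.
    condition: str
    if_true: str
    if_false: str = "Continue to the next step."


@dataclass
class Runbook:
    symptom: str
    steps: list = field(default_factory=list)

    def render(self):
        """Return the runbook as plain text that can be pasted into a wiki page."""
        lines = [f"Symptom: {self.symptom}"]
        for n, step in enumerate(self.steps, start=1):
            lines.append(f"{n}. Check: {step.condition}")
            lines.append(f"   If yes: {step.if_true}")
            lines.append(f"   If no: {step.if_false}")
        return "\n".join(lines)


def find_runbooks(catalog, keyword):
    """Case-insensitive search of the catalog by symptom text."""
    kw = keyword.lower()
    return [r for r in catalog if kw in r.symptom.lower()]


# Placeholder content, for illustration only.
catalog = [
    Runbook(
        symptom="Memory usage keeps growing (possible memory leak)",
        steps=[
            Step("Did memory start growing right after a deployment?",
                 "Roll back the deployment, then investigate."),
            Step("Is the service about to run out of memory?",
                 "Restart the affected instances as a temporary mitigation."),
        ],
    ),
]

for runbook in find_runbooks(catalog, "memory leak"):
    print(runbook.render())
```

Keeping the symptom as the lookup key matches how responders arrive at a runbook: they know what they are seeing, not yet what the cause is.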

What You Can Do

As an engineering lead/manager

Create a guideline on how to determine the incident severity level; this is related to incident severity levels
Create and define an SOP for incident tracking in your team to ensure every incident is tracked, especially major incidents; this is related to incident tracking
Create and define the incident handling roles in your team, to make collaboration and communication between incident responders easier; this is related to roles during incident handling
Create and define an SOP for contacting your team; this is related to how to escalate. Please update that page later

As anyone who will be involved in an engineering incident

Read the materials in the Confluence space. Our recommendation, by priority:

Work alongside your team to define the proper guides for your team

You may need to extend or override some of the materials for your team, because some of them may not be compatible with how your team works
You can see other teams’ guides at list of incident handling guides as an additional reference
Please add the link to your team’s guides to list of incident handling guides later

Contribute to knowledge base

If your team has a knowledge base, please update it regularly so that the lessons learned from each incident are not lost
Please update or create a runbook (guide for incident runbook) when you encounter a new incident symptom or find a new way to mitigate certain symptoms. You can also recommend tools to use there. The runbook may help everyone who encounters similar incidents in the future.
If you have an idea that you think can benefit other teams, please propose it to the community, for example via #observability

Incident Handling Knowledge Base and Observability

TLDR