Incident Handling Knowledge Base and Observability

TL;DR

We (#observability channel members together with backend-infra) provide the following. They are not limited to backend engineers; anyone who cares about engineering incidents can use them:

What you can do: Please read the section “What You Can Do” below.

Background

Incidents are bad for us because they have an impact both internally (for example, lost engineering time) and externally (for example, lost revenue). We therefore want to prevent as many incidents as we can. However, incidents can happen in so many ways that we may not be able to prevent them all. At a minimum, we should be able to stop similar incidents from happening again, or mitigate them more quickly next time if we cannot prevent them.
Holding an incident postmortem is one way to help us prevent similar incidents. If we do it right, we can uncover the root causes of the incident and fix them.
Based on our survey in Q1 2020:

If we want to improve the reliability of Traveloka's systems, we will need to:

Why?

Having incident-related guides and proper SOPs

Measuring incident-related metrics

What We Want to Share

What You Can Do

As an engineering lead or manager

As anyone who will be involved in an engineering incident