Update to Incident Handling Process
See latest update here:
https://29022131.atlassian.net/wiki/display/ENG/Incident+Handling+and+Prevention
Hi all,
Based on what we learned from several past production incidents, we're making the following changes:
- Site-Wide incidents are now a separate category from normal incidents, because resolution speed is more critical and they involve more complexity.
- Only Deri, Budi, and Denni are expected to act as the PIC for Site-Wide incidents (with @web-infra and @site-infra as the primary supporting parties). They will in turn involve other people as necessary. We don't expect just anyone to be able to act as the PIC and coordinate this type of incident.
- When a Site-Wide incident occurs, immediately tag Deri, Budi, and Denni, or call their phone numbers if necessary.
A special Grafana dashboard and guides for resolving common issues are provided to assist with Site-Wide incidents.