[PSA] Traveloka - Datadog Automatic Incident Creation
Overview
Incident by Breaching SLO Monitors
Some product teams at Traveloka already use SLOs to monitor their service performance from a user and business perspective. Any monitor attached to an SLO that breaches its threshold should be treated as an incident, but there is currently no automatic way to create an incident from a breaching monitor.
Incident by AWS Resource Backup
We are currently adopting AWS Backup Manager and are aware that backup jobs sometimes fail. When a backup job fails, teams will be alerted and a SEV-3 incident will be declared. The incident is declared for the Production account only, and only one incident is active at a time, which means multiple failed jobs still result in a single incident.
This PSA was originally posted on:
[PSA] Automatic Incident Creation Based on SLO Monitors
How the automation works
Incident by Breaching SLO Monitors
- The automation collects the status of every Datadog Monitor attached to each SLO
- If any Monitor attached to an SLO has been in an alarm state for a certain duration, or has intermittently switched ON and OFF within a certain period, the automation creates a new SEV-2 incident for that SLO
- If an SLO has an open incident created by the automation, and all of its Monitors have been in the OK state for a certain duration with no intermittent ON and OFF records, the automation resolves the incident for that SLO
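As an illustration only, the create/resolve decision described above can be sketched in Python. The sample counts, the `Thresholds` fields, and all function names are hypothetical stand-ins for the "certain duration" mentioned above; this PSA does not specify the real values or the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, Sequence

ALERT = "Alert"
OK = "OK"


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical; the PSA only says "a certain duration".
    sustained_alert_samples: int = 5   # consecutive Alert samples => sustained alarm
    flap_window_samples: int = 10      # look-back window used to detect flapping
    flap_min_episodes: int = 3         # separate Alert episodes in the window => flapping
    ok_samples_to_resolve: int = 10    # consecutive OK samples needed to resolve


def count_alert_episodes(window: Sequence[str]) -> int:
    """Count transitions into Alert (an Alert at the window start counts as one)."""
    episodes = 0
    previous = None
    for state in window:
        if state == ALERT and previous != ALERT:
            episodes += 1
        previous = state
    return episodes


def monitor_breached(history: Sequence[str], t: Thresholds) -> bool:
    """True if the monitor is in sustained alarm or is flapping ON and OFF."""
    recent = history[-t.sustained_alert_samples:]
    sustained = len(recent) == t.sustained_alert_samples and all(s == ALERT for s in recent)
    flapping = count_alert_episodes(history[-t.flap_window_samples:]) >= t.flap_min_episodes
    return sustained or flapping


def monitor_stable_ok(history: Sequence[str], t: Thresholds) -> bool:
    """True if the monitor has been OK for the whole resolve window."""
    recent = history[-t.ok_samples_to_resolve:]
    return len(recent) == t.ok_samples_to_resolve and all(s == OK for s in recent)


def decide(histories: Dict[str, Sequence[str]], incident_open: bool,
           t: Thresholds = Thresholds()) -> str:
    """Return "create", "resolve", or "none" for one SLO.

    `histories` maps a monitor name to its state samples, oldest first.
    """
    if not incident_open:
        if any(monitor_breached(h, t) for h in histories.values()):
            return "create"  # the PSA states these incidents are SEV-2
        return "none"
    if histories and all(monitor_stable_ok(h, t) for h in histories.values()):
        return "resolve"
    return "none"
```

In this sketch a single breaching Monitor is enough to open an incident, while resolving requires every Monitor on the SLO to be stable, mirroring the two bullets above.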
Incident by AWS Resource Backup
- The automation collects data from all AWS accounts using EventBridge
- If a failed backup job comes from the Production account, the automation invokes the Automatic Incident API and creates an incident
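A minimal Python sketch of this flow, combined with the "one active incident at a time" rule from the Overview, might look like the following. It assumes the EventBridge event uses the `aws.backup` source with a `Backup Job State Change` detail type and a `state` field in `detail`; please verify these against the events you actually receive. The account ID, the `BackupIncidentGate` class, and the `create_incident` callable (standing in for the Automatic Incident API) are hypothetical.

```python
from typing import Callable, Optional


def is_failed_production_backup(event: dict, production_account_id: str) -> bool:
    """True only for failed AWS Backup jobs reported by the Production account."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.backup"
        and event.get("detail-type") == "Backup Job State Change"
        and detail.get("state") == "FAILED"
        and event.get("account") == production_account_id
    )


class BackupIncidentGate:
    """Creates at most one active incident, however many jobs fail."""

    def __init__(self, create_incident: Callable[[str, str], str],
                 production_account_id: str) -> None:
        # create_incident(severity, title) -> incident id; a hypothetical
        # stand-in for the Automatic Incident API.
        self._create_incident = create_incident
        self._production_account_id = production_account_id
        self.active_incident_id: Optional[str] = None

    def handle(self, event: dict) -> Optional[str]:
        """Return the new incident id, or None if nothing was created."""
        if not is_failed_production_backup(event, self._production_account_id):
            return None
        if self.active_incident_id is not None:
            return None  # an incident is already open; do not create another
        job_id = event["detail"].get("backupJobId", "unknown")
        self.active_incident_id = self._create_incident(
            "SEV-3", f"AWS Backup job failed: {job_id}"
        )
        return self.active_incident_id

    def resolve(self) -> None:
        """Mark the active incident as closed so a later failure can open a new one."""
        self.active_incident_id = None
```

Keeping the "already open" check next to the creation call is what collapses several failed jobs into a single SEV-3 incident in this sketch.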
Expected Incident Created on Production Rollout
Incident by Breaching SLO Monitors
The following is the list of SLOs that will trigger incident creation on production rollout:
Incident by AWS Resource Backup
Who is this announcement for?
All product teams that have or plan to have SLO Monitors on Datadog, and teams that are currently implementing AWS Backup configuration
What do you need to do?
Incident by Breaching SLO Monitors
- The automation will page the team based on the team tag attached to the SLO whenever the SLO triggers incident creation. If the Monitors attached to your SLO are not properly configured, please fix them so they do not raise false alarms for your team
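To check which team would be paged for an SLO, a small helper like the one below can read the team tag from the SLO's tag list. This is a sketch only: it assumes tags are `key:value` strings and that the key is literally `team`; the function name and the example team names are hypothetical, and the automation's real lookup may differ.

```python
from typing import Iterable, Optional


def team_from_tags(tags: Iterable[str], key: str = "team") -> Optional[str]:
    """Return the value of the first `team:<name>` tag, or None if absent."""
    for tag in tags:
        name, sep, value = tag.partition(":")
        # Require a separator and a non-empty value so "team" or "team:" is ignored.
        if sep and name.strip().lower() == key and value.strip():
            return value.strip()
    return None


# Example with made-up tags: the SLO below would route to "payments".
print(team_from_tags(["env:production", "team:payments"]))
```

An SLO for which this returns `None` has no usable team tag, so it is worth adding one before the production rollout.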
Incident by AWS Resource Backup
Why is this needed?
The advantages of having an automatic way of creating an incident:
- Helps service teams get alerted automatically whenever a monitor attached to one of their SLOs breaches its threshold, so they can maintain their SLO targets
- Helps service teams track their MTTR (mean time to restore) metrics, which help stakeholders correlate change activity to system stability, track problematic applications, and identify opportunities to improve the stability of applications
Timeline
- 2022-04-18, Pre-release notice
- 2022-05-17, Pre-release notice
- 2022-05-17, Release the automation to production
Appendix