Incident by Breaching SLO Monitors
Some product teams in Traveloka already use SLOs to monitor their service performance from a user/business perspective. Any monitor attached to an SLO that breaches its threshold should be treated as an incident, but there is currently no automatic way to create an incident from a breaching monitor.

Incident by AWS Resource Backup
We are currently adopting AWS Backup Manager and are aware that backup jobs sometimes fail. When a backup job fails, the team will be alerted and a SEV-3 incident will be declared. Incidents will be declared only for the Production account, and only one incident will be active at a time, which means multiple failed jobs still result in a single incident.

How the automation works

The automation collects the status of all Datadog Monitors for all SLOs
If any Monitor attached to an SLO has been in an alarm state for a certain duration, or has switched intermittently between ON and OFF within a certain period, the automation creates a new incident with severity SEV-2 for that specific SLO
If all Monitors attached to an SLO that has an open incident created by the automation have stayed in the OK state for a certain duration, with no intermittent ON and OFF records, the automation resolves the incident for that SLO
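
The create/resolve rules above can be sketched as a small decision function. This is only an illustration, not the automation's actual code: the announcement does not state the real durations, so the thresholds below (ALERT_DURATION, FLAP_WINDOW, FLAP_COUNT, OK_DURATION) and the MonitorHistory type are hypothetical, and the sketch works on plain state-transition records rather than calling Datadog.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical thresholds: the announcement only says "a certain duration".
ALERT_DURATION = timedelta(minutes=10)  # sustained alarm before an incident
FLAP_WINDOW = timedelta(minutes=30)     # period inspected for ON/OFF flapping
FLAP_COUNT = 3                          # alarm transitions that count as flapping
OK_DURATION = timedelta(minutes=30)     # quiet time required before resolving

Transition = Tuple[datetime, str]  # (timestamp, "alert" or "ok")


@dataclass
class MonitorHistory:
    name: str
    transitions: List[Transition]  # state changes, oldest first, non-empty


def is_flapping(m: MonitorHistory, now: datetime) -> bool:
    """True if the monitor entered the alarm state repeatedly within the window."""
    recent = [t for t, s in m.transitions if s == "alert" and now - t <= FLAP_WINDOW]
    return len(recent) >= FLAP_COUNT


def is_breaching(m: MonitorHistory, now: datetime) -> bool:
    """Sustained alarm, or intermittent ON and OFF within the window."""
    last_time, last_state = m.transitions[-1]
    sustained = last_state == "alert" and now - last_time >= ALERT_DURATION
    return sustained or is_flapping(m, now)


def is_stable_ok(m: MonitorHistory, now: datetime) -> bool:
    """OK with no state change at all for at least OK_DURATION."""
    last_time, last_state = m.transitions[-1]
    return last_state == "ok" and now - last_time >= OK_DURATION


def decide(monitors: List[MonitorHistory], incident_open: bool, now: datetime) -> str:
    """Return "create", "resolve", or "none" for a single SLO."""
    if not incident_open and any(is_breaching(m, now) for m in monitors):
        return "create"   # would open a SEV-2 incident for this SLO
    if incident_open and monitors and all(is_stable_ok(m, now) for m in monitors):
        return "resolve"  # every attached Monitor has been quietly OK
    return "none"
```

Note that a single flapping Monitor is enough to open an incident, while resolving requires every Monitor on the SLO to be stable.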

The automation collects data from all AWS accounts using EventBridge
If a failed backup job comes from the Production account, the automation invokes the Automatic Incident API and creates an incident
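
As a rough sketch of the backup flow, the handler below filters an EventBridge event and opens at most one incident. Several things here are assumptions rather than facts from this announcement: the account ID is a placeholder, has_active_incident and create_incident are hypothetical stand-ins for the Automatic Incident API, and the event field names follow the shape AWS Backup is understood to emit ("Backup Job State Change" with a "state" of "FAILED"), which should be checked against the AWS documentation.

```python
from typing import Callable

# Hypothetical placeholder; replace with the real Production account ID.
PRODUCTION_ACCOUNT_ID = "111111111111"


def is_failed_production_backup(event: dict) -> bool:
    """True only for a failed AWS Backup job reported by the Production account."""
    return (
        event.get("source") == "aws.backup"
        and event.get("detail-type") == "Backup Job State Change"
        and event.get("detail", {}).get("state") == "FAILED"
        and event.get("account") == PRODUCTION_ACCOUNT_ID
    )


def handle_event(
    event: dict,
    has_active_incident: Callable[[], bool],
    create_incident: Callable[..., None],
) -> bool:
    """Create a SEV-3 incident unless one is already active. Returns True if created.

    Both callables are hypothetical stand-ins for the Automatic Incident API.
    """
    if not is_failed_production_backup(event):
        return False  # other accounts and non-failure states are ignored
    if has_active_incident():
        return False  # only one active incident at a time
    job_id = event.get("detail", {}).get("backupJobId", "unknown")
    create_incident(severity="SEV-3", title=f"AWS Backup job failed: {job_id}")
    return True
```

Passing the incident lookups in as callables keeps the filtering and the single-incident rule testable without touching any real API.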

Expected Incident Created on Production Rollout

Incident by Breaching SLO Monitors
The following is the list of SLOs that will trigger incident creation on production rollout:

Who is this announcement for?

All product teams that have, or plan to have, SLO Monitors on Datadog, and teams that are currently implementing an AWS Backup configuration

What do you need to do?

The automation will automatically page the team identified by the team tag attached to the SLO whenever that SLO triggers incident creation. If the Monitors attached to your SLO are not properly configured, please fix them so they do not raise false alarms for your team.
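
To check which team an SLO would page, a minimal sketch like the following may help. It assumes the team tag uses the common Datadog "key:value" form with the key "team"; the exact tag key the automation reads is not stated in this announcement, and team_from_tags is a hypothetical helper, not part of the automation.

```python
from typing import List, Optional


def team_from_tags(tags: List[str], key: str = "team") -> Optional[str]:
    """Return the value of the first "key:value" tag matching key, or None."""
    for tag in tags:
        name, sep, value = tag.partition(":")
        if sep and name.strip().lower() == key and value.strip():
            return value.strip()
    return None  # no usable team tag, so no team could be derived from it


# Example with made-up tags:
print(team_from_tags(["env:production", "team:payments"]))
```

An SLO for which this returns None has no usable team tag under the assumption above, so it is worth confirming the tag before the rollout.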

Why is this needed?

Help service teams get alerted automatically whenever a monitor attached to one of their SLOs breaches its threshold, so they can maintain their SLO targets
Help service teams track their MTTR (mean time to restore) metrics, which help stakeholders correlate change activity with system stability, track problematic applications, and identify opportunities to improve application stability

[PSA] Traveloka - Datadog Automatic Incident Creation
