[PSA] Traveloka - Datadog Automatic Incident Creation

Overview

Incident by Breaching SLO Monitors
Some product teams in Traveloka already use SLOs to monitor their service performance from a user/business perspective. Any monitor attached to an SLO that breaches its threshold should be identified as an incident. Currently, there is no automatic way to create an incident from a breaching monitor.

Incident by AWS Resource Backup
We are currently adopting AWS Backup Manager and are aware that backup jobs sometimes fail. When a backup job fails, teams will be alerted and a SEV-3 incident will be declared. The incident will be declared on the Production account only, and there will be only one active incident at a time, which means several failed jobs can share a single incident.
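The "Production only, one active incident at a time" rule above can be sketched as follows. This is a minimal illustration, not the actual automation: the `BackupIncidentTracker` class, the `"production"` account label, and the choice to record failed job IDs on the open incident are all assumptions made for the example, and no Datadog or AWS API is called.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PRODUCTION = "production"  # hypothetical label for the Production account


@dataclass
class Incident:
    severity: str
    failed_job_ids: List[str] = field(default_factory=list)
    active: bool = True


class BackupIncidentTracker:
    """Hypothetical in-memory model of the backup incident rule."""

    def __init__(self) -> None:
        self.incidents: List[Incident] = []

    def active_incident(self) -> Optional[Incident]:
        # Return the currently open incident, if there is one.
        for incident in self.incidents:
            if incident.active:
                return incident
        return None

    def on_backup_job_failed(self, account: str, job_id: str) -> Optional[Incident]:
        # Failures outside the Production account never declare an incident.
        if account != PRODUCTION:
            return None
        incident = self.active_incident()
        if incident is None:
            # No open incident yet: declare a new SEV-3.
            incident = Incident(severity="SEV-3")
            self.incidents.append(incident)
        # Assumption: later failures are attached to the open incident
        # instead of declaring another one.
        incident.failed_job_ids.append(job_id)
        return incident
```

In this sketch, two failed jobs on the Production account end up on the same SEV-3 incident, while a failed job on any other account returns `None`. A new incident is declared only after the previous one is no longer active.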

This PSA was originally posted at:
[PSA] Automatic Incident Creation Based on SLO Monitors

How the automation works

Incident by Breaching SLO Monitors

Incident by AWS Resource Backup

Expected Incident Created on Production Rollout

Incident by Breaching SLO Monitors
The following SLOs will trigger incident creation on production rollout:

Incident by AWS Resource Backup

Who is this announcement for?

All product teams that have, or plan to have, SLO monitors on Datadog, as well as teams that are currently implementing AWS Backup configuration.

What do you need to do?

Incident by Breaching SLO Monitors

Incident by AWS Resource Backup

Why is this needed?

The advantages of having an automatic way of creating an incident:

Timeline

Appendix