Launch of ECI Warden for tvlk-eci-dev

ECI is launching Warden to tvlk-eci-dev on Monday, 3 February 2020. Please read on for some important information :point_down::skin-tone-2:

Q: Can I have some more context on this tool?
A: You can see the socialisation slide deck and backlog tasks here (https://tvlk.slack.com/archives/CG0PUD2U8/p1579079973070100)

Q: What are the resources being auto-managed at this stage?
A: RDS instances, EC2 non-bastion instances (see next question) and Fargate services.

Q: Why exclude bastion instances?
A: We found that most users place files (e.g. the sqitch folder) in the $HOME directory instead of the /sda directory, which is backed by the EBS volume. So when the instance gets restarted, everything in $HOME is lost, even though we have persistent EBS storage. Another reason is that SSH keys would need to be re-generated and re-registered because of the new instance. Note that this exclusion is temporary: CorpTech DevOps are working on a sqitch automation solution (https://29022131.atlassian.net/browse/CTD-404).

Q: How do I exclude resources?
A: This is an attempt to cut infra costs, so please don’t exclude your resource if you are not using it :bows: But if absolutely required, you can exclude a resource from auto shutdown and auto startup by adding a tag with the key-value pair AutoManage=false.
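
To make the tag rule concrete, here is a minimal sketch of how the exclusion could be evaluated. This is an illustration, not Warden's actual code: the function name is_auto_managed is hypothetical, and the exact-match comparison on key and value is an assumption, since how Warden treats casing is not stated here.

```python
def is_auto_managed(tags: dict[str, str]) -> bool:
    """Return False only when the resource carries the tag AutoManage=false.

    Assumption: key and value are matched exactly as written in this post;
    Warden's real matching rules may differ.
    """
    return tags.get("AutoManage") != "false"


# A resource without the tag stays under auto shutdown / auto startup.
print(is_auto_managed({"Service": "example"}))
# A resource tagged AutoManage=false is excluded.
print(is_auto_managed({"Service": "example", "AutoManage": "false"}))
```

In other words, resources are managed by default, and only an explicit AutoManage=false opts them out.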

Q: Who has permission to add tags?
A: All users should add tags through Terraform.
Example for Postgres:

Update your Postgres RDS module.
Update the version in the caller module.

But again, let’s have valid justification for excluding resources or else it defeats the purpose of this tool :smile:

Q: What is the downtime enforced?
A: Right now 11PM SGT to 8AM SGT, i.e. 10PM WIB to 7AM WIB. Open to feedback.
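
If you want to check whether a given moment falls inside this window, here is a small sketch. It assumes fixed UTC offsets (SGT = UTC+8, WIB = UTC+7), treats the 8AM end as exclusive, and only models the nightly window stated above; the names in_downtime, DOWNTIME_START and DOWNTIME_END are made up for illustration and are not part of Warden.

```python
from datetime import datetime, time, timedelta, timezone

# Fixed offsets: neither Singapore nor Western Indonesia observes DST.
SGT = timezone(timedelta(hours=8), "SGT")
WIB = timezone(timedelta(hours=7), "WIB")

DOWNTIME_START = time(23, 0)  # 11PM SGT
DOWNTIME_END = time(8, 0)     # 8AM SGT (assumed exclusive)


def in_downtime(moment: datetime) -> bool:
    """Return True if a timezone-aware moment is inside the nightly window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(SGT).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return local >= DOWNTIME_START or local < DOWNTIME_END


# 10:30PM WIB is 11:30PM SGT, which is inside the window.
print(in_downtime(datetime(2020, 2, 3, 22, 30, tzinfo=WIB)))
# 9:59PM WIB is 10:59PM SGT, which is still before the window starts.
print(in_downtime(datetime(2020, 2, 3, 21, 59, tzinfo=WIB)))
```

Converting to SGT first means the same check works whether you pass a WIB, SGT or UTC timestamp.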

Q: Will there be many alerts going off during the downtime?
A: YES. This is something we observed during the test run. One of our pilot teams had numerous monitors on Datadog, which were triggered once their RDS instances were switched off.

To solve this, we applied the Datadog Downtime feature to those resources, filtering them by Service/Project (https://docs.datadoghq.com/monitors/downtimes/#manage-downtime). Please enable this if you have Datadog monitors for your resources.

You need to set up 2 downtimes:

1 for Monday to Friday outside office hours

https://app.datadoghq.com/monitors#downtime?id=805700555

1 for the whole Saturday and Sunday

https://app.datadoghq.com/monitors#downtime?id=803344748

Note that the time window is larger than the actual downtime because the AWS resources take some time to warm up and stabilize.

Q: If I need to use my resource during downtime, how do I proceed?
A: This need falls into 2 types: planned and sudden.

For planned needs: Add exclusion tags through a Terraform PR and get it approved.

For sudden needs: Simply turn on the resource, use it and turn it back off.

Q: I need to use multiple databases and Fargate services during downtime, how do I turn them on & off in bulk?
A: You may request help from any PowerUser to execute the AutoStart & AutoStop Lambda functions. However, do note that this will affect all resources, so if it’s not too troublesome to turn them on/off manually, please do so.

We also have a backlog feature request to add a Lambda function that turns on/off all resources of a specific service: https://29022131.atlassian.net/browse/PHOENIX-1928

P.S. You can help to develop this too :thumbsup::skin-tone-2:

Q: Who has permission to run the AutoStart & AutoStop Lambda functions?
A: For now, only PowerUsers. Even then, we should not run this manually on an ad-hoc basis as it will turn on/off ALL resources in ECI multi-account, which should almost never be required.

Q: Do you have a Slack channel for notifications?
A: Yes, it’s #eci-warden-dev