ECI is launching Warden to tvlk-eci-dev on Monday, 3 February 2020. Please read on for some important information :point_down::skin-tone-2:
Q: Can I have some more context on this tool?
A: You can see the socialisation slide deck and backlog tasks here (https://tvlk.slack.com/archives/CG0PUD2U8/p1579079973070100)
Q: What are the resources being auto-managed at this stage?
A: RDS instances, EC2 non-bastion instances (see next question) and Fargate services.
Q: Why exclude bastion instances?
A: We found that most users place files (e.g. the sqitch folder) in the $HOME directory instead of the /sda directory, which is mounted from the EBS volume. So when the instance gets restarted, all data is lost, even though we have persistent EBS storage. Another reason is that SSH keys would need to be regenerated and re-registered for the new instance. Note that this exclusion is temporary, as CorpTech DevOps are working on a sqitch automation solution (https://29022131.atlassian.net/browse/CTD-404).
Q: How do I exclude resources?
A: This is an attempt to cut infra costs, so please don’t exclude your resource if you are not using it :bows:
Q: Who has permission to add tags?
A: All users should add tags through Terraform.
Example for Postgres:
But again, let’s have valid justification for excluding resources or else it defeats the purpose of this tool :smile:
Q: What is the downtime enforced?
A: Right now 11PM SGT to 8AM SGT, i.e. 10PM WIB to 7AM WIB. Open to feedback.
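To make the window concrete, here is a minimal sketch of checking whether a given moment falls inside the downtime. It is not part of Warden; the function name `in_downtime` is hypothetical, and it assumes the window applies every night and that 8AM SGT itself counts as "back on".

```python
from datetime import datetime, time, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")  # Singapore time, UTC+8
WIB = timezone(timedelta(hours=7), "WIB")  # Western Indonesia time, UTC+7

DOWNTIME_START = time(23, 0)  # 11PM SGT
DOWNTIME_END = time(8, 0)     # 8AM SGT (assumed exclusive)


def in_downtime(moment: datetime) -> bool:
    """Return True if the timezone-aware `moment` is inside the nightly window."""
    local = moment.astimezone(SGT).time()
    # The window crosses midnight, so it is [23:00, 24:00) plus [00:00, 08:00).
    return local >= DOWNTIME_START or local < DOWNTIME_END
```

For example, 10:30PM WIB converts to 11:30PM SGT, so it counts as downtime.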
Q: Will there be many alerts going off during the downtime?
A: YES. This is something we observed during the test run. One of our pilot teams had numerous monitors on Datadog, which were triggered once their RDS instances were switched off.
To solve this, we applied the Datadog Downtime feature to those resources, filtering them by Service/Project (https://docs.datadoghq.com/monitors/downtimes/#manage-downtime). Please enable this if you have Datadog monitors for your resources.
You need to set up 2 downtimes:
Note that the time window is larger than the actual downtime because the AWS resources take some time to warm up and stabilize.
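As a rough illustration of a padded window, the sketch below computes a mute window around the 11PM to 8AM SGT downtime. The 30-minute padding values and the name `mute_window` are assumptions made for illustration only, not the values we actually configured.

```python
from datetime import date, datetime, time, timedelta, timezone

SGT = timezone(timedelta(hours=8), "SGT")  # Singapore time, UTC+8


def mute_window(
    night: date,
    pad_before: timedelta = timedelta(minutes=30),  # hypothetical padding
    pad_after: timedelta = timedelta(minutes=30),   # hypothetical warm-up allowance
):
    """Return (start, end) of a monitor mute window for the downtime beginning on `night`."""
    resources_off = datetime.combine(night, time(23, 0), tzinfo=SGT)
    resources_on = datetime.combine(night + timedelta(days=1), time(8, 0), tzinfo=SGT)
    return resources_off - pad_before, resources_on + pad_after
```

If a tool expects epoch seconds, `start.timestamp()` and `end.timestamp()` give the equivalent values.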
Q: If I need to use my resource during downtime, how do I proceed?
A: This need falls into two types: planned and sudden.
For planned needs: Add exclusion tags through a Terraform PR and get it approved.
For sudden needs: Simply turn on the resource, use it and turn it back off.
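For the sudden case with an RDS instance, a minimal sketch using boto3 could look like the following. The helper name `rds_temporarily_on` is hypothetical; the client is passed in so the helper itself does not depend on how your credentials are set up. It only covers RDS, and it assumes you have permission to start and stop the instance.

```python
from contextlib import contextmanager


@contextmanager
def rds_temporarily_on(rds, db_instance_id: str):
    """Start an RDS instance, wait until it is available, and stop it again afterwards.

    `rds` is expected to be a boto3 RDS client, e.g. boto3.client("rds").
    """
    rds.start_db_instance(DBInstanceIdentifier=db_instance_id)
    try:
        # Block until the instance reports as available before handing control back.
        rds.get_waiter("db_instance_available").wait(
            DBInstanceIdentifier=db_instance_id
        )
        yield
    finally:
        # Always turn the instance back off, even if the work in between fails.
        rds.stop_db_instance(DBInstanceIdentifier=db_instance_id)
```

Usage would be along the lines of `with rds_temporarily_on(boto3.client("rds"), "my-db"): ...`, where `my-db` is a placeholder identifier. The `finally` block is there so the "turn it back off" step is not forgotten.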
Q: I need to use multiple databases and Fargate services during downtime, how do I turn them on & off in bulk?
A: You may request help from any PowerUser to execute the AutoStart & AutoStop Lambda functions. However, do note that this will affect all resources, so if it’s not too troublesome to turn them on/off manually, please do so.
We also have a backlog feature request to add a Lambda function that turns on/off all resources of a specific service: https://29022131.atlassian.net/browse/PHOENIX-1928
P.S. You can help to develop this too :thumbsup::skin-tone-2:
Q: Who has permission to run the AutoStart & AutoStop Lambda functions?
A: For now, only PowerUsers. Even then, we should not run this manually on an ad-hoc basis as it will turn on/off ALL resources in ECI multi-account, which should almost never be required.
Q: Do you have a Slack channel for notifications?
A: Yes, it’s #eci-warden-dev