Hisbl-mongod-02 Re-sync
h1. Issue Details
h2. Summary
On November 14th, hisbl-mongod-02 received a sudden spike of incoming queries. As a result, the node became stuck and could no longer handle any traffic, including inserts, updates, deletes, and replication from the primary.
h2. Chronology
Timezone: GMT +7
h3. 2017-11-14
- 18:40:00 - hisbl-mongod-02 received a sudden spike of incoming queries.
- 18:40:00 - 20:30:00 - hisbl-mongod-02’s replication lag kept increasing, reaching 1.28 hours.
- 21:25:00 - hisbl-mongod-02’s replication lag dropped to 42 seconds.
- 21:30:00 - Nov 15 10:50:00 - hisbl-mongod-02’s replication lag reached 10 hours.
h3. 2017-11-15
- 10:50:00 - Received a report that pricing tracking was down. Switched readPreference to hisbl-mongod-03.
- 11:10:00 - Restarted hisbl-mongod-02. It had already entered the stale state.
- 20:10:00 - Finished creating a Tower playbook to back up all mongo data files and ran the backup via Tower. It failed because the root volume did not have enough space (8 GB left); the EBS volume was attached only to /var/lib/mongod. Mongo could no longer start at all.
h3. 2017-11-16
- 10:40:00 - Long-running queries occurred.
- 10:55:00 - Restarted hisbl-mongod-03. Pricing tracking worked as usual again.
- 12:00:00 - Freed up some space on the hisbl-mongod-02 root volume, deleted all subfolders of /var/lib/mongodb, and restarted mongodb. The re-sync process started. Mongo’s state was starting.
- 23:59:00 - The re-sync process seemed to be over and the state had changed to recovering, but no mongo metrics were showing. All mongostat metrics were off.
h3. 2017-11-17
- 10:30:00 - Restarted hisbl-mongod-02. Mongo stats showed up again. The state went back to starting and the re-sync started from the beginning. The replication lag reported by mongostat became 47.9 years.
- 16:30:00 - hisbl-mongod-02 finished the re-sync. The state changed to recovering and the high disk reads per second stopped.
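The 47.9-year replication lag reported on 2017-11-17 is most likely a monitoring artifact rather than real lag. A plausible explanation, stated here as an assumption and not confirmed by the incident data, is that a node that has just restarted its initial sync reports an empty optime (the Unix epoch, 1970-01-01), so the lag is computed as "now minus 1970". The sketch below only checks that arithmetic; the function name is illustrative.

```python
from datetime import datetime, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def lag_in_years(primary_optime, secondary_optime):
    """Replication lag expressed in years (primary optime minus secondary optime)."""
    return (primary_optime - secondary_optime).total_seconds() / SECONDS_PER_YEAR


# Assumption: a secondary with no applied operations reports the Unix epoch.
empty_optime = datetime(1970, 1, 1, tzinfo=timezone.utc)

# 2017-11-17 10:30 in GMT+7 is 03:30 UTC.
restart_time = datetime(2017, 11, 17, 3, 30, tzinfo=timezone.utc)

print(round(lag_in_years(restart_time, empty_optime), 1))  # prints 47.9
```

If that assumption holds, a lag of roughly 47.9 years simply means "initial sync has not applied anything yet" and should not be read as a real measurement.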
h2. Symptoms
- hpt-app did not respond for pages that accessed the troubled hisbl-mongod node.
- There was a sudden spike of queries at that time. The spike also caused a high number of mongo faults and blocked other operations, such as inserts, updates, and deletes.
- Replication lag appeared and grew to several hours.
h2. Impact
First, the pricing operations team and other product managers were disrupted because they could not check or analyze pricing correctness. Even though the code sets secondaryReadPreference = true, queries were still directed to hisbl-mongod-02, which resulted in failed data retrievals.
In addition, hisbl-mongod-02 went stale while trying to sync with the mongo primary. This was caused by late mitigation: the incident was noticed in the morning at around 09:40, but the first symptoms had already started the day before at 18:30, more than 12 hours earlier. As a result, the secondary could no longer sync with the mongo primary.
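In MongoDB, a secondary becomes stale once it falls so far behind that the operations it still needs have been overwritten in the primary's oplog. Comparing the lag with the oplog window therefore shows how much time is left to react. The following is a minimal sketch of that comparison. The oplog window for this replica set was not recorded, so the 8-hour value is purely hypothetical, as are the function name and the 50% warning threshold. The lag values come from the chronology.

```python
from datetime import timedelta


def sync_risk(lag, oplog_window, warn_ratio=0.5):
    """Classify a secondary by comparing its lag with the primary's oplog window."""
    if lag >= oplog_window:
        return "stale"    # needed operations are no longer in the oplog
    if lag >= oplog_window * warn_ratio:
        return "at risk"  # still recoverable, but action is needed soon
    return "ok"


# Hypothetical value: the real oplog window was not recorded in this report.
ASSUMED_OPLOG_WINDOW = timedelta(hours=8)

observations = [
    ("Nov 14 20:30", timedelta(hours=1.28)),
    ("Nov 14 21:25", timedelta(seconds=42)),
    ("Nov 15 10:50", timedelta(hours=10)),
]

for when, lag in observations:
    print(when, sync_risk(lag, ASSUMED_OPLOG_WINDOW))
```

Under the assumed window, the first two observations would be classified as "ok" and the 10-hour lag as "stale". An alert on the "at risk" state would have fired during the night, well before the incident was noticed.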
h2. Mitigation
- Excluded the slow secondary from queries (quick action) by switching to hisbl-mongod-03
- Disabled the page in HPT that is suspected of generating the heavy query
- Got the SOP for the re-sync process: https://29022131.atlassian.net/servicedesk/customer/portal/11/TOSD-2480
- Restarted mongo services via Tower
- Deleted old mongo data in /var/lib/mongodb/*
- Re-synced hisbl-mongod-02 with the primary
- Tried to scp one file from hisbl-mongod-03 to hisbl-mongod-02, but it failed because of an access problem
- Delete pages that create lots of queries to hisbl-mongod-
- TBD -- mongo hasn’t been successfully re-synced
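The first mitigation step, excluding a slow secondary from reads, can also be delegated to the driver. MongoDB 3.4 and later support the maxStalenessSeconds connection option (minimum 90 seconds), which makes drivers skip secondaries whose estimated lag exceeds the limit. The sketch below only builds such a connection string. Whether hpt-app and this cluster's MongoDB version support the option is an assumption, and the replica set name "rs0", the host hisbl-mongod-01, and the port 27017 are hypothetical placeholders.

```python
from urllib.parse import urlencode

# MongoDB rejects maxStalenessSeconds values below 90.
MIN_MAX_STALENESS_SECONDS = 90


def build_read_uri(hosts, replica_set, max_staleness_seconds=120):
    """Build a connection string that prefers secondaries within a lag limit."""
    if max_staleness_seconds < MIN_MAX_STALENESS_SECONDS:
        raise ValueError("maxStalenessSeconds must be at least 90")
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": "secondaryPreferred",
        "maxStalenessSeconds": max_staleness_seconds,
    })
    return "mongodb://" + ",".join(hosts) + "/?" + options


# Hypothetical host list and replica set name, for illustration only.
uri = build_read_uri(
    ["hisbl-mongod-01:27017", "hisbl-mongod-02:27017", "hisbl-mongod-03:27017"],
    replica_set="rs0",
)
print(uri)
```

With a setting like this, reads would move away from a lagging node without a manual switch of readPreference, although the node would still need a re-sync once it goes stale.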
h2. References