Incident ticket
h1. Issue Details
h2. Summary
An edit to geo data triggered an infinite loop and a high number of open files, which crashed the backstage clusters (NE, APRSAPI). This took down the products that depend on them: primarily Hotel, Hotel+Flight, and Experience, and to a lesser extent Train, Flight, and Connectivity. By ~11 PM on Sept 17, all problematic geo data had been corrected, but the geo cache in aprsg-memcached, which aprsapi-app currently uses, still contained the wrong data because it keeps entries for up to 6 hours. Flushing the memcache was therefore the only reasonable way to restore every cluster that processes cached geo data.
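The stale-cache behaviour described above can be illustrated with a minimal sketch. It is not the aprsapi-app or memcached code: the {{TtlCache}} class, the {{geo-1}} key, and the dictionary standing in for NE's DB are all hypothetical, and only the 6-hour retention comes from this ticket. It shows why correcting the source data does not help readers until the entry expires or the cache is flushed.

```python
import time

SIX_HOURS = 6 * 60 * 60  # cache retention stated in the summary, in seconds


class TtlCache:
    """Tiny in-memory stand-in for a memcached-style cache (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry timestamp)

    def get(self, key, loader):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]  # still fresh: the database is not consulted
        value = loader(key)
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def flush_all(self):
        # Drop every entry so the next read goes back to the database.
        self._entries.clear()


# Simulated timeline driven by a fake clock (assumption: one bad record).
now = [0.0]
database = {"geo-1": "bad"}
cache = TtlCache(SIX_HOURS, clock=lambda: now[0])

before_fix = cache.get("geo-1", database.get)   # bad record gets cached
database["geo-1"] = "corrected"                 # data fixed at the source
now[0] += 60 * 60                               # one hour later
after_fix = cache.get("geo-1", database.get)    # cache still serves "bad"
cache.flush_all()
after_flush = cache.get("geo-1", database.get)  # reloaded from the database
```

In this sketch the corrected value only becomes visible after {{flush_all()}}, which mirrors why the flush request was needed even though the data had already been fixed.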
h2. Chronology
Timezone: GMT+7
h3. 2017-09-17
- 08:30 -- All systems were up and products were healthy. Non-Indonesia geos were disabled; all NE instances were in use.
- 13:00 -- neA, which contains various fixes from BEI, was in use in production; neB was prepared as a hot backup.
- 13:08 -- Hotel was down; the aprsapi lbint contained no healthy instances.
- 13:47 -- Hotel product recovered.
- 17:00 -- Enablement of all geos started.
- 17:01 -- A fix was released to aprsapi.
- 17:28 -- All geos were enabled
- 17:38 -- Hotel and Experience products were down. All geos except ID were disabled.
- 17:58 -- All aprsapi instances were restarted.
- 16:10 -- Hotel APIs were opened, using neB. The Hotel detail page was not working. aprsapi instances went down one by one.
- 16:12 -- Hotel was down; APIs were routed off.
- 16:20 -- aprsapi was rolled back to the release version from 3 weeks earlier.
- 16:27 -- Hotel was back online.
- 19:52 -- Hotel product was down; all aprsapi instances were tagged unhealthy. NE was fine.
- 20:06 -- Bad geo data was suspected; 2 problematic geo IDs were found.
- 20:46 -- Geo data was corrected in NE's DB; neA was restarted.
- 21:18 -- Hotel was back online.
- 22:30 -- A fix preventing the infinite loop in aprsapi was released; Hotel product was stable.
- 22:49 -- Another source of the infinite loop was discovered. aprsapi was also using cached geo data, and the 2 problematic geo records were still cached.
- 23:28 -- A memcache flush request was created.
- 23:45 -- The flush was done. Hotel product has been stable since.
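One common way to guard against an infinite loop caused by bad hierarchical data is sketched below. The ticket does not describe the actual defect or the fix that was released, so this rests on an assumption: that each geo record points to a parent geo ID and that the bad edit made those references circular. The {{ancestors}} function, the {{parent_of}} mapping, and the {{max_depth}} limit are hypothetical names used for illustration only.

```python
def ancestors(geo_id, parent_of, max_depth=32):
    """Return the parent chain above geo_id.

    Raises ValueError instead of looping forever when the parent
    references form a cycle or the chain is implausibly deep.
    """
    seen = {geo_id}
    chain = []
    current = parent_of.get(geo_id)
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in geo hierarchy at {current!r}")
        if len(chain) >= max_depth:
            raise ValueError(f"geo hierarchy deeper than {max_depth}")
        seen.add(current)
        chain.append(current)
        current = parent_of.get(current)
    return chain


# Well-formed data: city -> region -> country (hypothetical IDs).
good = {"city": "region", "region": "country"}
good_chain = ancestors("city", good)

# Malformed data: two records that point at each other.
bad = {"a": "b", "b": "a"}
try:
    ancestors("a", bad)
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

Using an iterative walk with a visited set, rather than recursion, also avoids the stack growth that unbounded recursion over such data would cause.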
h2. Symptoms
- Many blocked threads in the tv-general thread pool on aprsapi and NE
- StackOverflow errors on aprsapi
- High number of open files on aprsapi and NE
- Out-of-memory errors, causing the Java app to stop
h2. Impact
- Hotel was only partially up during 2017-09-17, and the Experience and Package products were mostly down. Had the memcache not been flushed when it was, Hotel and Experience would have continued to suffer downtime.
h2. References