On-call Report 2-8 August 2021
Action taken:
Muffle false alarm when an experience isn't found (Get product detail scraping) (xpepapi)
Symptoms:
- Sudden, large increase in API request count, with the elevated traffic staying consistent for a period of time
- Many requests have an invalid payload (the experienceId field contains a word instead of a numeric ID)
- Request exception count is high
Action taken:
- Returning an empty response for an invalid experienceId (see the sketch below)
Related thread:
https://tvlk.slack.com/archives/C016KJRAT5Y/p1627862571017500
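A minimal sketch of that guard, assuming hypothetical class and method names rather than the real xpepapi code:

    import java.util.Optional;

    // Hypothetical guard (names are illustrative, not the real xpepapi code):
    // a non-numeric experienceId is treated as "not found" and an empty result
    // is returned instead of throwing, so scraping traffic with junk IDs no
    // longer shows up as request exceptions.
    public class ExperienceIdGuard {

        // Returns the parsed numeric ID, or empty when the payload is invalid.
        public static Optional<Long> parseExperienceId(String experienceId) {
            if (experienceId == null || !experienceId.matches("\\d{1,18}")) {
                return Optional.empty();
            }
            return Optional.of(Long.parseLong(experienceId));
        }

        public static void main(String[] args) {
            System.out.println(parseExperienceId("12345"));   // Optional[12345]
            System.out.println(parseExperienceId("holiday")); // Optional.empty -> return empty response
        }
    }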
One time issues:
High latency from POT (xpedisc, xpesrch)
Symptoms:
- Sudden, large increase in RPC client request count from our side.
- From our logs, we can see many error messages containing the word GEN-TOO-MANY-REQUESTS. It's an error from BMG, likely caused by its rate limiter.
Related thread:
High latency from AXES (through POT) (xpesrch)
Symptoms:
- High RPC client latency on xpesrch
Root cause:
- There was an allotment update from AXES, triggered by someone.
Related thread:
https://tvlk.slack.com/archives/C016KJRAT5Y/p1628155855046900
Pending tasks/bugs:
High latency on the regenaratepdfvoucherasynce method call (xpeops, xpebi)
Symptoms:
- Many UPDATE_BOOKING events arrive at xpebi at the same moment, which triggers voucher regeneration
- xpeops consumes the events and regenerates many vouchers at the same time
- CPU usage and system load are high
Proposed solution:
- Limit the rate of voucher regeneration by throttling event consumption (see the sketch after this item)
Jira ticket:
Related thread:
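A sketch of what the throttling could look like; the consumer class, method names, the 5-per-second rate, and the use of Guava's RateLimiter are all assumptions, not the agreed implementation:

    import com.google.common.util.concurrent.RateLimiter;

    // Illustrative consumer (names and the rate are assumptions, not the real
    // xpeops code): cap how many UPDATE_BOOKING events are handled per second
    // so a burst of events no longer regenerates every voucher at once.
    public class VoucherRegenerationConsumer {

        // At most 5 regenerations per second; the rest of the burst waits in line.
        private final RateLimiter limiter = RateLimiter.create(5.0);

        public void onUpdateBookingEvent(String bookingId) {
            limiter.acquire();              // blocks until a permit is available
            regenerateVoucher(bookingId);   // the expensive PDF regeneration step
        }

        private void regenerateVoucher(String bookingId) {
            System.out.println("Regenerating voucher for booking " + bookingId);
        }

        public static void main(String[] args) {
            VoucherRegenerationConsumer consumer = new VoucherRegenerationConsumer();
            for (int i = 0; i < 20; i++) {
                consumer.onUpdateBookingEvent("booking-" + i); // burst smoothed to ~5/s
            }
        }
    }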
Recurring high xpedisc latency issues (xpedisc)
Symptoms:
- xpedisc RPC server latency is high
- Occurring on Wednesdays at around 03:50 (GMT+7), for about 30 minutes
Current hypothesis:
- Most of the required cache entries expire at the same time, forcing a refetch of fresh data; because this produces a burst of requests, the provider can't handle them, and in the end many requests time out (a common mitigation is sketched below)
Current action:
- Waiting for the next on-call cycle's PIC to confirm whether the issue has been fixed by the 6 August 2021 release.
Related thread:
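Not the confirmed fix (we are still waiting on the next cycle to verify the 6 August release), but a common guard against simultaneous expiry is to add random jitter to cache TTLs so refetches spread out instead of hitting the provider in one burst. A minimal sketch with illustrative names:

    import java.time.Duration;
    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative only: add up to 20% random jitter to a cache entry's TTL so
    // entries written at the same time do not all expire at the same time.
    public class JitteredTtl {

        // Example: a 30-minute base TTL becomes something between 30 and 36 minutes.
        public static Duration withJitter(Duration baseTtl) {
            long baseMillis = baseTtl.toMillis();
            long jitterMillis = ThreadLocalRandom.current().nextLong(baseMillis / 5 + 1);
            return Duration.ofMillis(baseMillis + jitterMillis);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 3; i++) {
                System.out.println(withJitter(Duration.ofMinutes(30)));
            }
        }
    }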
High latency issue on pricing service (lpcquot) (xpesrch)
Impact:
- There are bookings with an invalid price, because during the timeout the price data is obtained from the cache as a fallback.
Waiting action:
- Investigate the fallback mechanism; we need to make sure the customer can't purchase if the price doesn't come from fresh data (see the sketch after this item)
Jira ticket:
Related thread:
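A sketch of the direction this investigation could take; the types and the freshness check are hypothetical, not the actual lpcquot/xpesrch contract:

    // Sketch of the direction above (all names are hypothetical): tag every
    // quote with where its price came from, and refuse to proceed with a
    // booking when the price is a cached fallback rather than fresh data.
    public class PriceFreshnessCheck {

        enum PriceSource { FRESH, CACHED_FALLBACK }

        record Quote(long priceAmount, PriceSource source) { }

        static void assertPurchasable(Quote quote) {
            if (quote.source() != PriceSource.FRESH) {
                // Fail the purchase instead of silently booking with a stale price.
                throw new IllegalStateException("Price is not fresh; refusing to book");
            }
        }

        public static void main(String[] args) {
            assertPurchasable(new Quote(150_000, PriceSource.FRESH));           // passes
            assertPurchasable(new Quote(150_000, PriceSource.CACHED_FALLBACK)); // throws
        }
    }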
Low free space on xpedata's MongoDB (xpedata)
Jira ticket:
Related thread:
xpedisc high RPC server latency issue (xpedisc)
Symptoms:
- The number of requests to xpedisc starts increasing, usually close to the highest traffic peak, but the increase isn't really significant.
- RPC server latency is high, but RPC client latency is normal (visible in SRS's logs: their RPC server latency isn't as high as our xpedisc RPC server latency)
- The number of open files in xpedisc is high
- The number of requests to xpedata has increased significantly (almost doubled) since the 6 August 2021 release
Current hypothesis:
- There are thread-blocking operations preventing xpedisc from serving further requests
- The overall request volume has increased
Current mitigation:
- Double xpedisc's instance count: minimum from 4 to 8 instances, and maximum from 5 to 10 instances (a longer-term guard is sketched below).
Jira ticket:
Related thread:
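If the thread-blocking hypothesis holds, a longer-term guard (in addition to scaling out) would be to bound how long a request thread can wait on a downstream call. The sketch below is illustrative only; the pool size, timeout, and names are assumptions, not the planned fix:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Illustrative guard, not the confirmed fix: bound how long a request thread
    // waits on a downstream call so a slow dependency times out instead of
    // pinning server threads (and their open sockets/files) indefinitely.
    public class BoundedDownstreamCall {

        private final ExecutorService downstreamPool = Executors.newFixedThreadPool(16);

        public String fetchWithTimeout() throws Exception {
            CompletableFuture<String> call =
                    CompletableFuture.supplyAsync(this::callDownstream, downstreamPool);
            try {
                // Give up after 2 seconds instead of blocking the server thread forever.
                return call.get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                call.cancel(true);
                return "fallback-or-error";
            }
        }

        private String callDownstream() {
            // Stand-in for the real RPC to the downstream data service.
            return "response";
        }

        public static void main(String[] args) throws Exception {
            BoundedDownstreamCall client = new BoundedDownstreamCall();
            System.out.println(client.fetchWithTimeout());
            client.downstreamPool.shutdown();
        }
    }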