On-call Report 2-8 August 2021
Action taken:
Muffle false alarm when an experience isn't found (Get product detail scraping) (xpepapi)
Symptoms:
- Sudden, large increase in API request count, with the elevated traffic staying consistent for a period of time
- Many requests have an invalid payload (the experienceId field contains a word instead of a numeric ID)
- Request exception count is high
Action taken:
- Returning an empty response for an invalid experienceId (see the sketch below)
Related thread:
https://tvlk.slack.com/archives/C016KJRAT5Y/p1627862571017500
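A minimal sketch of that guard, assuming hypothetical class and method names rather than the real xpepapi code:

    import java.util.Optional;

    // Hypothetical guard (names are illustrative, not the real xpepapi code):
    // a non-numeric experienceId is treated as "not found" and an empty result
    // is returned instead of throwing, so scraping traffic with junk IDs no
    // longer shows up as request exceptions.
    public class ExperienceIdGuard {

        // Returns the parsed numeric ID, or empty when the payload is invalid.
        public static Optional<Long> parseExperienceId(String experienceId) {
            if (experienceId == null || !experienceId.matches("\\d{1,18}")) {
                return Optional.empty();
            }
            return Optional.of(Long.parseLong(experienceId));
        }

        public static void main(String[] args) {
            System.out.println(parseExperienceId("12345"));   // Optional[12345]
            System.out.println(parseExperienceId("holiday")); // Optional.empty -> return empty response
        }
    }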
One time issues:
High latency from POT (xpedisc, xpesrch)
Symptoms:
- Sudden, large increase in RPC client request count from our side.
- From our logs, we can see many error messages containing the word GEN-TOO-MANY-REQUESTS. It's an error from BMG, likely caused by its rate limiter.
Related thread:
High latency from AXES (through POT) (xpesrch)
Symptoms:
- High RPC client latency on xpesrch
Root cause:
- There was an allotment update from AXES, triggered by someone.
Related thread:
https://tvlk.slack.com/archives/C016KJRAT5Y/p1628155855046900
Pending tasks/bugs:
High latency on the regenaratepdfvoucherasynce method call (xpeops, xpebi)
Symptoms:
- Many UPDATE_BOOKING events arrive at xpebi at the same moment, which triggers voucher regeneration
- xpeops consumes the events and regenerates many vouchers at the same time
- CPU usage and system load are high
Proposed solution:
- Limit the rate of voucher regeneration by throttling event consumption (see the sketch after this item)
Jira ticket:
Related thread:
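A sketch of what the throttling could look like; the consumer class, method names, the 5-per-second rate, and the use of Guava's RateLimiter are all assumptions, not the agreed implementation:

    import com.google.common.util.concurrent.RateLimiter;

    // Illustrative consumer (names and the rate are assumptions, not the real
    // xpeops code): cap how many UPDATE_BOOKING events are handled per second
    // so a burst of events no longer regenerates every voucher at once.
    public class VoucherRegenerationConsumer {

        // At most 5 regenerations per second; the rest of the burst waits in line.
        private final RateLimiter limiter = RateLimiter.create(5.0);

        public void onUpdateBookingEvent(String bookingId) {
            limiter.acquire();              // blocks until a permit is available
            regenerateVoucher(bookingId);   // the expensive PDF regeneration step
        }

        private void regenerateVoucher(String bookingId) {
            System.out.println("Regenerating voucher for booking " + bookingId);
        }

        public static void main(String[] args) {
            VoucherRegenerationConsumer consumer = new VoucherRegenerationConsumer();
            for (int i = 0; i < 20; i++) {
                consumer.onUpdateBookingEvent("booking-" + i); // burst smoothed to ~5/s
            }
        }
    }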
Recurring high xpedisc latency issues (xpedisc)
Symptoms:
- xpedisc RPC server latency is high
- Occurring on Wednesdays at around 03:50 (GMT+7), for about 30 minutes
Current hypothesis:
- Most of the required cache entries expire at the same time, forcing a refetch of fresh data; because this produces a burst of requests, the provider can't handle them, and in the end many requests time out (a common mitigation is sketched below)
Current action:
- Waiting for the next on-call cycle's PIC to confirm whether the issue has been fixed by the 6 August 2021 release.
Related thread:
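Not the confirmed fix (we are still waiting on the next cycle to verify the 6 August release), but a common guard against simultaneous expiry is to add random jitter to cache TTLs so refetches spread out instead of hitting the provider in one burst. A minimal sketch with illustrative names:

    import java.time.Duration;
    import java.util.concurrent.ThreadLocalRandom;

    // Illustrative only: add up to 20% random jitter to a cache entry's TTL so
    // entries written at the same time do not all expire at the same time.
    public class JitteredTtl {

        // Example: a 30-minute base TTL becomes something between 30 and 36 minutes.
        public static Duration withJitter(Duration baseTtl) {
            long baseMillis = baseTtl.toMillis();
            long jitterMillis = ThreadLocalRandom.current().nextLong(baseMillis / 5 + 1);
            return Duration.ofMillis(baseMillis + jitterMillis);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 3; i++) {
                System.out.println(withJitter(Duration.ofMinutes(30)));
            }
        }
    }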
High latency issue on pricing service (lpcquot) (xpesrch)
Impact:
- There are bookings with an invalid price, because during the timeout the price data is obtained from the cache as a fallback.
Waiting action:
- Investigate the fallback mechanism; we need to make sure the customer can't purchase if the price doesn't come from fresh data (see the sketch after this item)
Jira ticket:
Related thread:
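A sketch of the direction this investigation could take; the types and the freshness check are hypothetical, not the actual lpcquot/xpesrch contract:

    // Sketch of the direction above (all names are hypothetical): tag every
    // quote with where its price came from, and refuse to proceed with a
    // booking when the price is a cached fallback rather than fresh data.
    public class PriceFreshnessCheck {

        enum PriceSource { FRESH, CACHED_FALLBACK }

        record Quote(long priceAmount, PriceSource source) { }

        static void assertPurchasable(Quote quote) {
            if (quote.source() != PriceSource.FRESH) {
                // Fail the purchase instead of silently booking with a stale price.
                throw new IllegalStateException("Price is not fresh; refusing to book");
            }
        }

        public static void main(String[] args) {
            assertPurchasable(new Quote(150_000, PriceSource.FRESH));           // passes
            assertPurchasable(new Quote(150_000, PriceSource.CACHED_FALLBACK)); // throws
        }
    }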
Low free space on xpedata's MongoDB (xpedata)
Jira ticket:
Related thread:
xpedisc high RPC server latency issue (xpedisc)
Symptoms:
- The number of requests to xpedisc starts increasing, usually close to the highest traffic peak, but the increase isn't really significant.
- RPC server latency is high, but RPC client latency is normal (visible in SRS's logs: their RPC server latency isn't as high as our xpedisc RPC server latency)
- The number of open files in xpedisc is high
- The number of requests to xpedata has increased significantly (almost doubled) since the 6 August 2021 release
Current hypothesis:
- There are thread-blocking operations preventing xpedisc from serving further requests
- The overall request volume has increased
Current mitigation:
- Double xpedisc's instance count: minimum from 4 to 8 instances, and maximum from 5 to 10 instances (a longer-term guard is sketched below).
Jira ticket:
Related thread:
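If the thread-blocking hypothesis holds, a longer-term guard (in addition to scaling out) would be to bound how long a request thread can wait on a downstream call. The sketch below is illustrative only; the pool size, timeout, and names are assumptions, not the planned fix:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Illustrative guard, not the confirmed fix: bound how long a request thread
    // waits on a downstream call so a slow dependency times out instead of
    // pinning server threads (and their open sockets/files) indefinitely.
    public class BoundedDownstreamCall {

        private final ExecutorService downstreamPool = Executors.newFixedThreadPool(16);

        public String fetchWithTimeout() throws Exception {
            CompletableFuture<String> call =
                    CompletableFuture.supplyAsync(this::callDownstream, downstreamPool);
            try {
                // Give up after 2 seconds instead of blocking the server thread forever.
                return call.get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                call.cancel(true);
                return "fallback-or-error";
            }
        }

        private String callDownstream() {
            // Stand-in for the real RPC to the downstream data service.
            return "response";
        }

        public static void main(String[] args) throws Exception {
            BoundedDownstreamCall client = new BoundedDownstreamCall();
            System.out.println(client.fetchWithTimeout());
            client.downstreamPool.shutdown();
        }
    }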