Oncall Report 10-16
Action taken
- Release XPE services
- Hotfix xpesrch to mitigate the sold-out issue
- Scale out xpepapi temporarily
- Scale out xpesrch
- Hotfix xpeops to add more logs
Stakeholder Inquiries
- Complete the "validity from" and "validity to" data
- Infra connections
- Degradation in the number of bookings
- Search results not sorted
- Sold-out caused by NPE
PagerDuties
- Pot daily latency from several APIs (ticket, product detail, booking, issuance)
- Latency from geo services
- Auto-resolved, no action taken
- xpepapi high CPU usage and system load
  - I temporarily scaled out to cope with the situation
  - The root cause was a cache flush event that was slightly bigger than usual
- Hanging bookings exceeded threshold
  - High request volume from the landmark API
Pending Tasks/Bugs
- Product category mismatch between demand and pot
- Sort-by-price bug
Sync after meeting
Topic: xpesrch improvement
Goal
1. Reduce latency in xpesrch
Fact: at times, RPC server latency reached ~8 minutes (yes, minutes, not seconds)
Questions to explore
1.1. Is the timeout config properly in place?
1.2. Does the logic use a blocking loop that adds latency on the server side? (explore parallelization)
1.3. Will introducing async calls to pot reduce latency and free up resources to serve other use cases?
1.3.1. Example: get review request
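To make questions 1.1 to 1.3 concrete, here is a minimal sketch (not the actual xpesrch code) comparing a blocking loop with a parallel fan-out, with a per-call timeout on each upstream call. The names `fetch_from_pot`, `fetch_with_timeout`, and `POT_CALL_TIMEOUT_S` are hypothetical, the timeout value is an assumption, and the pot call is simulated with a sleep.

```python
import asyncio
import time
from typing import List, Optional

POT_CALL_TIMEOUT_S = 0.5  # assumed per-call deadline, not the real config value


async def fetch_from_pot(product_id: str, delay_s: float = 0.1) -> dict:
    """Hypothetical stand-in for one RPC to pot; sleeps instead of doing I/O."""
    await asyncio.sleep(delay_s)
    return {"product_id": product_id, "available": True}


async def fetch_with_timeout(product_id: str, delay_s: float = 0.1) -> Optional[dict]:
    """Bound a single call so one slow upstream response cannot stall the request."""
    try:
        return await asyncio.wait_for(
            fetch_from_pot(product_id, delay_s), timeout=POT_CALL_TIMEOUT_S
        )
    except asyncio.TimeoutError:
        return None  # caller decides how to treat a missing result


async def search_sequential(product_ids: List[str]) -> List[Optional[dict]]:
    """Blocking-loop shape: each call waits for the previous one to finish."""
    return [await fetch_with_timeout(pid) for pid in product_ids]


async def search_parallel(product_ids: List[str]) -> List[Optional[dict]]:
    """Parallel shape: all calls are in flight at once; order is preserved."""
    return await asyncio.gather(*(fetch_with_timeout(pid) for pid in product_ids))


def timed(coro_fn, product_ids):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(product_ids))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = [f"p{i}" for i in range(5)]
    _, seq_s = timed(search_sequential, ids)
    _, par_s = timed(search_parallel, ids)
    print(f"sequential: {seq_s:.2f}s, parallel: {par_s:.2f}s")
```

In this sketch the sequential version takes roughly the sum of the call latencies while the parallel one takes roughly the slowest call, and the timeout caps the worst case either way. Whether this carries over to xpesrch depends on how its calls to pot are actually structured.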
2. Increase xpesrch throughput
Question
2.1. Introduce an async caller to pot?
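One possible shape for 2.1, as a rough sketch only: an async caller that caps the number of in-flight requests so xpesrch can serve more work concurrently without flooding pot. `PotAsyncCaller`, `MAX_IN_FLIGHT`, and `run_batch` are hypothetical names, the cap is an assumed value, and the pot call is again simulated with a sleep.

```python
import asyncio
from typing import Iterable, List, Tuple

MAX_IN_FLIGHT = 3  # assumed concurrency cap; the real value would need tuning


class PotAsyncCaller:
    """Hypothetical async caller that limits concurrent requests to pot."""

    def __init__(self, max_in_flight: int = MAX_IN_FLIGHT) -> None:
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak_in_flight = 0  # recorded only to show the cap is respected

    async def _call_pot(self, request_id: int) -> str:
        # Stand-in for the real RPC; sleeps instead of doing network I/O.
        await asyncio.sleep(0.05)
        return f"ok:{request_id}"

    async def call(self, request_id: int) -> str:
        # Requests beyond the cap wait here without blocking a thread.
        async with self._sem:
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await self._call_pot(request_id)
            finally:
                self.in_flight -= 1


async def run_batch(request_ids: Iterable[int]) -> Tuple[List[str], int]:
    """Send all requests through one caller; return results and peak concurrency."""
    caller = PotAsyncCaller()
    results = await asyncio.gather(*(caller.call(r) for r in request_ids))
    return list(results), caller.peak_in_flight


if __name__ == "__main__":
    results, peak = asyncio.run(run_batch(range(10)))
    print(len(results), "responses, peak in flight:", peak)
```

The cap is the main design choice here: without it, an async caller could turn a throughput gain on the xpesrch side into a load spike on pot.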