To prepare for the upcoming EPIC SALE, the APR team ran stress tests to check the availability and reliability of our services.
After several stress tests, we noticed that aprso (the APR service that serves omni data) experienced high latency because CPU usage on aprso-es reached 100%.
The aprso-es metrics showed that search thread pool usage had also increased, which indicated that we needed more CPU cores so that more search workers could run.
So we decided to scale our ES cluster vertically.
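The link between CPU cores and search workers can be sketched with a small calculation. This assumes the default sizing of the Elasticsearch `search` thread pool documented for the 6.x/7.x line, int((allocated processors * 3) / 2) + 1; the exact default may differ by version or on a managed service, and the function name below is only illustrative. The core counts are examples, not measurements from aprso-es (c5.2xlarge instances have 8 vCPUs).

```python
def search_pool_size(allocated_processors: int) -> int:
    """Assumed default size of the fixed `search` thread pool on one node."""
    # Documented default: int((allocated processors * 3) / 2) + 1
    return (allocated_processors * 3) // 2 + 1


# Compare a few example core counts; 8 matches a c5.2xlarge instance.
for cores in (2, 4, 8):
    print(f"{cores} vCPUs -> {search_pool_size(cores)} search threads")
```

Under this assumption, a node with more cores gets more search threads by default, which is why adding cores also adds search workers.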
Thread Pools and Search Request Errors in Elasticsearch:
https://qbox.io/blog/thread-pools-elasticsearch-search-request-errors
aprso-es CPU usage reached 100%:
https://app.datadoghq.com/dashboard/4xm-mtr-4yv/apr-elasticsearch-summary-screenboard?from_ts=1565856000000&to_ts=1565860774440&live=false
aprso high latency and errors:
https://app.datadoghq.com/dashboard/2wz-yra-q5f/apr-aprso---service-health?from_ts=1565858110422&to_ts=1565859134086&live=false&tile_size=m
Scale up ES instance type to c5.2xlarge.elasticsearch
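For reference, an instance type change like this one can be requested through the AWS SDK. The following is a minimal sketch, assuming the cluster is an Amazon Elasticsearch Service domain managed with boto3; the domain name `aprso-es` and both helper functions are hypothetical, and this is not necessarily how the change was actually applied.

```python
def build_scale_up_request(domain_name: str, instance_type: str) -> dict:
    """Build the arguments for an instance type change (hypothetical helper)."""
    return {
        "DomainName": domain_name,
        "ElasticsearchClusterConfig": {"InstanceType": instance_type},
    }


def apply_scale_up(request: dict) -> dict:
    """Send the change to AWS; needs boto3 and valid credentials."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    client = boto3.client("es")
    return client.update_elasticsearch_domain_config(**request)


# "aprso-es" as the domain name is an assumption for illustration.
request = build_scale_up_request("aprso-es", "c5.2xlarge.elasticsearch")
print(request)
```

Building the request separately from sending it lets the payload be reviewed before `apply_scale_up(request)` is called against a live domain.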
The scale-up caused unexpected behaviour, although the ES documentation describes this process as safe.