Buspapi AWS Support Transcript (9 July 2018)

Amazon Web Services
Jul 9, 2018
02:02 PM +0700

Dear Amazon Web Services Customer,
Thank you for reaching out to AWS Support. My name is Daniel, and it will be my pleasure to assist you today.
After reviewing your case and the previous work by my colleagues Nick and John, it is challenging for us in AWS Premium Support to investigate your case further without logs from the time the event took place. This is due to our limited visibility into the OS. To make sure the underlying hardware did not have problems, I looked at all instances mentioned in this case and the underlying hardware hosting them, and I could not find any issues on our side; all equipment is working as expected.
As you mentioned, these instances are behind a load balancer, and I noted that your instances are of a larger instance size, m4.large. What I usually suggest to customers is, if possible, to change to a smaller instance type, such as one from the t2 family, and to run more instances behind the load balancer, so that if one instance fails it is not a big deal because you have many more instances backing your app. Keep in mind that before you do that, you need to make sure your application can work this way.
At this moment, as you mentioned, the issue is no longer a problem for you, as it is not happening again. For your peace of mind, I suggest that you run the tools my colleague Nick proposed in previous correspondence. I will be more than happy to keep this case locked to me, so that if the issue happens again in the next four days you can reach out to me directly and I can promptly work on finding its cause.
I hope the above helps, and please don't hesitate to contact me in case of any event on your application. Have a good day.
Best regards,
Daniel D.
Amazon Web Services
Check out the AWS Support Knowledge Center, a knowledge base of articles and videos that answer customer questions about AWS services: https://aws.amazon.com/premiumsupport/knowledge-center/?icmpid=support_email_category
We value your feedback. Please rate my response using the link below.
===================================================
To contact us again about this case, please return to the AWS Support Center using the following URL: https://console.aws.amazon.com/support/home#/case/?displayId=5186571871&language=en (If you are connecting by federation, log in before following the link.)
*Please note: this e-mail was sent from an address that cannot accept incoming e-mail. Please use the link above if you need to contact us again about this same issue.
====================================================================
Learn to work with the AWS Cloud. Get started with free online videos and self-paced labs at http://aws.amazon.com/training/
====================================================================
Amazon Web Services, Inc. is an affiliate of Amazon.com, Inc. Amazon.com is a registered trademark of Amazon.com, Inc. or its affiliates.
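Editor's note: the suggestion above (more, smaller instances behind the load balancer) comes down to simple arithmetic. The sketch below is a minimal illustration, assuming identical instances that share load evenly; the function name and the fleet sizes are hypothetical, and the case itself only mentions two instances.

```python
def remaining_capacity_fraction(instance_count, failed=1):
    """Fraction of fleet capacity left after `failed` identical instances drop out.

    Assumes every instance contributes the same capacity and that the load
    balancer spreads traffic evenly (a simplification, not an AWS guarantee).
    """
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    if not 0 <= failed <= instance_count:
        raise ValueError("failed must be between 0 and instance_count")
    return (instance_count - failed) / instance_count


if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show the trend.
    for count in (2, 4, 8):
        share = remaining_capacity_fraction(count)
        print(f"{count} instances, 1 failed: {share:.0%} of capacity remains")
```

The more instances share the load, the smaller the share lost when one of them misbehaves. This says nothing about whether a smaller instance type can run the application at all, which is the caveat Daniel raises.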

ReadOnly (Role)
Jul 8, 2018
10:14 PM +0700

Hi, sorry for the mistake in mentioning the second instance. The other instance ID is i-09210aef691c11642, and the load balancer is an Elastic Load Balancer. For the time being, it seems the problem has not occurred again over the past few days, without any apparent changes on the application side.

Amazon Web Services
Jul 7, 2018
02:08 AM +0700

Hello again,
This is John with AWS Premium Support. As my colleague is currently unavailable, I will be assisting you with your follow-up question.
In your reply, you mentioned the instance twice (i-00cf395479ef243b0). Did you mean to do this? Based on the context of your inquiry, it looks like you may have meant to provide another instance, since you mentioned two being in a cluster. Related to the cluster, can you please tell us which kind of cluster it is (e.g. Application Load Balancer, Elastic Load Balancer, Network Load Balancer, ElastiCache node, etc.)? Also, if there's another instance we should be checking, please include that instance ID so we can investigate its underlying hardware for you.
As to the first instance mentioned, I checked its underlying host for the past ten days and can confirm that the host has been operating well within normal parameters. If you have another instance you'd like me to check, I'd be happy to assist further.
Looking forward to your response.
Best regards,
John H.
Amazon Web Services

ReadOnly (Role)
Jul 6, 2018
06:06 PM +0700

Hi, thank you for the response and for suggesting a method to monitor per-process CPU usage. We may incorporate it to better understand possible future incidents.
Regarding the incident this case addresses: one of our initial hypotheses of a possible hardware issue was based on the abnormal behaviour of this particular instance (i-00cf395479ef243b0), which did not occur on another instance in the same cluster (i-00cf395479ef243b0) running the same service under a load balancer setup. (It did happen there a little, but far less severely, and only once the first machine started having trouble serving requests properly.) What do you think about this particular case?
Another note: it is also quite interesting that you mentioned there was no other CPU-related spike on the host, because at around that time another service from a different team was also having an unexpected CPU spike, isolated from ours. (A separate investigation is under way for that case, as theirs seems more complex and has more possible factors, and there seems to be no application-level correlation with ours.)

Amazon Web Services
Jul 6, 2018
12:06 AM +0700

Hello,
Thank you for contacting AWS Support. I understand that you would like to investigate why instance i-00cf395479ef243b0 experienced an unexpected spike in CPU Utilization today.
Reviewing the CloudWatch metrics, I do see the spike in CPU Utilization today between 18.44 and 19.40 (UTC+7), as well as a second spike between 21.20 and 21.51 (UTC+7). However, I did not find any evidence of a hardware issue on the underlying host, and did not find that other instances on the host experienced a similar CPU spike.
CPUUtilization: https://console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#metricsV2:graph=~%28region~%27ap-southeast-1~metrics~%28~%28~%27AWS*2fEC2~%27CPUUtilization~%27InstanceId~%27i-00cf395479ef243b0%29%29~period~60~stat~%27Maximum~start~%272018-07-05T10*3a13*3a00Z~end~%272018-07-05T17*3a29*3a00Z%29
As you mentioned, I don't see any increase in network traffic to or from this instance, and I also verified there was not an increase in activity on the root EBS volume.
NetworkIn: https://console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#metricsV2:graph=~%28region~%27ap-southeast-1~metrics~%28~%28~%27AWS*2fEC2~%27NetworkIn~%27InstanceId~%27i-00cf395479ef243b0%29%29~period~60~stat~%27Maximum~start~%272018-07-05T10*3a13*3a00Z~end~%272018-07-05T17*3a29*3a00Z%29
NetworkOut: https://console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#metricsV2:graph=~%28region~%27ap-southeast-1~metrics~%28~%28~%27AWS*2fEC2~%27NetworkOut~%27InstanceId~%27i-00cf395479ef243b0%29%29~period~60~stat~%27Maximum~start~%272018-07-05T10*3a13*3a00Z~end~%272018-07-05T17*3a29*3a00Z%29
VolumeIdleTime: https://console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#metricsV2:graph=~%28region~%27ap-southeast-1~metrics~%28~%28~%27AWS*2fEBS~%27VolumeIdleTime~%27VolumeId~%27vol-0eff4356bfe0f31de%29%29~period~60~stat~%27Maximum~start~%272018-07-05T10*3a13*3a00Z~end~%272018-07-05T17*3a29*3a00Z%29
This leads me to believe that a process on the instance was not behaving as intended, but given my limited visibility into the OS running on the instance, I cannot be certain of the root cause of this issue. By default, Linux does not log CPU usage per process, so unless you already have a third-party monitoring tool running, I don't think we can be sure of exactly what happened.
As previously mentioned, a second spike occurred after you opened this case, so to help troubleshoot further, I would recommend installing a tool such as atop. Atop is similar to top in that it can be used from the shell to monitor system resources; however, it can also be run as a service. When run as a service, it generates a binary log file at /var/log/atop/atop_YYYYMMDD that can be played back with atop in the event that another spike occurs, and should help to determine which specific process is causing this issue.
https://www.atoptool.nl/
https://www.tecmint.com/how-to-install-atop-to-monitor-logging-activity-of-linux-system-processes/
Please feel free to reach out to us again if you have any further questions or concerns.
Best regards,
Nick S.
Amazon Web Services
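Editor's note: the point that Linux does not log per-process CPU usage by default can be illustrated with a minimal sketch that samples the kernel's per-process counters twice and ranks processes by the CPU time they consumed in between. This is not atop and not an AWS tool; the function names, the 5-second interval, and the `proc_root` parameter are assumptions made for illustration. It relies only on the documented layout of /proc/<pid>/stat, where fields 14 and 15 are utime and stime in clock ticks.

```python
import os
import time
from pathlib import Path


def parse_stat(text):
    """Return (command name, utime + stime in clock ticks) from /proc/<pid>/stat text."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    name = text[start + 1:end]
    fields = text[end + 1:].split()
    # fields[0] is the state (field 3), so utime (field 14) and stime
    # (field 15) sit at indexes 11 and 12.
    return name, int(fields[11]) + int(fields[12])


def snapshot(proc_root="/proc"):
    """Map pid -> (name, cumulative CPU ticks) for every readable process."""
    result = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            result[int(entry.name)] = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or its stat file was unreadable
    return result


def top_consumers(before, after, limit=5):
    """Rank processes present in both snapshots by CPU ticks used in between."""
    deltas = [
        (ticks - before[pid][1], pid, name)
        for pid, (name, ticks) in after.items()
        if pid in before
    ]
    deltas.sort(reverse=True)
    return deltas[:limit]


if __name__ == "__main__":
    first = snapshot()
    time.sleep(5)  # sampling interval, an arbitrary choice
    second = snapshot()
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    for delta, pid, name in top_consumers(first, second):
        print(f"{pid:>7} {name:<20} {delta / ticks_per_second:.2f} CPU-seconds")
```

A one-off sample like this only helps while a spike is in progress. A tool that records continuously, such as the atop service Nick recommends, is what allows a past spike to be replayed.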
Was this response helpful? Click here to rate:

Excellent
ReadOnly (Role)
Jul 5, 2018
08:34 PM +0700

Unexpected CPU Spike on Instance i-00cf395479ef243b0 (Production).
The incident happened on 5 July 2018. It started at 18.44 (UTC+7) and raised the system's open file count to a production-affecting level starting at 19.40 (UTC+7). So far we have not found any expected factor or correlation on the app side (based on request count, exceptions, computation, etc.), so we suspect the cause is a hardware or other issue. The incident was handled by restarting and re-deploying to the troubled instance.
Please advise. Thank you.
Instance ID(s): i-00cf395479ef243b0
Attachments:
Screenshot from 2018-07-05 20-23-25.png
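
Editor's note: the report above mentions the system's open files rising to a production-affecting level. One way to see which process holds them is to count the entries under /proc/<pid>/fd. The sketch below is a minimal illustration for Linux; the function names, the `proc_root` parameter, and the threshold of 1000 are hypothetical, and reading other users' processes generally requires root.

```python
import os
from pathlib import Path


def open_fd_counts(proc_root="/proc"):
    """Map pid -> number of open file descriptors, skipping unreadable processes."""
    counts = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue
        try:
            counts[int(entry.name)] = len(os.listdir(entry / "fd"))
        except OSError:
            continue  # process exited, or permission denied
    return counts


def over_threshold(counts, threshold):
    """Return (count, pid) pairs at or above the threshold, largest first."""
    return sorted(
        ((count, pid) for pid, count in counts.items() if count >= threshold),
        reverse=True,
    )


if __name__ == "__main__":
    # The threshold is an arbitrary example value, not a recommended limit.
    for count, pid in over_threshold(open_fd_counts(), threshold=1000):
        print(f"pid {pid} holds {count} open file descriptors")
```

Running a check like this on a schedule and logging the result would give the per-process history that was missing when this case was opened.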