Sharing Session by Amazon - Production Incidents (Large Scale Events) and Postmortems (Correction of Errors)

Hi all,

Amazon will hold a sharing session with us on handling incidents and conducting postmortems.

How do Amazon and AWS service teams build services and products while also operating them? The content of this session is the real processes, tools, and practices that our service teams and SDEs use at AWS/Amazon every day.

We will look at how the 2-pizza-team philosophy works and dive deep into the operational metrics that our product development teams measure (and why), the mechanisms they use to maintain operational excellence, what their success criteria and goals look like, and the service ownership culture behind them.

Our discussion will focus on how a high bar of Operational Excellence is achieved through repeatable, Amazon-specific mechanisms across DevOps, Agile processes, and metrics-driven decisions, including what our weekly metrics meetings look like and which metrics our service teams are goaled on. We will cover the end-to-end process of handling large-scale events: from how our services organize their on-call rotations, to how they handle the events, to how they run postmortems through our COE process.

We will finish with a round of discussion and Q&A.

If you are interested in joining the session, please fill in the form at https://bit.ly/30VON9l before 4 August 2020, 23:45 GMT+7. To add the event to your calendar, use this link: https://bit.ly/3eWP3tK

Thank you!