We had a major incident last weekend: 7 hours of downtime (9 hours on web) for Transport and Connectivity, and 20+ hours of downtime for Experience, Accommodation, and Package, due to a crash of the Local service.
The irony is that this is the largest incident we've had in the history of Traveloka, yet it has the tiniest corner-case root cause we've encountered (geographical data innocently modified on Friday night, causing an infinite loop).
Root cause discovery was slow due to a lack of observability (especially logging of certain key domain-specific operations) that could have hinted at the source of the problem, combined with some less-than-robust design choices and some missing defensive coding practices, given the criticality of the Local service.
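On the defensive-coding point, here is a purely illustrative sketch (hypothetical names; not the actual Local service code) of the kind of bounded traversal plus loud logging that turns "cyclic geographical data hangs the service" into "one failing request with a clear error in the logs":

    import logging

    logger = logging.getLogger("local.geo")

    MAX_ANCESTOR_DEPTH = 50  # sanity bound; a well-formed hierarchy is far shallower

    def resolve_region_ancestry(region_id, parent_of):
        """Walk up a region hierarchy, refusing to loop forever on cyclic data."""
        ancestry, seen = [], set()
        current = region_id
        while current is not None:
            if current in seen or len(ancestry) > MAX_ANCESTOR_DEPTH:
                # Defensive: corrupted or cyclic geo data should fail fast and
                # leave a trail for responders, not spin silently.
                logger.error("Cycle or excessive depth in geo hierarchy at %r (path=%r)",
                             current, ancestry)
                raise ValueError(f"Invalid geo hierarchy near region {current!r}")
            seen.add(current)
            ancestry.append(current)
            current = parent_of.get(current)
        return ancestry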
While short-term prevention, mitigation, migration, and architecture reviews are already scheduled for the Local and Accom teams plus some related (TV) teams, this incident, even if rare, surfaces systematic problems that we are starting to have as the organization grows, and there is no quick fix for them.
One big takeaway is that we are missing the next level of functional technology leadership: leadership that defines and governs, across the company, the proper processes, tools, instrumentation, standards, and implementation guidelines for each team; that decides when and whether to take on technical debt, according to tiers of severity; and that determines how these become blocking or non-blocking requirements in project specs and roadmap planning.
When we were a startup, one incident once in a while was okay. As a 300+ person, 30-team organization, one incident per quarter per team can statistically add up to one incident every 3 days (30 teams × 1 incident per ~90 days). Not only does 1) the surface area and potential frequency of incidents increase, but 2) some critical components bear so much more scale and impact when they fail that they require a next-level standard of care we might not even have in the company yet, because we lack architecture experts at that level.
(Our availability for most products in the months prior to this has been around 99.8%, i.e. about 1.5 hours of downtime per month. Compare this with Google Cloud's 99.999% for its data centers, i.e. about 5 minutes per year (this is not a typo); most global companies are between 99.9% and 99.99%. Achieving 99.9% or more for some critical components (e.g. NE, User services, etc.) requires another level of expertise + governance that we still need to acquire.)
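For intuition on those numbers, a quick back-of-envelope sketch (illustrative only; the figures in the parenthetical above are rounded versions of these) of how an availability percentage translates into a downtime budget:

    # Downtime budget implied by each availability tier (illustrative only).
    HOURS_PER_MONTH = 30 * 24
    MINUTES_PER_YEAR = 365 * 24 * 60

    tiers = [
        ("99.8%   (our recent months)",        0.998,   HOURS_PER_MONTH,  "hours/month"),
        ("99.9%   (global companies, low)",    0.999,   HOURS_PER_MONTH,  "hours/month"),
        ("99.99%  (global companies, high)",   0.9999,  MINUTES_PER_YEAR, "minutes/year"),
        ("99.999% (Google Cloud data center)", 0.99999, MINUTES_PER_YEAR, "minutes/year"),
    ]
    for label, availability, period, unit in tiers:
        print(f"{label}: {(1 - availability) * period:.1f} {unit} of downtime")

For scale: this weekend's 20+ hours of downtime alone is more than 20 years' worth of a 99.99% yearly budget (~53 minutes/year).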
The fix is not fast, but the outline is as follows:
This is on top of the fixes and preventions already being implemented across the Local team and some Accom teams.
This might impact the product roadmaps of most teams. In the future, we would like to converge to 60:20:
Ultimately, Sustainability only serves to help the business sustain its growth, so as the company grows, all leaders (not just Tech leaders, but All leaders) in the company should increasingly hold all metrics in balance - immediate vs. lagging, upside vs. risk mitigation, growth vs. sustainability.
A special thank you to the team members who were up from Saturday 7:30am through Sunday and Monday handling this incident: notably Denni, who took the initiative to lead the group for an extended period; Evan as the main Local PIC; Prashant, who enlisted an India-based architect on the spot; Fajrin, who redirected his trip to the office on the spot; the Site Infra team; the Web Infra team; Kevin & Elisa; and others; as well as the various product teams who were up on Sunday in the war room in W77 Tower 1 to handle the aftermath of the incident for our customers, partners, and suppliers.
We recognize that the impact to the business becomes more and more major as the organization grows more complex (broader, deeper, and with more scale in each domain), however tiny we think the root cause might be. Recognizing there is no quick answer to this, this functional organizational leadership upgrade (people + governance + their implementation organization-wide) will take top priority.
Documents: