We had a major incident last weekend: 7 hours of downtime (9 hours on web) for Transport and Connectivity, and 20+ hours of downtime for Experience, Accommodation, and Package, due to a crash of the Local service.
The irony is that this is the largest incident we've had in the history of Traveloka, yet it has the tiniest corner-case root cause we've encountered (geographical data innocently modified on Friday night, causing an infinite loop).
Root cause discovery was slow due to a lack of observability (especially logging of certain key domain-specific operations) that could have hinted at the source of the problem, combined with some less-than-robust design choices and some missing defensive coding practices, given the criticality of the Local service.
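On the defensive-coding point, here is a purely illustrative sketch (hypothetical names; not the actual Local service code) of the kind of bounded traversal plus loud logging that turns "cyclic geographical data hangs the service" into "one failing request with a clear error in the logs":

    import logging

    logger = logging.getLogger("local.geo")

    MAX_ANCESTOR_DEPTH = 50  # sanity bound; a well-formed hierarchy is far shallower

    def resolve_region_ancestry(region_id, parent_of):
        """Walk up a region hierarchy, refusing to loop forever on cyclic data."""
        ancestry, seen = [], set()
        current = region_id
        while current is not None:
            if current in seen or len(ancestry) > MAX_ANCESTOR_DEPTH:
                # Defensive: corrupted or cyclic geo data should fail fast and
                # leave a trail for responders, not spin silently.
                logger.error("Cycle or excessive depth in geo hierarchy at %r (path=%r)",
                             current, ancestry)
                raise ValueError(f"Invalid geo hierarchy near region {current!r}")
            seen.add(current)
            ancestry.append(current)
            current = parent_of.get(current)
        return ancestry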
While short-term prevention, mitigation, migration, and architecture reviews are already scheduled for the Local and Accom teams plus some related (TV) teams, this incident, even if rare, surfaces systematic problems that we are starting to have as the organization grows, and there is no quick fix for them.
One big takeaway is that we are missing the next level of functional technology leadership: leadership that defines and governs, across the company, the proper processes, tools, instrumentation, standards, and implementation guidelines for each team; that decides when and whether to take on technical debt, according to tiers of severity; and that determines how these become blocking or non-blocking requirements in project specs and roadmap planning.
When we were a startup, one incident once in a while was okay. As a 300+ person, 30-team organization, one incident per quarter per team can statistically add up to one incident every 3 days (30 teams × 1 incident per ~90 days). Not only does 1) the surface area and potential frequency of incidents increase, but 2) some critical components bear so much more scale and impact when they fail that they require a next-level standard of care we might not even have in the company yet, because we lack architecture experts at that level.
(Our availability for most products in the months prior to this has been around 99.8%, i.e. about 1.5 hours of downtime per month. Compare this with Google Cloud's 99.999% for its data centers, i.e. about 5 minutes per year (this is not a typo); most global companies are between 99.9% and 99.99%. Achieving 99.9% or more for some critical components (e.g. NE, User services, etc.) requires another level of expertise + governance that we still need to acquire.)
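For intuition on those numbers, a quick back-of-envelope sketch (illustrative only; the figures in the parenthetical above are rounded versions of these) of how an availability percentage translates into a downtime budget:

    # Downtime budget implied by each availability tier (illustrative only).
    HOURS_PER_MONTH = 30 * 24
    MINUTES_PER_YEAR = 365 * 24 * 60

    tiers = [
        ("99.8%   (our recent months)",        0.998,   HOURS_PER_MONTH,  "hours/month"),
        ("99.9%   (global companies, low)",    0.999,   HOURS_PER_MONTH,  "hours/month"),
        ("99.99%  (global companies, high)",   0.9999,  MINUTES_PER_YEAR, "minutes/year"),
        ("99.999% (Google Cloud data center)", 0.99999, MINUTES_PER_YEAR, "minutes/year"),
    ]
    for label, availability, period, unit in tiers:
        print(f"{label}: {(1 - availability) * period:.1f} {unit} of downtime")

For scale: this weekend's 20+ hours of downtime alone is more than 20 years' worth of a 99.99% yearly budget (~53 minutes/year).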
The fix is not fast, but the outline is as follows:
This is on top of the fixes and preventions already being implemented across the Local team and some Accom teams.
This might impact the product roadmaps of most teams. In the future, we would like to converge to 60:20:
Ultimately, Sustainability only serves to help the business sustain its growth, so as the company grows, all leaders (not just Tech leaders, but All leaders) in the company should increasingly hold all metrics in balance - immediate vs. lagging, upside vs. risk mitigation, growth vs. sustainability.
A special thank you to the team members who were up from Saturday 7:30am through Sunday and Monday handling this incident: notably Denni, who took the initiative to lead the group for an extended period; Evan as the main Local PIC; Prashant, who enlisted an India-based architect on the spot; Fajrin, who redirected his trip to the office on the spot; the Site Infra team; the Web Infra team; Kevin & Elisa; and others; as well as the various product teams who were up on Sunday in the war room in W77 Tower 1 to handle the aftermath of the incident for our customers, partners, and suppliers.
We recognize that the impact to the business becomes more and more major as the organization grows more complex (broader, deeper, and with more scale in each domain), however tiny we think the root cause might be. Recognizing there is no quick answer to this, this functional organizational leadership upgrade (people + governance + their implementation organization-wide) will take top priority.
Documents: