Backend Newsletter Q3 2019
Hello y’all, this is the second Backend Update (we will call it the Newsletter from now on), covering Q3 2019. Based on the feedback we got on the previous update, we will try to post this quarterly. We had aimed to publish this at the end of Q3, but due to various issues we had to push it back to the start of Q4.
Initiative Highlights
ASG Migration
PIC: Salvian
Why
- Improve fault tolerance by replacing unhealthy instances automatically
- Improve reliability by always replacing old instances with clean, new ones in each deployment
- With a proper scaling policy, it will also improve cost-efficiency and availability
- Ansible Tower will be shut down in 6 months
Progress
- ent (2019Q4), pps (2019Q4), and vcp (2020Q1) will be the last 3 PDs to migrate to ASG; the others have either completed the migration or are currently working on it.
- 33 product domains have completed the migration. The average progress across all PDs is above 80%
- @arpit.agrawal has enabled a canary strategy for ASG deployments. Please test it soon by using the ‘master’ version of the deployment role
Plans
- None, other than helping the remaining teams to complete their migration if required
Problems
EC2 Downsizing
PIC: Igit, Salvian
Why
Our default EC2 instance type was m4.large. However:
- 150 out of 164 m4.large clusters use less than 10% CPU, and even with their likely arbitrary heap settings, 130 of them leave 4 GB of memory unused on average → theoretically, these clusters could run fine on smaller instance types (c5.large or t3.medium).
- Switching to t3.medium gives roughly a 55% EC2 cost reduction (a rough calculation is sketched below).
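As a back-of-the-envelope check of that figure, here is a minimal sketch of the arithmetic. The hourly prices below are illustrative assumptions only; the actual saving depends on your region and pricing model, which is where the ~55% figure above comes from.

    public class DownsizingSaving {
        public static void main(String[] args) {
            // Illustrative on-demand hourly prices (USD); these are assumptions,
            // not official figures. Check the AWS pricing page for your region.
            double m4LargePerHour = 0.125;
            double t3MediumPerHour = 0.0528;

            double savingPercent = (1 - t3MediumPerHour / m4LargePerHour) * 100;
            double monthlySavingPerInstance = (m4LargePerHour - t3MediumPerHour) * 24 * 30;

            System.out.printf("saving per instance: ~%.0f%% (~$%.0f/month)%n",
                    savingPercent, monthlySavingPerInstance);
        }
    }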
Progress
- The efficiency dashboard, cost-saving spreadsheet, and cluster sizing guide have been created and announced in #technology.
- Some PDs (trp, asi, bus, cul) have been informed in person and have agreed to downsize. Other PDs are proactively reducing their EC2 costs as well; fpr, pay, trn, and ugc have already downsized their clusters. Thank you for caring about your cloud infrastructure cost!
- You can put comments in the spreadsheet if you know your cluster info is outdated
Plans
- Join the EC2 cost-saving initiative by reading the Backend EC2 Cluster Sizing Guide. We can discuss it in #backend_cluster_optimization.
- The inefficiency dashboard and the JVM setting dashboard show that some PDs should benefit greatly from this initiative but might have missed the announcements
- We’ll inform more backend teams about this
Problems
Backend Load Testing
PIC: Igit
Why
- Many of our production resources are heavily underutilised / overprovisioned.
- Many of our engineers don’t precisely measure their cluster reliability.
- This causes unnecessarily high infrastructure cost and leaves production reliability unmeasured.
Progress
- We designed a step-by-step process that all interested parties can use as a guideline to load test their clusters and justify their cluster configuration against a given performance target and its SLA (application tier).
- For the latest overall load testing results and the associated cost (man-hours), check this document.
- 7 product domains (hcn, cxp, lpc, trn, trp, pay, fpr-post-main-flow) finished load testing their clusters in Q3.
- In total, 13 product domains have improved application reliability and increased resource utilisation (up to 10x) while saving more than 80% of their EC2 instance cost in production.
- Interested in joining? Start by reading the backend load testing landing page and Slack @igit for more details.
- The multi-account load testing platform is almost done. It will make running load tests and maintaining the load testing infrastructure easier for multi-account users. Thanks to CTV, LOC, TXT, AFC, the Fintech-Devops Infra team, the Site-infra team, and the Backend-infra team for their contributions.
- We just released a methodology to help backend developers set their application heap size
- It helps you understand your Java service/application memory requirements (a minimal sketch of inspecting heap usage at runtime follows this list)
- You can use the resulting number to pick the next instance type and size to verify during the load testing process
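As referenced in the list above, here is a minimal, illustrative sketch (an assumption of ours, not the released methodology itself) of how a Java service can report its own heap usage at runtime. Comparing these numbers against your -Xmx setting is one input for picking a smaller instance type to verify during load testing.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapReport {
        public static void main(String[] args) {
            MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memoryBean.getHeapMemoryUsage();

            long mb = 1024 * 1024;
            // used = what the application currently holds, committed = what the
            // JVM has reserved from the OS, max = the -Xmx ceiling.
            System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                    heap.getUsed() / mb, heap.getCommitted() / mb, heap.getMax() / mb);
        }
    }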
Plans
- Piloting the load testing methodology for multi-account users (early Oct 2019)
- Release building block for running load testing in multi-account environment (late Oct 2019)
- Follow up with teams that already committed to join in Q4 2019
- Follow up with teams that may join after migrating to ASG or enabling load testing building block in their environment (multi-account users)
- Reach out to business stakeholders to discuss the importance of load testing their product infrastructure
Problems
- Business stakeholders sometimes have a different view of load testing and its long-term effects
Multi-account
PIC: Fajrin, Febry Antonius, Gujarat Santana, Darwin Wirawan
Why
Goal: Enable autonomy
Impact:
- Enable research into new technologies such as containers and serverless. Keep in mind that tvlk-prod is deprecated. Improve your services’ reliability and maintainability by using better, more suitable solutions
- Provide cost visibility for better budgeting. Saved cost can be spent on cool perks!
- Strengthen security by enabling better security tools and reducing the blast radius
- Full power and control, suitable for teams that plan to refactor or kick-start a new product
Many teams have been driven by the impacts above to migrate to multi-account, as multi-account is a great investment for both technology and business. If you think multi-account will help your PD, feel free to contact us!
Progress
- 12 PDs are WIP
- 12 PDs are fully migrated
Plans
Labs and Sharing Sessions:
- The Multi-account onboarding and Introduction to Terraform labs will be released soon
- A Cost Management and Security sharing session will be held soon by Shvetsov Serhiy. This is a follow-up to the Q&A sessions last September
- More documentation, labs, and sharing sessions in Q4!
Feel free to discuss with us!
- If you think multi-account will help your PD: we will help you plan and drive the project, given resource and time availability
- If you are not sure multi-account will help: we will help you align multi-account with your current business state, vision, and priorities. We believe multi-account is a great investment that benefits all stakeholders
Problems
We don’t have any. Maybe you have problems, and multi-account can solve them!
Java 8 Migration
PIC: Ronny
Number of applicable product domains: 49
Why
Java 7 reached end-of-life in April 2015. Furthermore, we sometimes can’t use newer versions of some libraries because we still use Java 7. By using Java 8, we will:
- Have more options for feature development (because of more supported libraries and new Java 8 language features; see the sketch after this list)
- Have better performance for Java application
- Get up-to-date support
- Be able to research more JDK versions (example: Amazon Corretto)
- Take the first baby step towards upgrading to the latest Java version
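As a small, hypothetical illustration of the kind of feature this unlocks (the booking IDs below are made up), the filtering logic that needs an explicit loop and a temporary list on Java 7 becomes a short pipeline with Java 8 streams and lambdas:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class Java8Example {
        public static void main(String[] args) {
            List<String> bookingIds = Arrays.asList("B-1", "B-2", "CANCELLED-3", "B-4");

            // Java 7 style: explicit loop with a temporary list
            List<String> activeOld = new ArrayList<>();
            for (String id : bookingIds) {
                if (id.startsWith("B-")) {
                    activeOld.add(id);
                }
            }

            // Java 8 style: streams + lambdas express the same intent directly
            List<String> active = bookingIds.stream()
                    .filter(id -> id.startsWith("B-"))
                    .collect(Collectors.toList());

            System.out.println(active.equals(activeOld)); // prints: true
        }
    }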
Progress
- 2 product domains already build their code as Java 8
- 36 out of 49 (73%) product domains fully run on JRE 8 in production
- 10 out of 49 (20%) product domains are still migrating to JRE 8
Plans
- Finish the JRE migration for product domains that already started the project
- Finalize the timeline for teams that haven’t started the project yet → we aim to finish the project by the end of Q4
- Build the code as Java 8 code once everyone is on JRE 8
- Optimize our code base and building blocks to leverage Java 8 features
Problems
- Some teams may have different priorities for this project, which can delay building Java 8 code in some product domains
Multi-repo Migration
PIC: Christianto Handojo
Number of applicable product domains: 43
Why
Working in the old monorepo has become cumbersome due to the large code size and the number of different teams working in the repo, resulting in, among other things, an unreliable repo (landing failures etc.), uncontrolled growth in the number of branches, and long build times for revision checking and application releases. Moving the monorepo hosting to GitHub (thanks to @echon) has somewhat alleviated the reliability problem, but the other problems remain.
Progress
- 12 of 42 product domains have finished moving to their own repositories
- 24 of 42 product domains have made some progress in migrating their code, although 9 are currently on hold for various reasons
Plans
- Some teams will start this project in Q4. For those that haven’t started yet, don’t hesitate to contact us if you want to start the multi-repo migration
Problems
- Many teams are still unfamiliar with working in a truly multi-repo environment, which requires good practices around, among other things, API backward compatibility, library versioning, and team communication to ensure all systems work smoothly (a small backward-compatibility sketch follows below)
- Some documentation, sharing sessions, and perhaps other measures are needed to ensure this problem doesn’t happen often
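To make the backward-compatibility point concrete, here is a small, purely hypothetical sketch (BookingClient is an illustration, not a real internal library) of evolving a shared library's API without breaking consumers that upgrade on their own schedule:

    public class BookingClient {

        // Old signature kept for consumers still compiled against the previous
        // version; deprecated instead of removed so their builds keep working.
        @Deprecated
        public String getBooking(String bookingId) {
            return getBooking(bookingId, false);
        }

        // New signature; the extra behaviour is opt-in, so existing callers
        // are unaffected.
        public String getBooking(String bookingId, boolean includeCancelled) {
            // Placeholder body for illustration only.
            return "booking:" + bookingId + (includeCancelled ? ":all" : ":active");
        }
    }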
Backend Microservice Integration Testing Practices
PIC: Salvian
Why
- By surveying multiple backend teams, I found some problems with our current testing:
  1. Staging is used to test not only release candidates, but also most newly developed features & fixes
     - Some people run an IDE debugger against staging
     - This leads to unnecessary waiting & roadblocks, as these ever-changing feature versions introduce a variety of issues
  2. Many teams don’t monitor the state of their staging periodically
     - Issues are often left unaddressed until consumers complain because they’re blocked
  3. Staging verification is unreliable
     - Many teams run API tests manually, so testing is slow & error-prone
     - Different people often run different test cases, so testing is not thorough
Progress
- We know of at least two initiatives going on:
- Automated API tests to solve problems [2] and [3]. @Shawn Lim from the insurance team, along with @pieter, has implemented and shared backend end-to-end testing. A similar approach has also been implemented in accomm. You can also read our doc to learn more about API testing
- Component tests using mock services to solve problem [1]. @Subramaniam S is trying to find a way to help consumers test their integrations with provider services without deploying to staging. This way, people can test their features and fixes in parallel, independently of each other (a minimal sketch follows this list)
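To make the component-test idea concrete, here is a minimal sketch using WireMock as a stand-in provider service. WireMock, the endpoint, and the payload are our own illustrative assumptions here, not necessarily what the initiative will settle on:

    import com.github.tomakehurst.wiremock.WireMockServer;

    import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
    import static com.github.tomakehurst.wiremock.client.WireMock.get;
    import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

    public class ComponentTestSketch {
        public static void main(String[] args) {
            // A local mock of the provider service, so the consumer can be
            // tested without deploying anything to staging.
            WireMockServer provider = new WireMockServer(8089);
            provider.start();
            provider.stubFor(get(urlEqualTo("/booking/123"))
                    .willReturn(aResponse()
                            .withStatus(200)
                            .withBody("{\"bookingId\":\"123\",\"status\":\"ISSUED\"}")));

            // Point the consumer's provider URL at http://localhost:8089 and run
            // its integration logic against the stub instead of staging.

            provider.stop();
        }
    }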
Plans
- Gather more knowledge about the API and component test strategy
- We will share the new testing strategies
- Improve our testing documentation.
Monitoring Vendor Research
PIC: Ronny
Why
Because of some issues related to cost attribution and user-access management, we decided to find an alternative to Datadog. We want to finalize this soon (Q4 2019 or Q1 2020) because the migration from Datadog should ideally finish before next year’s Datadog contract expires.
Progress
So far we have started the trial for SignalFX with help from ipi and fpr (Thank you @putu.pradnyana, @Vincent, @janesa.tarigan). We have finished covering the current Datadog use cases, but not yet trialled APM (Application Performance Monitoring). Current docs: https://docs.google.com/document/d/1CoXr6AAHyOUuU2_QMJECMlFxdM55ATt0d9MyJ_Zr2Xg/edit# (will be moved to Confluence later).
Plans
- Continue with APM trial for SignalFX
- Find volunteer(s)
- Define the checklist of features to try for APM
- Find another vendor for trial
Problems
- It’s still not clear how to do a cost/benefit analysis of APM (can we justify the high cost?)
For example: if major incidents rarely happen in Traveloka, do we want to spend $1M/year on APM?
So we will need more details (not finalized yet) when doing incident tracking.
Log Analysis using AWS CWL Insights
PIC: Ronny
Why
We interviewed some teams regarding observability, and one common piece of feedback is that it’s hard to query CWL, especially when querying MongoDB logs (to find slow queries) or querying multiple application log groups (to trace a certain request / bookingId / invoiceId). Fortunately, CWL Insights now has a feature (released on 26 July 2019) to query multiple log groups.
Progress
The Backend-infra team has done preliminary research on AWS CWL Insights. Current docs:
We also got back to a few of the teams we interviewed before, but the problem is that their log formats differ between services, so it’s still hard to write a single query that parses them all correctly (for example: one service logs bookingId as a plain number, another logs it as “booking ID <number>”, and another as “bookingId: <number>”). A more uniform format would fix this (a small sketch follows below). But at least we can use this for MongoDB logs for now.
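As an illustration of what a more uniform format could look like (the key=value convention below is just one hypothetical option, not an agreed standard), logging the identifier the same way in every service is what would let a single Insights query parse it everywhere:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BookingLogger {
        private static final Logger log = LoggerFactory.getLogger(BookingLogger.class);

        public void onBookingCreated(String bookingId) {
            // Every service emits the identifier as the same key=value token,
            // e.g. "bookingId=123456", so one parse expression works everywhere.
            log.info("booking created bookingId={}", bookingId);
        }
    }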
Plans
Problems
We need to make sure this brings benefits to teams, but it is still somewhat hard to apply to most services’ logs because of non-uniform log message formats.
Other Information
Datadog Update
- Currently some pages on Datadog are restricted so that only admin users can open it. Related announcement: https://tvlk.slack.com/archives/G035PQNQZ/p1569926384172900.
- Related to that, we want to standardize the EC2 whitelisting in non-production environments because currently there are many monitored hosts that don’t need to be monitored. The problem is we have to pay for those hosts because they send metrics via the Datadog agent or because their EC2 metrics are being collected (via the Datadog-AWS integration). Related announcement: https://tvlk.slack.com/files/T02T3CAFM/FNVLKQC1E?origin_team=T02T3CAFM. We will also need your help to disable the Datadog agent on unnecessary hosts.
- Another thing: backend-infra and site-infra are in the middle of discussing a new annual contract with Datadog, because the current contract will expire on 31 October 2019. For this, we need to estimate our usage (number of hosts + custom metrics, which features we want to use) for the upcoming year.
Container
We postponed the research on backend containerization. One of its most obvious benefits, cost saving, is already achievable using EC2 with burstable / spot instances, with no additional research and lower implementation effort. However, we might consider containers for other initiatives, e.g. local testing using mock services. Don’t hesitate to contact us if you have any concerns.
The Effective Engineer
We recently had an internal book discussion on The Effective Engineer. We found that the book shares some very important mindsets that can help us become more effective as engineers. We strongly suggest that everyone read the book, no matter what level you are currently at. If you are too lazy (:p) to read the book (it is only ~200 pages long), you can read the summary we have made here.
Backend Newsletter as Confluence Blogs
All Backend Newsletters will be published as Confluence blog posts, starting from the previous one (Q1-Q2 2019). This newsletter has been published here.
Thank you
We would like to thank everyone who has contributed to all of these initiatives. Adieu!