Backend Newsletter Q3 2019
Hello y’all, this is the second Backend Update (we will call it the Newsletter from now on), covering Q3 2019. Based on the feedback we got on the previous update, we will try to post this quarterly. We had aimed to publish this at the end of Q3, but due to various issues we had to push it back to the start of Q4.
Initiative Highlights
ASG Migration
PIC: Salvian
Why
- Improve fault tolerance by replacing unhealthy instances automatically
- Improve reliability by always replacing old instances with clean, new ones in each deployment
- With a proper scaling policy, it will also improve cost-efficiency and availability
- Ansible Tower will be shut down in 6 months
Progress
- ent (2019Q4), pps (2019Q4), and vcp (2020Q1) will be the last 3 PDs to migrate to ASG; the others have either completed the migration or are currently working on it.
- 33 product domains have completed the migration. The average progress across all PDs is above 80%
- @arpit.agrawal has enabled a canary strategy for ASG deployments. Please test it soon by using the ‘master’ version of the deployment role
Plans
- None, other than helping the remaining teams to complete their migration if required
Problems
EC2 Downsizing
PIC: Igit, Salvian
Why
Our default EC2 instance type was m4.large. However:
- 150 out of 164 m4.large clusters use less than 10% CPU, and even with their likely arbitrary heap settings, 130 of them leave 4 GB of memory unused on average → theoretically, these clusters could run fine on smaller instance types (c5.large or t3.medium).
- Switching to t3.medium gives roughly a 55% EC2 cost reduction (a rough calculation is sketched below).
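As a back-of-the-envelope check of that figure, here is a minimal sketch of the arithmetic. The hourly prices below are illustrative assumptions only; the actual saving depends on your region and pricing model, which is where the ~55% figure above comes from.

    public class DownsizingSaving {
        public static void main(String[] args) {
            // Illustrative on-demand hourly prices (USD); these are assumptions,
            // not official figures. Check the AWS pricing page for your region.
            double m4LargePerHour = 0.125;
            double t3MediumPerHour = 0.0528;

            double savingPercent = (1 - t3MediumPerHour / m4LargePerHour) * 100;
            double monthlySavingPerInstance = (m4LargePerHour - t3MediumPerHour) * 24 * 30;

            System.out.printf("saving per instance: ~%.0f%% (~$%.0f/month)%n",
                    savingPercent, monthlySavingPerInstance);
        }
    }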
Progress
- The efficiency dashboard, cost-saving spreadsheet, and cluster sizing guide have been created and announced in #technology.
- Some PDs (trp, asi, bus, cul) have been informed in person and have agreed to downsize. Other PDs are proactively reducing their EC2 costs as well; fpr, pay, trn, and ugc have already downsized their clusters. Thank you for caring about your cloud infrastructure cost!
- You can put comments in the spreadsheet if you know your cluster info is outdated
Plans
- Join the EC2 cost-saving initiative by reading the Backend EC2 Cluster Sizing Guide. We can discuss it in #backend_cluster_optimization.
- The inefficiency dashboard and the JVM setting dashboard show that some PDs should benefit greatly from this initiative but might have missed the announcements
- We’ll inform more backend teams about this
Problems
Backend Load Testing
PIC: Igit
Why
- Many of our production resources are heavily underutilised / overprovisioned.
- Many of our engineers don’t precisely measure their cluster reliability.
- This causes unnecessarily high infrastructure cost and leaves production reliability unmeasured.
Progress
- We designed a step-by-step process that all interested parties can use as a guideline to load test their clusters and justify their cluster configuration against a given performance target and its SLA (application tier).
- For the latest overall load testing results and the associated cost (man-hours), check this document.
- 7 product domains (hcn, cxp, lpc, trn, trp, pay, fpr-post-main-flow) finished load testing their clusters in Q3.
- In total, 13 product domains have improved application reliability and increased resource utilisation (up to 10x) while saving more than 80% of their EC2 instance cost in production.
- Interested in joining? Start by reading the backend load testing landing page and Slack @igit for more details.
- The multi-account load testing platform is almost done. It will make running load tests and maintaining the load testing infrastructure easier for multi-account users. Thanks to CTV, LOC, TXT, AFC, the Fintech-Devops Infra team, the Site-infra team, and the Backend-infra team for their contributions.
- We just released a methodology to help backend developers set their application heap size
- It helps you understand your Java service/application memory requirements (a minimal sketch of inspecting heap usage at runtime follows this list)
- You can use the resulting number to pick the next instance type and size to verify during the load testing process
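As referenced in the list above, here is a minimal, illustrative sketch (an assumption of ours, not the released methodology itself) of how a Java service can report its own heap usage at runtime. Comparing these numbers against your -Xmx setting is one input for picking a smaller instance type to verify during load testing.

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public class HeapReport {
        public static void main(String[] args) {
            MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
            MemoryUsage heap = memoryBean.getHeapMemoryUsage();

            long mb = 1024 * 1024;
            // used = what the application currently holds, committed = what the
            // JVM has reserved from the OS, max = the -Xmx ceiling.
            System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                    heap.getUsed() / mb, heap.getCommitted() / mb, heap.getMax() / mb);
        }
    }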
Plans
- Piloting the load testing methodology for multi-account users (early Oct 2019)
- Release building block for running load testing in multi-account environment (late Oct 2019)
- Follow up with teams that already committed to join in Q4 2019
- Follow up with teams that may join after migrating to ASG or enabling load testing building block in their environment (multi-account users)
- Reach out to business stakeholders to discuss the importance of load testing their product infrastructure
Problems
- Business stakeholders sometimes have a different view of load testing and its long-term effects
Multi-account
PIC: Fajrin, Febry Antonius, Gujarat Santana, Darwin Wirawan
Why
Goal: Enable autonomy
Impact:
- Enable research into new technologies such as containers and serverless. Keep in mind that tvlk-prod is deprecated. Improve your services’ reliability and maintainability by using better, more suitable solutions
- Provide cost visibility for better budgeting. Saved cost can be spent on cool perks!
- Strengthen security by enabling better security tools and reducing the blast radius
- Full power and control, suitable for teams that plan to refactor or kick-start a new product
Many teams have been driven by the impacts above to migrate to multi-account, as multi-account is a great investment for both technology and business. If you think multi-account will help your PD, feel free to contact us!
Progress
- 12 PDs are WIP
- 12 PDs are fully migrated
Plans
Labs and Sharing Sessions:
- The Multi-account onboarding and Introduction to Terraform labs will be released soon
- A Cost Management and Security sharing session will be held soon by Shvetsov Serhiy. This is a follow-up to the Q&A sessions last September
- More documentation, labs, and sharing sessions in Q4!
Feel free to discuss with us!
- If you think multi-account will help your PD: we will help you plan and drive the project, given resource and time availability
- If you are not sure multi-account will help: we will help you align multi-account with your current business state, vision, and priorities. We believe multi-account is a great investment that benefits all stakeholders
Problems
We don’t have any. Maybe you have problems, and multi-account can solve them!
Java 8 Migration
PIC: Ronny
Number of applicable product domains: 49
Why
Java 7 reached end-of-life in April 2015. Furthermore, we sometimes can’t use newer versions of some libraries because we still use Java 7. By using Java 8, we will:
- Have more options for feature development (because of more supported libraries and new Java 8 language features; see the sketch after this list)
- Have better performance for Java application
- Get up-to-date support
- Be able to research more JDK versions (example: Amazon Corretto)
- Take the first baby step towards upgrading to the latest Java version
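As a small, hypothetical illustration of the kind of feature this unlocks (the booking IDs below are made up), the filtering logic that needs an explicit loop and a temporary list on Java 7 becomes a short pipeline with Java 8 streams and lambdas:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class Java8Example {
        public static void main(String[] args) {
            List<String> bookingIds = Arrays.asList("B-1", "B-2", "CANCELLED-3", "B-4");

            // Java 7 style: explicit loop with a temporary list
            List<String> activeOld = new ArrayList<>();
            for (String id : bookingIds) {
                if (id.startsWith("B-")) {
                    activeOld.add(id);
                }
            }

            // Java 8 style: streams + lambdas express the same intent directly
            List<String> active = bookingIds.stream()
                    .filter(id -> id.startsWith("B-"))
                    .collect(Collectors.toList());

            System.out.println(active.equals(activeOld)); // prints: true
        }
    }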
Progress
- 2 product domains already build their code as Java 8
- 36 out of 49 (73%) product domains fully run on JRE 8 in production
- 10 out of 49 (20%) product domains are still migrating to JRE 8
Plans
- Finish the JRE migration for product domains that already started the project
- Finalize the timeline for teams that haven’t started the project yet → we aim to finish the project by the end of Q4
- Build the code as Java 8 code once everyone is on JRE 8
- Optimize our code base and building blocks to leverage Java 8 features
Problems
- Some teams may have different priorities for this project, which can delay building Java 8 code in some product domains
Multi-repo Migration
PIC: Christianto Handojo
Number of applicable product domains: 43
Why
Working in the old monorepo has become cumbersome due to the large code size and the number of different teams working in the repo, resulting in, among other things, an unreliable repo (landing failures etc.), uncontrolled growth in the number of branches, and long build times for revision checking and application releases. Moving the monorepo hosting to GitHub (thanks to @echon) has somewhat alleviated the reliability problem, but the other problems remain.
Progress
- 12 of 42 product domains have finished moving to their own repositories
- 24 of 42 product domains have made some progress in migrating their code, although 9 are currently on hold for various reasons
Plans
- Some teams will start this project in Q4. For those that haven’t started yet, don’t hesitate to contact us if you want to start the multi-repo migration
Problems
- Many teams are still unfamiliar with working in a truly multi-repo environment, which requires good practices around, among other things, API backward compatibility, library versioning, and team communication to ensure all systems work smoothly (a small backward-compatibility sketch follows below)
- Some documentation, sharing sessions, and perhaps other measures are needed to ensure this problem doesn’t happen often
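To make the backward-compatibility point concrete, here is a small, purely hypothetical sketch (BookingClient is an illustration, not a real internal library) of evolving a shared library's API without breaking consumers that upgrade on their own schedule:

    public class BookingClient {

        // Old signature kept for consumers still compiled against the previous
        // version; deprecated instead of removed so their builds keep working.
        @Deprecated
        public String getBooking(String bookingId) {
            return getBooking(bookingId, false);
        }

        // New signature; the extra behaviour is opt-in, so existing callers
        // are unaffected.
        public String getBooking(String bookingId, boolean includeCancelled) {
            // Placeholder body for illustration only.
            return "booking:" + bookingId + (includeCancelled ? ":all" : ":active");
        }
    }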
Backend Microservice Integration Testing Practices
PIC: Salvian
Why
- By surveying multiple backend teams, I found some problems with our current testing:
  1. Staging is used to test not only release candidates, but also most newly developed features & fixes
     - Some people run an IDE debugger against staging
     - This leads to unnecessary waiting & roadblocks, as these ever-changing feature versions introduce a variety of issues
  2. Many teams don’t monitor the state of their staging periodically
     - Issues are often left unaddressed until consumers complain because they’re blocked
  3. Staging verification is unreliable
     - Many teams run API tests manually, so testing is slow & error-prone
     - Different people often run different test cases, so testing is not thorough
Progress
- We know of at least two initiatives going on:
- Automated API tests to solve problems [2] and [3]. @Shawn Lim from the insurance team, along with @pieter, has implemented and shared backend end-to-end testing. A similar approach has also been implemented in accomm. You can also read our doc to learn more about API testing
- Component tests using mock services to solve problem [1]. @Subramaniam S is trying to find a way to help consumers test their integrations with provider services without deploying to staging. This way, people can test their features and fixes in parallel, independently of each other (a minimal sketch follows this list)
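To make the component-test idea concrete, here is a minimal sketch using WireMock as a stand-in provider service. WireMock, the endpoint, and the payload are our own illustrative assumptions here, not necessarily what the initiative will settle on:

    import com.github.tomakehurst.wiremock.WireMockServer;

    import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
    import static com.github.tomakehurst.wiremock.client.WireMock.get;
    import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

    public class ComponentTestSketch {
        public static void main(String[] args) {
            // A local mock of the provider service, so the consumer can be
            // tested without deploying anything to staging.
            WireMockServer provider = new WireMockServer(8089);
            provider.start();
            provider.stubFor(get(urlEqualTo("/booking/123"))
                    .willReturn(aResponse()
                            .withStatus(200)
                            .withBody("{\"bookingId\":\"123\",\"status\":\"ISSUED\"}")));

            // Point the consumer's provider URL at http://localhost:8089 and run
            // its integration logic against the stub instead of staging.

            provider.stop();
        }
    }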
Plans
- Gather more knowledge about the API and component test strategy
- We will share the new testing strategies
- Improve our testing documentation.
Monitoring Vendor Research
PIC: Ronny
Why
Because of some issues related to cost attribution and user-access management, we decided to find an alternative to Datadog. We want to finalize this soon (Q4 2019 or Q1 2020) because the migration from Datadog should ideally finish before next year’s Datadog contract expires.
Progress
So far we have started the trial for SignalFX with help from ipi and fpr (Thank you @putu.pradnyana, @Vincent, @janesa.tarigan). We have finished covering the current Datadog use cases, but not yet trialled APM (Application Performance Monitoring). Current docs: https://docs.google.com/document/d/1CoXr6AAHyOUuU2_QMJECMlFxdM55ATt0d9MyJ_Zr2Xg/edit# (will be moved to Confluence later).
Plans
- Continue with APM trial for SignalFX
- Find volunteer(s)
- Define the checklist of features to try for APM
- Find another vendor for trial
Problems
- It’s still not clear how to do a cost/benefit analysis of APM (can we justify the high cost?)
For example: if major incidents rarely happen in Traveloka, do we want to spend $1M/year on APM?
So we will need more details (not finalized yet) when doing incident tracking.
Log Analysis using AWS CWL Insights
PIC: Ronny
Why
We interviewed some teams regarding observability, and one common piece of feedback is that it’s hard to query CWL, especially when querying MongoDB logs (to find slow queries) or querying multiple application log groups (to trace a certain request / bookingId / invoiceId). Fortunately, CWL Insights now has a feature (released on 26 July 2019) to query multiple log groups.
Progress
The Backend-infra team has done preliminary research on AWS CWL Insights. Current docs:
We also got back to a few of the teams we interviewed before, but the problem is that their log formats differ between services, so it’s still hard to write a single query that parses them all correctly (for example: one service logs bookingId as a plain number, another logs it as “booking ID <number>”, and another as “bookingId: <number>”). A more uniform format would fix this (a small sketch follows below). But at least we can use this for MongoDB logs for now.
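As an illustration of what a more uniform format could look like (the key=value convention below is just one hypothetical option, not an agreed standard), logging the identifier the same way in every service is what would let a single Insights query parse it everywhere:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BookingLogger {
        private static final Logger log = LoggerFactory.getLogger(BookingLogger.class);

        public void onBookingCreated(String bookingId) {
            // Every service emits the identifier as the same key=value token,
            // e.g. "bookingId=123456", so one parse expression works everywhere.
            log.info("booking created bookingId={}", bookingId);
        }
    }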
Plans
Problems
We need to make sure this brings benefits to teams, but it is still somewhat hard to apply to most services’ logs because of non-uniform log message formats.
Other Information
Datadog Update
- Currently some pages on Datadog are restricted so that only admin users can open it. Related announcement: https://tvlk.slack.com/archives/G035PQNQZ/p1569926384172900.
- Related to that, we want to standardize the EC2 whitelisting in non-production environments because currently there are many monitored hosts that don’t need to be monitored. The problem is we have to pay for those hosts because they send metrics via the Datadog agent or because their EC2 metrics are being collected (via the Datadog-AWS integration). Related announcement: https://tvlk.slack.com/files/T02T3CAFM/FNVLKQC1E?origin_team=T02T3CAFM. We will also need your help to disable the Datadog agent on unnecessary hosts.
- Another thing: backend-infra and site-infra are in the middle of discussing a new annual contract with Datadog, because the current contract will expire on 31 October 2019. For this, we need to estimate our usage (number of hosts + custom metrics, which features we want to use) for the upcoming year.
Container
We postponed the research on backend containerization. One of its most obvious benefits, cost saving, is already achievable using EC2 with burstable / spot instances, with no additional research and lower implementation effort. However, we might consider containers for other initiatives, e.g. local testing using mock services. Don’t hesitate to contact us if you have any concerns.
The Effective Engineer
We recently had an internal book discussion on The Effective Engineer. We found that the book shares some very important mindsets that can help us become more effective as engineers. We strongly suggest that everyone read the book, no matter what level you are currently at. If you are too lazy (:p) to read the book (it is only ~200 pages long), you can read the summary we have made here.
Backend Newsletter as Confluence Blogs
All Backend Newsletters will be published as Confluence blog posts, starting from the previous one (Q1-Q2 2019). This newsletter has been published here.
Thank you
We would like to thank everyone who has contributed to all of these initiatives. Adieu!