This post is a copy of Backend Infra Confluence Blog Post RDS Postgres 9.5 Major Upgrade Completion
We first became aware that Postgres 9.5 was scheduled to enter End-of-Life (EoL) status on February 11th 2021 in 2020 Q3. Based on what AWS did when Postgres 9.4 reached EoL and the high amount of RDS Postgres 9.5 owned by many teams (at least 67 in production environments), we decided to do a quick research on possible ways to do a major upgrade for RDS Postgres and help teams by guiding the upgrade process in staging and production environment should teams need it.
There are two different upgrade methods that we found by the end of research. The first method, the native upgrade method, utilizes the AWS RDS capability to order a major version upgrade to an existing instance. The read replicas will be upgraded together with the primary. This method is straightforward, but with a big downtime requirement, in the order of 20-25 minutes for the major upgrade only, not counting the minor upgrade if needed (with read replicas, we assume the downtime will be multiplied by the amount of replicas). Thus the majority of upgrades with this method were done in the time most convenient for the product (lowest traffic / lowest risk). For more details please see here.
The second method, the Bucardo method, utilizes an open-source tool for Postgres replication, Bucardo. We create a new instance with the upgraded version and use Bucardo to replicate all data from the current instance to the new one. After we know that all data is already replicated, we switch all services to use the new cluster. In practice, to help save time in the replication stage, we create the new instance using a snapshot of the current instance and upgrade the new instance to the wanted version. The majority of the research related to the Bucardo method was done by Adrian Susanto from Flight team and Yonathan Handjojo and Hadi Japarto from CorpTech Team. For more details please see here.
All 67 RDS Postgres 9.5 instances were upgraded before the EoL and took around four months (from early October 2020 to Early February 2021). The database and application downtime vary depending on the upgrade method, business requirements, and route-off strategy that the team has. With native upgrade method, the application and database downtime takes the same duration because we route-off the traffic to the application before doing the major upgrade. Overall, the downtime will be around 30 minutes for doing a major upgrade in an RDS instance with no read replica and 55 minutes for an RDS instance with 1 read replica instance. An RDS instance with a read replica will require twice the amount of downtime compared to a RDS instance without a read replica. Meanwhile, with the Bucardo method, we have zero downtime for the database and 5-15 minutes for application downtime in order to switch over the traffic from the old RDS instance to the upgraded RDS instance. In terms of man-hour effort, for doing a major upgrade to a production RDS cluster, both methods require similar man-hour effort which is around 2-3 man-hours. During RDS Postgres 9.5 major upgrade project, we got reports from some teams that they encountered an issue after the upgrade. As of now, all the issues have been resolved and documented in our RDS Postgres Major Upgrade documentation, RDS Native Upgrade documentation and RDS-Bucardo documentation.
Make sure to check your AWS accounts and tvlk-prod/dev and clean up all unused resources left after all the migration done. This might include EC2 instances where Bucardo was run, the old RDS instance / cluster, along with old RDS snapshots. For resources in tvlk-prod/dev, you will need Cloud Infra help for deletion of your old RDS instance/cluster, make a PR in infra-production-playbook repo for the changes wanted and raise an issue in the same repository to apply the changes.
During this project, we received feedback from a team that psql version 9.5 installed in their bastion image is not fully compatible with servers running postgres 11. So far only a feature is known to be not compatible, the \df function for checking all stored procedures in the server. Due to changes in system tables between Postgres 9.5 and Postgres 11, we expect this will be the case for many other psql version 9.5 functions. Thus we advise all teams to use the latest bastion image (AMI ID ami-08f6f372e4b7bf430, name tvlk/ubuntu-20/tsi/bastion/hvm/x86_64/1612851530) created by Cloud Infra which contains psql for Postgres 11. For now we will only install one psql in the bastion image, and expect that it will be compatible with all postgres versions used in Traveloka. If you find any problem between psql for version 11 with servers running Postgres 13, 12, 10 or 9.6, please inform us.
As per 4th February 2021, we populated a list of all our RDS instances and we have 89 RDS Postgres instances that use Postgres 9.6 in production environment. Postgres has announced that Postgres 9.6 will be EoL in 11th November 2021 and AWS has announced that RDS Postgres 9.6 will be EoL in 18th January 2022. Therefore, you might need to consider upgrading our RDS Postgres 9.6 instances starting from now. As of now, BEI has no plan to provide hands-on support to help the team doing the major upgrade. For teams that want to do the major upgrade by themselves, you can start planning the upgrade as early as possible. One possible benefit that you can leverage from the major upgrade is you can use new features from either Postgres or AWS RDS Postgres, such as Performance Insight or M5 instance family. As the target major version, so far we have only tested upgrading to Postgres 11. You may try Postgres 12 or 13 as the target, but please test thoroughly in staging.
The old proxy method for calling RDS Postgres from service servers requires a codebuild pipeline to generate the proxy jar and upload it to the beiartf. The generator code used in the pipeline is only compatible with Postgres 9.5. Our first advice to tackle this problem is to move all communication methods with RDS to using jOOQ, but considering that many teams can’t move to jOOQ in a short period of time, we have published a new proxy generator image that uses Postgres 9.6 [IMAGE_ID]. We also have published an updated generator library compatible with Postgres 9.6 (com.traveloka.common:postgres-proxy-generator:11.1.27). However, we might not do this when Postgres 9.6 enter EoL status, since we aim to deprecate the proxy method. Please read this link for the current Postgres development and deployment method.
For those who still use proxy and have sql code in the tv-sql-product repository, the repository will be updated to use the new library on March 1st 2021. Please adjust your pipelines to use the new docker image in case of usage after March 1st. For those with separate sql repositories but still using proxy, please update the library and the image used yourself. If the repo structure still follows the tv-sql-product, you can find the library declaration in gradle/scripts/publish_library.gradle, the dependencies closure. To update the docker image you can simply update the terraform-aws-codebuild-postgres-proxy-publisher module for your pipeline to version 0.10.0
This concludes the RDS Postgres 9.5 Major Upgrade project. We want to thank @adrian.susanto from FPR, @Frans, @agusprasetiyo, and @dimas.yuwono from FinTech, @yonathan.handjojo and @hadi_japarto from CorpTech, @Namit from Platform DevOps, and all other engineers that have collaborated with us in driving this project to completion. We also want to remind once again to teams to start planning about RDS Postgres 9.6 Major Upgrade in relation to the given EoL date, and that Backend Infra team has no plan to provide hands-on support as of now. This might change later based on the situation then. Thank you for your attention.