[Breaking Changes] NRTPROD Dataset Retention Policy

Hi all, the Ingestion team is planning to apply a retention policy to the nrtprod dataset in the tvlk-realtime project on GCP.

Why do we need it?

We have been keeping data in BQ since 2012, and most of us do not need the older records. In line with this quarter's cost optimization mission, it is more cost-efficient to archive data older than 2 years to cheaper storage, in our case GCS. Based on last month's bill, the BQ storage cost was around $10K. By a naive calculation, we could potentially save 30-40% of that, or $3-4K a month.

What is the policy?

Data older than 2 years will be archived to GCS and removed from BQ. The archive location is gs://tvlk-data-datalake-prod/traveloka/data/v1/tracking/parquet/<table_name>.
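To make the rule concrete, here is a minimal Python sketch of the cutoff and the archive path. It is an illustration only, not the Ingestion team's actual job. The helper names (retention_cutoff, archive_uri, is_archived) are hypothetical, and it assumes that age is measured in calendar years from the run date and that data dated exactly on the cutoff stays in BQ. The announcement does not specify either point.

```python
from datetime import date

# Archive root taken from the announcement; <table_name> is appended per table.
ARCHIVE_ROOT = "gs://tvlk-data-datalake-prod/traveloka/data/v1/tracking/parquet"
RETENTION_YEARS = 2


def retention_cutoff(today: date, years: int = RETENTION_YEARS) -> date:
    """Return the date `years` calendar years before `today`."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # `today` is Feb 29 and the target year is not a leap year.
        return today.replace(year=today.year - years, day=28)


def archive_uri(table_name: str) -> str:
    """Build the GCS location where a table's archived data is stored."""
    return f"{ARCHIVE_ROOT}/{table_name}"


def is_archived(data_date: date, today: date) -> bool:
    """Assumption: data strictly older than the cutoff is archived."""
    return data_date < retention_cutoff(today)
```

For example, is_archived(some_date, date.today()) tells you whether a query touching some_date would need the GCS archive rather than BQ, under the assumptions above.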

What do I need to do?

When will the policy apply?

If there are no further concerns, we will apply this on 24th Wednesday 2020.

Where can I get support before and after the policy is applied?

Please join the #data-decommissions channel for support, or contact any Ingestion team member directly.

FAQ

What should we do if we have a lot of tables?
If you send us the list of tables, we will help you automate the process instead of clicking through the UI one by one.

What if I occasionally still need data older than 2 years?
Data older than the retention period is still available in GCS. Your ETL can use GCS as the data source or, if really needed, we can temporarily restore the data to BQ.
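
One possible way to use GCS as the data source is a BigQuery external table over the archived Parquet files. The sketch below only builds the DDL statement as a string. The function name external_table_ddl, the target dataset name, and the trailing /* wildcard (which assumes the Parquet files sit directly under the table's prefix) are all assumptions, so please check the actual file layout with the Ingestion team before relying on it.

```python
# Archive root taken from the announcement.
ARCHIVE_ROOT = "gs://tvlk-data-datalake-prod/traveloka/data/v1/tracking/parquet"


def external_table_ddl(table_name: str, target_dataset: str) -> str:
    """Build a BigQuery DDL statement for an external table over the archive.

    `target_dataset` is a hypothetical dataset you own; nothing is executed here.
    """
    # Assumption: one wildcard after the table prefix matches the Parquet files.
    uri = f"{ARCHIVE_ROOT}/{table_name}/*"
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{target_dataset}.{table_name}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['{uri}']\n"
        ");"
    )


if __name__ == "__main__":
    # Print the statement so it can be reviewed before running it in BQ.
    print(external_table_ddl("my_table", "my_scratch_dataset"))
```

An external table leaves the data in GCS, so it does not add BQ storage cost. The trade-off is that queries on it are typically slower than on native tables.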

Thank you for your attention!

Best Regards,
Ingestion Team