Based on https://tvlk.slack.com/archives/CJS4Y5GJK/p1585205811024500, we are getting this error:
20/03/26 00:55:39 INFO ... found file name ERP Payment To Customer Report between 23/03/2020 00:00:00 to 23/03/2020 23:59:59 - Postgres ERP.xlsx
20/03/26 00:57:33 ERROR FileFormatWriter: Aborting job 4c962cc7-55ed-4162-a444-dc048d78aa8a.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 139267411 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
Tried the fix from https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html, setting spark.rpc.message.maxSize = 256 on dev manually, and verified that it works (the serialized task was 139,267,411 bytes, just over the 134,217,728-byte / 128 MiB default).
Proposed mitigation
To mitigate the issue until we tackle it properly next sprint:
SuperAdmin to clone the EMR in prod, with the config applied manually as verified on dev, so that we can sync the 23rd report first.
Rationale: the ideal fix below needs about 2 man-days, so the manual clone lets us deliver the 23rd report without waiting for it.
Ideal fix
Include this in Terraform as an option that can be overridden per cluster. Estimated effort: 2 man-days. Steps described below:
1. Add the following entry to the EMR configuration under the spark-defaults classification (a Terraform sketch of this override follows the steps):
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.rpc.message.maxSize": "256"
    }
  }
]
2. Cut a new release for (1)
3. Build a new image from https://github.com/traveloka/tvlk-eci-infra-image
4. Push the image to tvlk-prod under the latest tag at https://ap-southeast-1.console.aws.amazon.com/ecr/repositories/ecidtpl-emr-cd-eb1004d0e13eff50/?region=ap-southeast-1
5. Add the new environment variable for the spark.rpc.message.maxSize override at https://github.com/traveloka/terraform-aws-ecidtpl-codebuild-emr/blob/master/main.tf#L244 (see the module sketch after the steps)
6. Cut a release for (5)
7. Update https://github.com/traveloka/tvlk-eci-dev-terraform-aws/blob/master/ap-southeast-1/ecidtpl/ecidtpl-flatteners/ecidtpl-flattener-erp-emr/main.tf#L62 to pass the new variable (see the consumer sketch after the steps)
8. Test (7) by running the EMR in dev. Once (7) is verified, push the image under the production tag and update the prod EMR (same as step 7).
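
For step (1), a minimal sketch of how the same spark-defaults override can be expressed directly in Terraform via the aws_emr_cluster resource's configurations_json attribute, for the case where the cluster is managed as a Terraform resource rather than through the CodeBuild flow below. The resource name, release label, and role are hypothetical, and required arguments such as instance groups are elided:

# Sketch only: spark-defaults override on a Terraform-managed EMR cluster.
resource "aws_emr_cluster" "erp_flattener" {
  name          = "ecidtpl-flattener-erp"   # hypothetical
  release_label = "emr-5.29.0"              # hypothetical
  service_role  = "EMR_DefaultRole"         # hypothetical
  # ... instance groups, applications, etc. omitted ...

  # Same spark-defaults classification as in step (1)
  configurations_json = jsonencode([
    {
      classification = "spark-defaults"
      properties = {
        "spark.rpc.message.maxSize" = "256"
      }
    }
  ])
}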
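For step (5), a hedged sketch of the module change: declare a new input variable in terraform-aws-ecidtpl-codebuild-emr and surface it to the build as a CodeBuild environment variable. The variable name, env var name, and default below are assumptions, not the module's actual identifiers:

# Sketch only: new input variable, surfaced to the CodeBuild project.
variable "spark_rpc_message_max_size" {
  description = "spark.rpc.message.maxSize (in MiB) applied to the cluster's spark-defaults"
  type        = string
  default     = "128"   # Spark's built-in default (134217728 bytes)
}

resource "aws_codebuild_project" "emr" {
  # ... existing name / service_role / artifacts / source unchanged ...

  environment {
    # ... existing compute_type / image / type unchanged ...

    # New: presumably read by the image when it renders the spark-defaults JSON from (1)
    environment_variable {
      name  = "SPARK_RPC_MESSAGE_MAX_SIZE"
      value = var.spark_rpc_message_max_size
    }
  }
}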
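For step (7), the corresponding consumer-side change in ecidtpl-flattener-erp-emr/main.tf, assuming the variable name from the sketch above; the module label and source pin are illustrative:

# Sketch only: pass the override from the flattener's module call.
module "ecidtpl_flattener_erp_emr" {
  source = "github.com/traveloka/terraform-aws-ecidtpl-codebuild-emr"
  # ... existing arguments unchanged ...

  # Value verified on dev; 256 MiB comfortably covers the ~133 MiB serialized task
  spark_rpc_message_max_size = "256"
}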