Based on https://tvlk.slack.com/archives/CJS4Y5GJK/p1585205811024500, we are getting this error:
20/03/26 00:55:39 INFO ... found file name ERP Payment To Customer Report between 23/03/2020 00:00:00 to 23/03/2020 23:59:59 - Postgres ERP.xlsx
20/03/26 00:57:33 ERROR FileFormatWriter: Aborting job 4c962cc7-55ed-4162-a444-dc048d78aa8a.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 139267411 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
Tried the fix from https://kb.databricks.com/execution/spark-serialized-task-is-too-large.html, setting spark.rpc.message.maxSize = 256 on dev manually, and verified that it works (the serialized task was 139,267,411 bytes, just over the 134,217,728-byte / 128 MiB default).
Proposed mitigation
To mitigate the issue until we tackle it properly next sprint:
SuperAdmin to clone the EMR in prod, with the config applied manually as verified on dev, so that we can sync the 23rd report first.
Rationale: the ideal fix below needs about 2 man-days, so the manual clone lets us deliver the 23rd report without waiting for it.
Ideal fix
Include this in Terraform as an option that can be overridden per cluster. Estimated effort: 2 man-days. Steps described below:
1. Add the following entry to the EMR configuration under the spark-defaults classification (a Terraform sketch of this override follows the steps):
[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.rpc.message.maxSize": "256"
    }
  }
]
2. Cut a new release for (1)
3. Build a new image from https://github.com/traveloka/tvlk-eci-infra-image
4. Push the image to tvlk-prod under the latest tag at https://ap-southeast-1.console.aws.amazon.com/ecr/repositories/ecidtpl-emr-cd-eb1004d0e13eff50/?region=ap-southeast-1
5. Add the new environment variable for the spark.rpc.message.maxSize override at https://github.com/traveloka/terraform-aws-ecidtpl-codebuild-emr/blob/master/main.tf#L244 (see the module sketch after the steps)
6. Cut a release for (5)
7. Update https://github.com/traveloka/tvlk-eci-dev-terraform-aws/blob/master/ap-southeast-1/ecidtpl/ecidtpl-flatteners/ecidtpl-flattener-erp-emr/main.tf#L62 to pass the new variable (see the consumer sketch after the steps)
8. Test (7) by running the EMR in dev. Once (7) is verified, push the image under the production tag and update the prod EMR (same as step 7).
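
For step (1), a minimal sketch of how the same spark-defaults override can be expressed directly in Terraform via the aws_emr_cluster resource's configurations_json attribute, for the case where the cluster is managed as a Terraform resource rather than through the CodeBuild flow below. The resource name, release label, and role are hypothetical, and required arguments such as instance groups are elided:

# Sketch only: spark-defaults override on a Terraform-managed EMR cluster.
resource "aws_emr_cluster" "erp_flattener" {
  name          = "ecidtpl-flattener-erp"   # hypothetical
  release_label = "emr-5.29.0"              # hypothetical
  service_role  = "EMR_DefaultRole"         # hypothetical
  # ... instance groups, applications, etc. omitted ...

  # Same spark-defaults classification as in step (1)
  configurations_json = jsonencode([
    {
      classification = "spark-defaults"
      properties = {
        "spark.rpc.message.maxSize" = "256"
      }
    }
  ])
}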
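For step (5), a hedged sketch of the module change: declare a new input variable in terraform-aws-ecidtpl-codebuild-emr and surface it to the build as a CodeBuild environment variable. The variable name, env var name, and default below are assumptions, not the module's actual identifiers:

# Sketch only: new input variable, surfaced to the CodeBuild project.
variable "spark_rpc_message_max_size" {
  description = "spark.rpc.message.maxSize (in MiB) applied to the cluster's spark-defaults"
  type        = string
  default     = "128"   # Spark's built-in default (134217728 bytes)
}

resource "aws_codebuild_project" "emr" {
  # ... existing name / service_role / artifacts / source unchanged ...

  environment {
    # ... existing compute_type / image / type unchanged ...

    # New: presumably read by the image when it renders the spark-defaults JSON from (1)
    environment_variable {
      name  = "SPARK_RPC_MESSAGE_MAX_SIZE"
      value = var.spark_rpc_message_max_size
    }
  }
}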
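For step (7), the corresponding consumer-side change in ecidtpl-flattener-erp-emr/main.tf, assuming the variable name from the sketch above; the module label and source pin are illustrative:

# Sketch only: pass the override from the flattener's module call.
module "ecidtpl_flattener_erp_emr" {
  source = "github.com/traveloka/terraform-aws-ecidtpl-codebuild-emr"
  # ... existing arguments unchanged ...

  # Value verified on dev; 256 MiB comfortably covers the ~133 MiB serialized task
  spark_rpc_message_max_size = "256"
}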