AWS EMR Performance issues while copying the intermediate processed data by Spark to the Target S3


Problem Description

Currently I am using AWS EMR for data processing, with S3 as the landing zone and the store for the final processed data. The final processed data from S3 is loaded into Redshift for customers to run analytics.

  • Daily I receive about 100 small files, ranging from a few KB up to 2-3 MB. Per the SLA, the data must be present in Redshift within 15 minutes of the source file becoming available in the landing zone. The final bucket for the Orders table is 800 GB.

SCD Type 1 is implemented.

pySpark is used for processing. Data cleansing is done in 2-3 minutes.

Spark creates an intermediate folder for data processing, from which we write the final processed data to another S3 bucket.

This process takes almost 45 minutes, even for small data of a few KB. The code is below:

spark.conf.set('spark.sql.sources.partitionOverwriteMode','dynamic')

Checked that a normal data copy between folders under the same bucket takes 3-4 minutes.

A 5-node transient cluster is being used (r5.4xlarge).

  df.write.format("parquet").partitionBy("src", "hash_value").mode("append").save(path)

Recommended Answer

Following is the tuning done over the last couple of days. The loading time was reduced from 1.5 hours to 30 minutes.

  1. Used R type instances. They provide more memory compared to M type instances at the same price.

  2. Used coalesce to merge the files in the source, since there were many small files.
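As a rough illustration (not part of the original answer), the number of partitions to coalesce to can be estimated from the total input size and a desired output file size. The helper below is a hypothetical sketch; the 128 MB target and the `estimate_partitions` name are assumptions for illustration only:

```python
import math

def estimate_partitions(total_bytes: int, target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Estimate how many partitions to coalesce to, so each output
    file is roughly target_file_bytes (128 MB by default)."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# 100 files of ~2 MB each -> ~200 MB total -> 2 output files instead of 100
num_parts = estimate_partitions(100 * 2 * 1024 * 1024)
# In the Spark job this would be applied before the write, e.g.
# df.coalesce(num_parts).write...
```

Fewer, larger output files mean fewer S3 PUT requests and fewer tasks downstream, which is usually the main cost with many small files.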

  3. Checked the number of mapper tasks. The more tasks, the lower the performance.

  4. We had some unnecessary counts over the data; removed them.

  5. Used EMRFS on EMR 6, with the configuration below in the EMR config file. Previously we used:

{
  "classification": "spark-env",
  "properties": {
    "spark.executor.memory": "16g",
    "spark.driver.memory": "4g",
    "spark.driver.cores": "4",
    "spark.driver.memoryOverhead": "4g",
    "spark.executor.cores": "5",
    "spark.executor.memoryOverhead": "4g"
  }
}

Now I am using the following:

{
    "Classification": "spark",
    "Properties": {
       "maximizeResourceAllocation": "true"
    }
  }
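The answer does not say which EMRFS-related setting was involved. For illustration only, on EMR the S3-optimized committer for Parquet writes (enabled by default on recent releases) is controlled by a configuration like the following; treat this fragment as an assumption, not as what the author used:

{
    "Classification": "spark-defaults",
    "Properties": {
       "spark.sql.parquet.fs.optimized.committer.optimization-enabled": "true"
    }
}

This committer avoids the slow rename-based commit against S3, which is often the bottleneck when writing many partitioned output files.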
 
