Writing a Parquet file to Azure Blob storage from a Spark dataframe - low executor performance


Question

Spark version: 2.3

Hadoop dist: Azure HDInsight 2.6.5

Platform: Azure

Storage: Blob

Nodes in cluster: 6

Executor instances: 6

Cores per executor: 3

Memory per executor: 8 GB
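For reference, a minimal PySpark sketch of a session configured with the resources listed above (the app name is a placeholder; on HDInsight these values are typically passed via spark-submit or the cluster config instead):

```python
from pyspark.sql import SparkSession

# Sketch only: mirrors the executor resources listed above.
spark = (
    SparkSession.builder
    .appName("csv-to-parquet")                 # placeholder name
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "3")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```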

I am trying to convert a CSV file (4.5 GB, 280 columns, 2.8 million rows) stored in an Azure blob (wasb) to Parquet format via a Spark dataframe, writing to the same storage account. I have repartitioned the file with different partition counts (20, 40, 60, 100), but I keep hitting a strange issue: 2 of the 6 executors, which process a very small subset of the records (less than 1%), keep running for about an hour before eventually completing.
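The job as described, as a hedged sketch (the paths, schema options, and partition count are placeholders; the asker's actual script is not shown):

```python
# Paths are illustrative; the question does not give the real container names.
src = "wasb://<container>@<account>.blob.core.windows.net/input/data.csv"
dst = "wasb://<container>@<account>.blob.core.windows.net/output/data.parquet"

df = spark.read.csv(src, header=True, inferSchema=True)

# The question tried 20, 40, 60 and 100 partitions; repartition() triggers
# a full shuffle of the 4.5 GB dataset before the write.
df.repartition(100).write.mode("overwrite").parquet(dst)
```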

Questions:

1) The partitions processed by these 2 executors have the fewest records (less than 1%), yet they take almost an hour to complete. What is the reason for this? Is this the opposite of a data-skew scenario?
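One way to check whether this really is skew is to count rows per partition; `spark_partition_id()` is a PySpark built-in, so this sketch applies directly to the dataframe above:

```python
from pyspark.sql.functions import spark_partition_id

# Rows per partition: a heavily uneven distribution would confirm skew,
# while an even one points elsewhere (e.g. slow I/O on those two executors).
(df.withColumn("pid", spark_partition_id())
   .groupBy("pid")
   .count()
   .orderBy("count")
   .show(200, truncate=False))
```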

2) The local cache folders on the nodes running these executors fill up (50-60 GB). I am not sure of the reason behind this.
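One plausible (unconfirmed) explanation: wide operations such as `repartition()` write shuffle blocks to each executor's local directories, and those files are only cleaned up when the application ends. A quick way to see where they land:

```python
# "spark.local.dir" defaults to /tmp; on YARN the node manager's own local
# dirs take precedence. The 50-60 GB observed would be shuffle/spill files
# accumulating under these directories while the job runs.
print(spark.conf.get("spark.local.dir", "/tmp"))
```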

3) Increasing the number of partitions does bring the overall execution time down to 40 minutes, but I want to understand the reason for the low throughput on these 2 executors.

I am new to Spark, so I am looking for pointers on tuning this workload. Additional information from the Spark Web UI is attached.


Answer

I don't see the attached picture. Can you share the script, or the parts of it where you repartition the data and write it out? What is the average task execution time? What is the total number of tasks/stages? How many tasks get stuck, and what are their execution times?
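To gather the numbers asked for here without screenshots, the Spark monitoring REST API can be queried while the application (or the history server) is up; the host and port below are assumptions:

```python
import requests

# The driver UI usually serves the REST API on port 4040; host is a placeholder.
base = "http://<driver-host>:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    # numCompleteTasks and executorRunTime (ms) come from the v1 stage schema.
    print(stage["stageId"], stage["numCompleteTasks"], stage["executorRunTime"])
```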

