Writing a Parquet file to Azure Blob storage from a Spark dataframe - low executor performance


Question

Spark version: 2.3

Hadoop dist: Azure HDInsight 2.6.5

Platform: Azure

Storage: Blob

Nodes in cluster: 6

Executor instances: 6

Cores per executor: 3

Memory per executor: 8 GB
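For reference, a minimal PySpark sketch of a session configured with the resources listed above (the app name is a placeholder; on HDInsight these values are typically passed via spark-submit or the cluster config instead):

```python
from pyspark.sql import SparkSession

# Sketch only: mirrors the executor resources listed above.
spark = (
    SparkSession.builder
    .appName("csv-to-parquet")                 # placeholder name
    .config("spark.executor.instances", "6")
    .config("spark.executor.cores", "3")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```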

I am trying to convert a CSV file (4.5 GB, 280 columns, 2.8 million rows) stored in an Azure blob (wasb) to Parquet format via a Spark dataframe, writing to the same storage account. I have repartitioned the file with different partition counts (20, 40, 60, 100), but I keep hitting a strange issue: 2 of the 6 executors, which process a very small subset of the records (less than 1%), keep running for about an hour before eventually completing.
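The job as described, as a hedged sketch (the paths, schema options, and partition count are placeholders; the asker's actual script is not shown):

```python
# Paths are illustrative; the question does not give the real container names.
src = "wasb://<container>@<account>.blob.core.windows.net/input/data.csv"
dst = "wasb://<container>@<account>.blob.core.windows.net/output/data.parquet"

df = spark.read.csv(src, header=True, inferSchema=True)

# The question tried 20, 40, 60 and 100 partitions; repartition() triggers
# a full shuffle of the 4.5 GB dataset before the write.
df.repartition(100).write.mode("overwrite").parquet(dst)
```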

Questions:

1) The partitions processed by these 2 executors have the fewest records (less than 1%), yet they take almost an hour to complete. What is the reason for this? Is this the opposite of a data-skew scenario?
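One way to check whether this really is skew is to count rows per partition; `spark_partition_id()` is a PySpark built-in, so this sketch applies directly to the dataframe above:

```python
from pyspark.sql.functions import spark_partition_id

# Rows per partition: a heavily uneven distribution would confirm skew,
# while an even one points elsewhere (e.g. slow I/O on those two executors).
(df.withColumn("pid", spark_partition_id())
   .groupBy("pid")
   .count()
   .orderBy("count")
   .show(200, truncate=False))
```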

2) The local cache folders on the nodes running these executors fill up (50-60 GB). I am not sure of the reason behind this.
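One plausible (unconfirmed) explanation: wide operations such as `repartition()` write shuffle blocks to each executor's local directories, and those files are only cleaned up when the application ends. A quick way to see where they land:

```python
# "spark.local.dir" defaults to /tmp; on YARN the node manager's own local
# dirs take precedence. The 50-60 GB observed would be shuffle/spill files
# accumulating under these directories while the job runs.
print(spark.conf.get("spark.local.dir", "/tmp"))
```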

3) Increasing the number of partitions does bring the overall execution time down to 40 minutes, but I want to understand the reason for the low throughput on these 2 executors.

I am new to Spark, so I am looking for pointers on tuning this workload. Additional information from the Spark Web UI is attached.


Answer

I don't see the attached picture. Can you share the script, or the parts of it where you repartition the data and write it out? What is the average task execution time? What is the total number of tasks/stages? How many tasks get stuck, and what are their execution times?
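To gather the numbers asked for here without screenshots, the Spark monitoring REST API can be queried while the application (or the history server) is up; the host and port below are assumptions:

```python
import requests

# The driver UI usually serves the REST API on port 4040; host is a placeholder.
base = "http://<driver-host>:4040/api/v1"

app_id = requests.get(f"{base}/applications").json()[0]["id"]
for stage in requests.get(f"{base}/applications/{app_id}/stages").json():
    # numCompleteTasks and executorRunTime (ms) come from the v1 stage schema.
    print(stage["stageId"], stage["numCompleteTasks"], stage["executorRunTime"])
```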

