Spark Dataset cache is using only one executor


Problem description

I have a process that reads a Hive (parquet-snappy) table and builds a 2GB Dataset. The process is iterative (~7K iterations), and the Dataset is the same for every iteration, so I decided to cache it.
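
A minimal Scala sketch of the setup described above; the table name db.parquet_table, the column key, and the per-iteration work are placeholders, not from the original question:

    import org.apache.spark.sql.{Dataset, Row, SparkSession}
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("cached-iterations")
      .enableHiveSupport() // required to read Hive tables
      .getOrCreate()

    // Read the Hive table (Parquet + Snappy) and cache the ~2GB Dataset,
    // since all ~7K iterations reuse the same data.
    val base: Dataset[Row] = spark.table("db.parquet_table").cache()
    base.count() // materialize the cache before the loop starts

    (1 to 7000).foreach { i =>
      // hypothetical per-iteration work against the cached Dataset
      base.filter(col("key") === i).count()
    }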

Somehow the cache task runs on only one executor, and the cached data seems to live on that one executor only, which leads to delays, OOMs, etc.

Is it because of Parquet? How can I make sure the cache is distributed across multiple executors?

Here is the Spark config (a session-settings sketch follows the list):

  1. Executors: 3
  2. Cores: 4
  3. Memory: 4GB
  4. Partitions: 200
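
For reference, a sketch of those settings expressed on the SparkSession builder (on a real cluster they are more commonly passed to spark-submit as --num-executors, --executor-cores, and --executor-memory):

    import org.apache.spark.sql.SparkSession

    // The configuration above as session settings; values match the list.
    val spark = SparkSession.builder()
      .config("spark.executor.instances", "3")
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "4g")
      .config("spark.sql.shuffle.partitions", "200")
      .enableHiveSupport()
      .getOrCreate()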

I tried repartitioning and adjusting the config, but no luck.
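
The repartition attempt presumably looked something like this (a hedged sketch reusing the spark session and placeholder table name from above; in this case it did not spread the cache):

    // Attempted fix: force a shuffle before caching so the cached blocks
    // spread across executors. Here it did not help.
    val repartitioned = spark.table("db.parquet_table")
      .repartition(200)
      .cache()
    repartitioned.count()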

Answer

I am answering my own question, but it is an interesting finding and worth sharing, as @thebluephantom suggested.

So here is the situation: in the Spark code I was reading data from 3 Hive Parquet tables and building the dataset. In my case I am reading almost all columns from each table (approx. 502 columns), and Parquet is not ideal for that situation. But the interesting thing was that Spark was not creating blocks (partitions) for my data and was caching the entire dataset (~2GB) in just one executor.

Moreover, during my iterations, only one executor was doing all of the tasks.
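
A diagnostic sketch, not from the original answer, that makes the problem visible: check how many partitions the scan actually produced. A value of 1 explains both the single cached block and the single busy executor; the Storage tab of the Spark UI shows the same from the cache side.

    // Number of partitions Spark created for the scan. With one partition
    // there is one cache block on one executor, and every iteration runs
    // its tasks on that executor only.
    val ds = spark.table("db.parquet_table")
    println(s"input partitions: ${ds.rdd.getNumPartitions}")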

Also, spark.default.parallelism and spark.sql.shuffle.partitions gave me no control. After changing the format to Avro, I could actually tune the partitions, shuffles, tasks per executor, etc. as needed.
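
A sketch of the Avro route described here; the table names are placeholders, and the partition count is just an example (executors * cores):

    // One-time conversion of the Parquet table to Avro. The "avro" format
    // requires the spark-avro package on the classpath (a built-in module
    // since Spark 2.4, an external package before that).
    spark.table("db.parquet_table")
      .write
      .format("avro")
      .saveAsTable("db.avro_table")

    // The Avro copy splits into partitions that can be repartitioned and
    // cached across all executors.
    val avroDs = spark.table("db.avro_table")
      .repartition(12) // e.g. executors * cores = 3 * 4
      .cache()
    avroDs.count()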

Hope this helps! Thank you.
