How to know when to repartition/coalesce an RDD with unbalanced partitions (possibly without shuffling)?

Question

I'm loading tens of thousands of gzipped files from s3 for my spark job. This results in some partitions being very small (10s of records) and some very large (10000s of records). The sizes of the partitions are pretty well distributed among nodes so each executor seems to be working on the same amount of data in aggregate. So I'm not really sure if I even have a problem.

How would I know if it's worth repartitioning or coalescing the RDD? Will either of these be able to balance the partitions without shuffling data? Also, the RDD will not be reused, just mapped over and then joined to another RDD.
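
For illustration, here is a minimal sketch of the setup described above, assuming a spark-shell session (so sc is already available); the S3 path is hypothetical. Because gzipped files are not splittable, each file loads as exactly one partition, which is where the imbalance comes from, and counting records per partition shows how skewed they are:

```scala
// Assumes a spark-shell session where sc is available; the S3 path is hypothetical.
// Each gzipped file is non-splittable, so it loads as exactly one partition.
val rdd = sc.textFile("s3n://my-bucket/logs/*.gz")

// Record count per partition, to see how skewed the partitions really are.
val sizes = rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()

sizes.sortBy(-_._2).take(20).foreach { case (idx, n) =>
  println(s"partition $idx -> $n records")
}
```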

Answer

Interesting question. With respect to coalescing versus repartitioning, coalescing would definitely be better as it does not trigger a full shuffle. In general, coalescing is recommended when you have sparse data across partitions (say, after a filter). I think this is a similar scenario, just straight from the initial load. Given what you do with the RDD after the initial load, I really think coalescing would probably be worth it for you.
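
To make the distinction concrete, here is a minimal sketch of the two calls; the target of 500 partitions is an arbitrary placeholder, not a recommendation:

```scala
// coalesce: merges existing partitions without a full shuffle, so small
// partitions are combined cheaply but sizes are only roughly evened out.
val coalesced = rdd.coalesce(500)

// repartition: equivalent to coalesce(500, shuffle = true); it triggers a
// full shuffle but produces evenly sized, hash-distributed partitions.
val repartitioned = rdd.repartition(500)
```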

When data is shuffled as you apply your join to your loaded RDD, Spark consults the shuffle manager to see which implementation of shuffle it should use (configured through spark.shuffle.manager). There are two implementations of the shuffle manager: hash (the default before version 1.2.0) and sort (the default from 1.2.0 on).

If the hash implementation is used, each input partition creates output files to send to the corresponding reducers where the join will take place. This can create a huge blow-up in the number of files, which can be mitigated by setting spark.shuffle.consolidateFiles to true, but it can ultimately leave you with a pretty slow join if there are a lot of partitions as input. If this implementation is used, coalescing is definitely worth it, as a large number of input partitions can yield an unwieldy number of files for the reducers to pull from.
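
For reference, both settings mentioned above are plain Spark configuration properties of the 1.x shuffle machinery discussed here; a sketch of setting them, with values chosen only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Spark 1.x shuffle settings discussed above; the values are illustrative.
val conf = new SparkConf()
  .setAppName("join-job")
  // "hash" (default before 1.2.0) or "sort" (default from 1.2.0 on).
  .set("spark.shuffle.manager", "sort")
  // Only relevant to the hash manager: consolidate the per-reducer files
  // written by map tasks to limit the file explosion.
  .set("spark.shuffle.consolidateFiles", "true")

val sc = new SparkContext(conf)
```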

If the sort implementation is used, there is only one output file per partition (whew!) and the file is indexed such that the reducers can grab their keys from their respective indices. However, with many input partitions, Spark will still be reading from all of the input partitions to gather every possible key. If this implementation is used, coalescing might still be worth it since applying this seek and read to every partition could also be costly.

If you do end up using coalescing, the number of partitions you want to coalesce to is something you will probably have to tune since coalescing will be a step within your execution plan. However, this step could potentially save you a very costly join. Also, as a side note, this post is very helpful in explaining the implementation behind shuffling.
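
Putting the suggestion together, here is a minimal sketch of coalescing before the join; the paths, the key-extraction logic, and the target of 200 partitions are all hypothetical and would need tuning against the real data:

```scala
// Hypothetical paths, key extraction, and partition count; tune for the real job.
val keyed = sc.textFile("s3n://my-bucket/logs/*.gz")
  .coalesce(200)                                // fewer, larger map-side partitions
  .map(line => (line.split("\t")(0), line))     // key by the first tab-separated field

val other = sc.textFile("s3n://my-bucket/dim/*.gz")
  .map(line => (line.split("\t")(0), line))

// The join is where the shuffle described above actually happens.
val joined = keyed.join(other)
joined.saveAsTextFile("s3n://my-bucket/joined-output")
```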
