Spark: increase number of partitions without causing a shuffle?


Question

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (it doesn't require an additional job stage).
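A minimal sketch of that coalesce behaviour (the app name, master and data below are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative setup; any existing SparkContext works the same way.
val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch").setMaster("local[4]"))

val rdd = sc.parallelize(1 to 1000, numSlices = 100)   // start with 100 partitions
val fewer = rdd.coalesce(10)                           // shuffle = false by default, no extra stage

println(fewer.partitions.length)                       // 10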

I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0 - so what would happen is it would split a partition so that the resulting partitions' locations were all on the same node (so very little network IO).
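A sketch of that trade-off, reusing the rdd from the snippet above:

// Growing the partition count with repartition always shuffles, which is
// exactly what this question is trying to avoid; coalesce only shrinks cheaply.
val grown  = rdd.repartition(200)              // 200 partitions, full shuffle
val shrunk = rdd.coalesce(10)                  // 10 partitions, no shuffle

// repartition(n) is simply coalesce(n, shuffle = true) under the hood.
val grownToo = rdd.coalesce(200, shuffle = true)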

This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do that surely there must be a straightforward way of doing it?
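For what it's worth, a hypothetical, untested sketch of that custom-RDD idea: split each parent partition into k child partitions while keeping the parent's preferred locations, so the extra partitions stay node-local. All class and value names here are made up for illustration, not part of any Spark API beyond the overridden RDD methods.

import scala.reflect.ClassTag

import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One child partition per (parent partition, slice) pair.
class SplitPartition(override val index: Int,
                     val parentPartition: Partition,
                     val slice: Int,
                     val slices: Int) extends Partition

// Splits every partition of `prev` into `k` partitions without shuffling.
class LocalSplitRDD[T: ClassTag](prev: RDD[T], k: Int) extends RDD[T](prev) {

  override def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      (0 until k).map(i => new SplitPartition(p.index * k + i, p, i, k): Partition)
    }

  // Narrow dependency: child partition i reads only parent partition i / k.
  override def getDependencies: Seq[Dependency[_]] = Seq(
    new NarrowDependency(prev) {
      override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / k)
    }
  )

  // Keep each child on the same node(s) as its parent partition.
  override def getPreferredLocations(split: Partition): Seq[String] =
    prev.preferredLocations(split.asInstanceOf[SplitPartition].parentPartition)

  // Each of the k children re-reads the parent partition and keeps every k-th element.
  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val s = split.asInstanceOf[SplitPartition]
    prev.iterator(s.parentPartition, context)
      .zipWithIndex
      .collect { case (v, i) if i % s.slices == s.slice => v }
  }
}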

Things tried:

.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..."), which on 1.0.0 causes an error and isn't really what I want - I want the partition number to change across all types of job, not just shuffles.

Answer

I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed to 10? Because having 10, but still using 5, does not make much sense... The process of sending data to the new partitions has to happen at some point.

When doing coalesce, you can get rid of unused partitions, for example: if you had 100 initially, but then after reduceByKey got 10 (as there were only 10 keys), you can use coalesce.
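A small sketch of that example, assuming an existing SparkContext sc:

import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

// 100 input partitions, only 10 distinct keys after reduceByKey,
// so coalesce drops the (mostly empty) extra partitions without a shuffle.
val pairs   = sc.parallelize(1 to 1000, 100).map(i => (i % 10, 1))
val reduced = pairs.reduceByKey(_ + _)     // still 100 partitions by default
val compact = reduced.coalesce(10)

println(compact.partitions.length)         // 10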

If you want the process to go the other way, you could just force some kind of partitioning:

[RDD].partitionBy(new HashPartitioner(100))
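A concrete form of that line, assuming an existing SparkContext sc; note that partitionBy is only available on key-value RDDs and it does trigger a shuffle:

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

val kv            = sc.parallelize(1 to 1000, 5).map(i => (i, i * i))
val repartitioned = kv.partitionBy(new HashPartitioner(100))

println(repartitioned.partitions.length)   // 100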

I'm not sure that's what you're looking for, but I hope so.
