Spark: Is there any rule of thumb about the optimal number of partitions of an RDD and its number of elements?


Problem Description

Is there any relationship between the number of elements an RDD contains and its ideal number of partitions?

I have an RDD with thousands of partitions (because I load it from a source composed of multiple small files; that's a constraint I can't fix, so I have to deal with it). I would like to repartition it (or use the coalesce method), but I don't know in advance the exact number of elements the RDD will contain, so I would like to do it in an automated way. Something that would look like:

val numberOfElements = rdd.count()
val magicNumber = 100000L
// count() returns a Long but coalesce expects an Int, so convert,
// and never ask for fewer than 1 partition
rdd.coalesce(math.max(1, (numberOfElements / magicNumber).toInt))

Is there any rule of thumb about the optimal number of partitions of an RDD and its number of elements?

Thanks.

Recommended Answer

There isn't, because it is highly dependent on the application, resources, and data. There are some hard limitations (like various 2 GB limits), but the rest you have to tune on a task-by-task basis. Some factors to consider (an illustrative sizing sketch follows the list):

  • size of a single row / element
  • cost of a typical operation: if partitions are small and operations are cheap, then the scheduling cost can be much higher than the cost of the data processing itself
  • cost of processing a partition when performing partition-wise operations (a sort, for example)
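
As an illustration only, not a rule from the answer above, here is a hedged sketch that estimates a partition count from a target partition size. The 128 MB target and the sample-based row-size estimate are assumptions:

// Hypothetical sizing sketch: aim for roughly targetPartitionBytes of data
// per partition. The 128 MB target and the sample-based estimate are
// assumptions, not Spark guidance; it also assumes elements whose toString
// length roughly approximates their serialized size.
val targetPartitionBytes = 128L * 1024 * 1024
val sample = rdd.take(1000)
val avgRowBytes =
  if (sample.isEmpty) 1L
  else sample.map(_.toString.getBytes("UTF-8").length.toLong).sum / sample.length
val estimatedTotalBytes = rdd.count() * avgRowBytes
// Guard against zero: coalesce requires a positive number of partitions
val resized = rdd.coalesce(math.max(1, (estimatedTotalBytes / targetPartitionBytes).toInt))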

If the core problem here is the number of initial files, then using some variant of CombineFileInputFormat could be a better idea than repartitioning / coalescing. For example:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Combine many small files into fewer, larger input splits at read time,
// instead of getting one partition per file
sc.hadoopFile(
  path,
  classOf[CombineTextInputFormat],
  classOf[LongWritable], classOf[Text]
).map(_._2.toString)
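
A hedged follow-up: CombineFileInputFormat honors the standard Hadoop split-size property, so the maximum size of each combined split can be capped before reading. The 64 MB value below is an arbitrary illustration, not a recommendation from the answer:

// Assumption: cap the combined split size via the standard Hadoop property;
// 64 MB here is an arbitrary illustrative value
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize",
  (64L * 1024 * 1024).toString
)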

See also: How to calculate the optimal number of partitions for coalesce?
