Why does partition parameter of SparkContext.textFile not take effect?


Problem description

scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21

scala> p.partitions.size
res33: Int = 729

I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.

After calling repartition() as suggested by @zero323:

scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count

I still see 729 tasks in the Spark UI even though the spark-shell prints 8.

Recommended answer

If you take a look at the signature

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] 

you'll see that the argument you use is called minPartitions, and that pretty much describes its function. In some cases even that is ignored, but that is a different matter. The input format used behind the scenes still decides how to compute the splits.
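
As a minimal sketch of what this means in practice (the partition counts below are what one would expect for the file in the question, not verified output): minPartitions is only a lower-bound hint passed to Hadoop's FileInputFormat.getSplits, so a smaller hint cannot reduce the number of splits, while a larger hint can increase it. For reference, when the argument is omitted Spark uses defaultMinPartitions, which is defined as math.min(defaultParallelism, 2).

scala> sc.textFile("file:///c:/_home/so-posts.xml", 2).partitions.size    // still 729; the input format's splits win
scala> sc.textFile("file:///c:/_home/so-posts.xml", 2000).partitions.size // roughly 2000; a larger minimum shrinks the goal split size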

In this particular case you could probably use mapred.min.split.size to increase the split size (this works during loading, as sketched below) or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for that.
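
A minimal sketch of the first option, assuming the legacy mapred.min.split.size key is still honored (newer Hadoop versions spell it mapreduce.input.fileinputformat.split.minsize and keep the old name as a deprecated alias); the 128 MB value and the name p2 are illustrative:

scala> sc.hadoopConfiguration.setLong("mapred.min.split.size", 128L * 1024 * 1024) // no split may be smaller than 128 MB
scala> val p2 = sc.textFile("file:///c:/_home/so-posts.xml")
scala> p2.partitions.size // far fewer than 729, since each split is now at least 128 MB

As for the second option: repartition(8) inserts a shuffle, so p1.count still runs the original 729 read tasks as the first stage of the job, and only the post-shuffle stage runs with 8 tasks. That is why the Spark UI keeps showing 729 tasks even though p1.partitions.size is 8.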
