Why does partition parameter of SparkContext.textFile not take effect?
Question
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21
scala> p.partitions.size
res33: Int = 729
I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.
After calling repartition() as suggested by @zero323:
scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count
I still see 729 tasks in the Spark UI even though the spark-shell prints 8.
Answer
If you take a look at the signature
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
you'll see that the argument you use is called minPartitions, and this pretty much describes its function. In some cases even that is ignored, but that is a different matter. The input format used behind the scenes still decides how to compute the splits.
In this particular case you could probably use mapred.min.split.size to increase the split size (this works during load), or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for that.
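A minimal sketch of both workarounds in the spark-shell, assuming the file is read through the default Hadoop text input format; the 128 MB value is just an illustrative choice, not a recommendation.

scala> // Option 1: raise the minimum split size (in bytes) before loading,
scala> // so the input format creates fewer, larger splits.
scala> sc.hadoopConfiguration.set("mapred.min.split.size", (128 * 1024 * 1024).toString)
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml")

scala> // Option 2: load first, then shuffle the data into exactly 8 partitions.
scala> val p1 = sc.textFile("file:///c:/_home/so-posts.xml").repartition(8)
scala> p1.partitions.size   // 8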