Why is the partition parameter not in effect for SparkContext.textFile?
Problem description
scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21
scala> p.partitions.size
res33: Int = 729
I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.
Edit:
After calling repartition() as suggested by @zero323:
scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count
I still see 729 tasks in the Spark UI even though spark-shell prints 8.
Recommended answer
If you take a look at the signature
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
you'll see that the argument you use is called minPartitions,
and this pretty much describes its function. In some cases even that is ignored, but that is a different matter. The input format which is used behind the scenes still decides how to compute splits.
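To see why minPartitions is only a hint, here is a minimal sketch of the split-size computation. The formula is an assumed simplification of what Hadoop's old-API FileInputFormat does; the object and parameter names, file sizes, and block size below are all illustrative, not Spark's actual source.

```scala
// Assumed simplification of Hadoop's FileInputFormat split-size logic.
object SplitSketch {
  // minPartitions only sets a "goal" size; the result is clamped
  // between the configured minimum split size and the block size.
  def computeSplitSize(totalSize: Long, minPartitions: Int,
                       minSize: Long, blockSize: Long): Long = {
    val goalSize = totalSize / math.max(minPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Approximate number of splits (ceiling division).
  def numSplits(totalSize: Long, minPartitions: Int,
                minSize: Long, blockSize: Long): Long = {
    val split = computeSplitSize(totalSize, minPartitions, minSize, blockSize)
    (totalSize + split - 1) / split
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // A hypothetical ~23 GB file with an assumed 32 MB block size:
    // the goal size (total / 8) exceeds the block size, so the block
    // size wins and far more than 8 splits are produced.
    println(numSplits(729L * 32 * mb, 8, 1L, 32 * mb)) // 729, not 8
    // A small 8 MB file: the goal size (1 MB) wins, giving 8 splits.
    println(numSplits(8 * mb, 8, 1L, 32 * mb)) // 8
  }
}
```

Under these assumptions minPartitions acts as a lower-bound hint on parallelism, not an exact partition count, which matches the 729 seen above.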
In this particular case you could probably use mapred.min.split.size
to increase the split size (this will work during loading) or simply repartition
after loading (this will take effect after the data is loaded), but in general there should be no need for that.
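As a rough sketch of why raising mapred.min.split.size helps during loading: under the assumed clamping formula max(minSize, min(goalSize, blockSize)), a larger minimum split size overrides both the goal size and the block size, producing fewer, larger splits. The object name and the sizes below are illustrative assumptions.

```scala
// Illustrative only: effect of a larger mapred.min.split.size on the
// assumed split-size formula max(minSize, min(goalSize, blockSize)).
object MinSplitSizeEffect {
  def splitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    val blockSize = 32 * mb // assumed file-system block size
    val goalSize  = 4 * mb  // totalSize / minPartitions
    // Default minimum (1 byte): the goal size wins, 4 MB splits.
    println(splitSize(goalSize, 1L, blockSize) / mb)       // 4
    // Minimum raised to 128 MB: it overrides both, 128 MB splits.
    println(splitSize(goalSize, 128 * mb, blockSize) / mb) // 128
  }
}
```

In spark-shell the property could be set before loading, e.g. sc.hadoopConfiguration.setLong("mapred.min.split.size", 128 * 1024 * 1024), though whether it is honoured depends on the input format in use.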