Why is the partition parameter not in effect for SparkContext.textFile?


Problem description

scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21

scala> p.partitions.size
res33: Int = 729

I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.

Edit:

After calling repartition() as suggested by @zero323:

scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count

I still see 729 tasks in the Spark UI even though the spark-shell prints 8.

Answer

If you take a look at the signature

textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] 

you'll see that the parameter you pass is called minPartitions, and that pretty much describes its function: it is a lower-bound hint, not an exact partition count. In some cases it is ignored altogether, but that is a different matter. The input format that is used behind the scenes still decides how to compute the splits.
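A minimal sketch of that behavior, assuming a plain spark-shell session (the file path is the one from the question):

// minPartitions is only a lower-bound hint handed to the Hadoop input format;
// with the default split size, a large file still produces many splits.
val hinted = sc.textFile("file:///c:/_home/so-posts.xml", 8)
println(hinted.partitions.size) // can be far larger than 8 (729 here)

// When the argument is omitted, Spark falls back to defaultMinPartitions,
// which the Spark sources define as math.min(defaultParallelism, 2).
println(sc.defaultMinPartitions)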

In this particular case you could probably use mapred.min.split.size to increase the split size (this takes effect during the load), or simply repartition after loading (this takes effect after the data is loaded), but in general there should be no need for either.
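A minimal sketch of both options, again assuming a spark-shell session; the 128 MB split size is purely illustrative:

// Option 1: raise the Hadoop minimum split size (in bytes) before loading,
// so the input format computes fewer, larger splits.
sc.hadoopConfiguration.setLong("mapred.min.split.size", 128L * 1024 * 1024)
val fewer = sc.textFile("file:///c:/_home/so-posts.xml")

// Option 2: load first, then shrink the partition count afterwards.
// coalesce(8) avoids a shuffle; repartition(8), as used above, forces one.
val p8 = sc.textFile("file:///c:/_home/so-posts.xml").coalesce(8)
println(p8.partitions.size) // 8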
