How does partitioning work in Spark?
Question
I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please?
Here is the scenario:
- a master and two nodes with 1 core each
- a file count.txt of 10 MB in size
How many partitions does the following create?
rdd = sc.textFile("count.txt")
Does the size of the file have any impact on the number of partitions?
Answer
By default, one partition is created for each HDFS block, which is 64 MB by default (from the Spark Programming Guide).
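As a rough illustration of that rule, here is a toy model (not Spark's actual split logic; the helper name and the minimum-partitions floor of 2 are assumptions for this sketch):

```python
import math

def estimate_partitions(file_size_mb, block_size_mb=64, min_partitions=2):
    """Toy model: one split per HDFS block, but never fewer than
    min_partitions (Spark's default minimum is assumed to be 2 here)."""
    blocks = max(1, math.ceil(file_size_mb / block_size_mb))
    return max(blocks, min_partitions)

# A 10 MB file fits inside a single 64 MB block, so the block count
# alone gives 1 and the minimum-partitions floor wins:
print(estimate_partitions(10))   # -> 2
# A 200 MB file spans ceil(200 / 64) = 4 blocks:
print(estimate_partitions(200))  # -> 4
```

So in this model, file size only starts to matter once the file spans more blocks than the minimum partition count.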
It's possible to pass another parameter, minPartitions, which overrides the minimum number of partitions that Spark will create. If you don't override this value, Spark will create at least as many partitions as spark.default.parallelism.
Since spark.default.parallelism is supposed to be the number of cores across all of the machines in your cluster, I believe there would be at least 3 partitions created in your case.
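For reference, that cluster-wide default is typically set at submit time. A hypothetical invocation (the master URL and application script name are placeholders, not part of the question):

```shell
# Hypothetical spark-submit call; spark://master:7077 and count_app.py
# are placeholder names for this example.
spark-submit \
  --master spark://master:7077 \
  --conf spark.default.parallelism=4 \
  count_app.py
```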
You can also repartition or coalesce an RDD to change the number of partitions, which in turn influences the total amount of available parallelism.
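The difference between the two matters: coalesce (without a shuffle) only merges existing partitions and so can never increase their number, while repartition performs a full shuffle and can go either way. A minimal pure-Python sketch of the coalesce behaviour (the function and its round-robin merge strategy are illustrative assumptions, not Spark's actual partition-merging algorithm):

```python
def coalesce(parts, n):
    """Toy model of RDD.coalesce: merge existing partitions without
    shuffling individual records; the count can only shrink."""
    n = min(n, len(parts))  # coalesce never increases the partition count
    merged = [[] for _ in range(n)]
    for i, p in enumerate(parts):
        merged[i % n].extend(p)  # assign whole partitions round-robin
    return merged

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))        # -> [[1, 2, 5, 6], [3, 4, 7, 8]]
print(len(coalesce(parts, 8)))   # asking for more stays at 4
```

Because whole partitions move (not individual records), coalesce is cheaper than repartition, which is why it is the usual choice for shrinking the partition count.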