How does partitioning work in Spark?


Question

I'm trying to understand how partitioning is done in Apache Spark. Can you guys help please?

Here is the scenario:


  • a master and two nodes with 1 core each
  • a file count.txt of 10 MB in size

How many partitions does the following create?

rdd = sc.textFile("count.txt")

Does the size of the file have any impact on the number of partitions?

Answer

By default one partition is created for each HDFS block, which is 64 MB by default (from the Spark Programming Guide).
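As a quick check, you can load the file and ask the resulting RDD how many partitions it got. A minimal PySpark sketch, where the local[2] master string and the app name are placeholders standing in for the two 1-core nodes:

from pyspark import SparkContext

# Two local worker threads stand in for the two 1-core worker nodes.
sc = SparkContext("local[2]", "partition-demo")

rdd = sc.textFile("count.txt")
# A 10 MB file fits inside a single 64 MB HDFS block, so the block
# count alone contributes only one partition here.
print(rdd.getNumPartitions())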

It's possible to pass another parameter, minPartitions (the second argument to textFile), which overrides the minimum number of partitions that Spark will create; its default value is the SparkContext's defaultMinPartitions. If you don't override this value, then Spark will create at least as many partitions as spark.default.parallelism.
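In PySpark this minimum surfaces as the optional second argument to textFile. A minimal sketch, reusing the sc and count.txt from the snippet above; the target of 4 is arbitrary:

# Ask for at least 4 partitions; Spark may split a text file into
# more partitions than this, but never fewer.
rdd = sc.textFile("count.txt", minPartitions=4)
print(rdd.getNumPartitions())  # 4 or more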

Since spark.default.parallelism is supposed to be the number of cores across all of the machines in your cluster, I believe there would be at least 3 partitions created in your case.
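Both values can be inspected on the driver. A minimal sketch, again reusing the sc from above; the second comment reflects how PySpark computes defaultMinPartitions:

print(sc.defaultParallelism)    # derived from spark.default.parallelism
print(sc.defaultMinPartitions)  # PySpark caps this at min(defaultParallelism, 2)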

You can also repartition or coalesce an RDD to change the number of partitions, which in turn influences the total amount of available parallelism.
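A minimal sketch of both operations, reusing the rdd from above; the target counts 8 and 2 are arbitrary:

# repartition can grow or shrink the partition count; it always shuffles.
print(rdd.repartition(8).getNumPartitions())   # 8

# coalesce only shrinks the count and avoids a full shuffle by default.
print(rdd.coalesce(2).getNumPartitions())      # 2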
