Default Partitioning Scheme in Spark


Question

When I execute the following command:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

scala> rdd.partitions.size
res9: Int = 10

scala> rdd.partitioner.isDefined
res10: Boolean = true


scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

It says that there are 10 partitions and partitioning is done using HashPartitioner. But when I execute the command below:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false

It says that there are 4 partitions and no partitioner is defined. So, what is the default partitioning scheme in Spark? How is the data partitioned in the second case?

Answer

You have to distinguish between two different things:

  • partitioning as distributing data between partitions depending on the value of the key, which applies only to PairwiseRDDs (RDD[(T, U)]). This creates a relationship between a partition and the set of keys that can be found on that partition (see the sketch after this list).
  • partitioning as splitting the input into multiple partitions, where the data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
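
To make the first kind concrete, here is a minimal sketch (assuming a spark-shell session where sc is available, so it is illustrative rather than exact): with a HashPartitioner, the partition a pair lands in is derived from its key, so identical keys always end up in the same partition.

import org.apache.spark.HashPartitioner

// Sketch: inspect which partition each key lands in after partitionBy.
val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val byKey = pairs.partitionBy(new HashPartitioner(10))

byKey.mapPartitionsWithIndex { (idx, iter) =>
  iter.map { case (k, _) => (idx, k) }     // (partition index, key)
}.collect().foreach(println)
// (3,4) and (3,6) report the same partition index because they share the key 3.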

In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile), it depends on properties such as mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
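
As a rough illustration of the second kind (again a sketch assuming a spark-shell with sc; the exact chunk boundaries may differ on your setup), glom shows that parallelize simply slices the local collection into consecutive chunks and sets no partitioner:

// Sketch: parallelize splits a local collection into numSlices consecutive chunks.
val chunked = sc.parallelize(1 to 8, 4)

chunked.glom().collect().foreach(chunk => println(chunk.mkString(", ")))
// Typically prints consecutive ranges, e.g.
// 1, 2
// 3, 4
// 5, 6
// 7, 8

println(chunked.partitioner)   // None - the split carries no key information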

So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations that require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default method is to use hash partitioning.
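
As a quick check (a sketch under the same spark-shell assumptions as above), reduceByKey called without an explicit partitioner gives the result a HashPartitioner even though the input had none:

// Sketch: shuffle operations on a PairwiseRDD fall back to hash partitioning.
val kv = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)
println(kv.partitioner)               // None

val reduced = kv.reduceByKey(_ + _)   // no partitioner passed explicitly
println(reduced.partitioner)          // Some(org.apache.spark.HashPartitioner@...)
println(reduced.partitions.size)      // taken from the parent (4 here),
                                      // unless spark.default.parallelism is set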

