Default Partitioning Scheme in Spark


Question

When I execute the following command:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4).partitionBy(new HashPartitioner(10)).persist()
rdd: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[10] at partitionBy at <console>:22

scala> rdd.partitions.size
res9: Int = 10

scala> rdd.partitioner.isDefined
res10: Boolean = true


scala> rdd.partitioner.get
res11: org.apache.spark.Partitioner = org.apache.spark.HashPartitioner@a

It says that there are 10 partitions and partitioning is done using HashPartitioner. But when I execute the command below:

scala> val rdd = sc.parallelize(List((1,2),(3,4),(3,6)),4)
...
scala> rdd.partitions.size
res6: Int = 4
scala> rdd.partitioner.isDefined
res8: Boolean = false

It says that there are 4 partitions and no partitioner is defined. So, what is the default partitioning scheme in Spark? How is the data partitioned in the second case?

Answer

You have to distinguish between two different things:

  • partitioning as distributing data between partitions depending on the value of the key, which applies only to PairwiseRDDs (RDD[(T, U)]). This creates a relationship between a partition and the set of keys that can be found on that partition (see the sketch after this list).
  • partitioning as splitting the input into multiple partitions, where the data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
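
To make the first kind concrete, here is a minimal sketch (assuming a spark-shell session where sc is available, so it is illustrative rather than exact): with a HashPartitioner, the partition a pair lands in is derived from its key, so identical keys always end up in the same partition.

import org.apache.spark.HashPartitioner

// Sketch: inspect which partition each key lands in after partitionBy.
val pairs = sc.parallelize(List((1, 2), (3, 4), (3, 6)))
val byKey = pairs.partitionBy(new HashPartitioner(10))

byKey.mapPartitionsWithIndex { (idx, iter) =>
  iter.map { case (k, _) => (idx, k) }     // (partition index, key)
}.collect().foreach(println)
// (3,4) and (3,6) report the same partition index because they share the key 3.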

In the case of parallelize, data is evenly distributed between partitions using indices. In the case of HadoopInputFormats (like textFile), it depends on properties such as mapreduce.input.fileinputformat.split.minsize / mapreduce.input.fileinputformat.split.maxsize.
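
As a rough illustration of the second kind (again a sketch assuming a spark-shell with sc; the exact chunk boundaries may differ on your setup), glom shows that parallelize simply slices the local collection into consecutive chunks and sets no partitioner:

// Sketch: parallelize splits a local collection into numSlices consecutive chunks.
val chunked = sc.parallelize(1 to 8, 4)

chunked.glom().collect().foreach(chunk => println(chunk.mkString(", ")))
// Typically prints consecutive ranges, e.g.
// 1, 2
// 3, 4
// 5, 6
// 7, 8

println(chunked.partitioner)   // None - the split carries no key information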

So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations that require partitioning on a PairwiseRDD (aggregateByKey, reduceByKey, etc.), the default method is to use hash partitioning.
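
As a quick check (a sketch under the same spark-shell assumptions as above), reduceByKey called without an explicit partitioner gives the result a HashPartitioner even though the input had none:

// Sketch: shuffle operations on a PairwiseRDD fall back to hash partitioning.
val kv = sc.parallelize(List((1, 2), (3, 4), (3, 6)), 4)
println(kv.partitioner)               // None

val reduced = kv.reduceByKey(_ + _)   // no partitioner passed explicitly
println(reduced.partitioner)          // Some(org.apache.spark.HashPartitioner@...)
println(reduced.partitions.size)      // taken from the parent (4 here),
                                      // unless spark.default.parallelism is set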

