How can I partition an RDD respecting order?

Problem Description

I would like to divide an RDD into a number of partitions corresponding to the number of distinct keys I find (3 in this case):

RDD:[(1,a), (1,b), (1,c), (2,d), (3,e), (3,f), (3,g), (3,h), (3,i)]

What I do now is make sure that elements with the same key fall into the same partition:

[(1,a), (1,b), (1,c)]
[(2,d)]
[(3,e), (3,f), (3,g), (3,h), (3,i)]

This is how I partition it:

val partitionedRDD = rdd.partitionBy(new PointPartitioner(
  rdd.keys.distinct().count().toInt))  // count() returns a Long, hence the conversion

This is the PointPartitioner class:

import org.apache.spark.Partitioner

class PointPartitioner(numParts: Int) extends Partitioner {

  override def numPartitions: Int = numParts

  // Every element with the same key is sent to the same partition, so the
  // partition sizes simply mirror the (skewed) key distribution.
  // Note: hashCode % numPartitions can be negative for keys whose hash code
  // is negative; the Int keys used here (1, 2, 3) are safe.
  override def getPartition(key: Any): Int = {
    key.hashCode % numPartitions
  }

  override def equals(other: Any): Boolean = other match {
    case dnp: PointPartitioner =>
      dnp.numPartitions == numPartitions
    case _ =>
      false
  }
}

However, the elements are unbalanced across partitions. What I would like to obtain is an RDD partitioned like this, where all the partitions contain roughly the same number of elements while respecting the order of the keys:

[(1,a), (1,b), (1,c)]
[(2,d), (3,e), (3,f)]
[(3,g), (3,h), (3,i)]

What could I try?

Recommended Answer

Assigning the partitions like this

p1 [(1,a), (1,b), (1,c)]
p2 [(2,d), (3,e), (3,f)]
p3 [(3,g), (3,h), (3,i)]

would mean assigning the same partition key (3) to two different partitions, p2 and p3. Just as a mathematical function cannot have several values for the same argument, a Partitioner cannot return more than one partition for the same key (what would the choice depend on?).

What you could do instead is add something to your partition key, which would give you more buckets (effectively splitting one set into smaller sets). However, you have virtually no control over how Spark places your partitions onto the nodes, so data you wanted on the same node can end up spanning several nodes.
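
For illustration, here is a minimal sketch of that idea, assuming the "something" added to the key is each element's global position (the helper name partitionInOrder is made up for this example and is not part of the original answer). Partitioning on the element index produced by zipWithIndex gives buckets of roughly equal size that follow the original order:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Hypothetical helper: balance partitions while following the original order
// by partitioning on each element's global position instead of on its key.
def partitionInOrder[K, V](rdd: RDD[(K, V)], numParts: Int): RDD[(K, V)] = {
  val total = rdd.count()
  val perPartition = math.max(1L, math.ceil(total.toDouble / numParts).toLong)
  rdd
    .zipWithIndex()                                      // ((k, v), globalIndex)
    .map { case (kv, idx) => (idx / perPartition, kv) }  // bucket id becomes the key
    .partitionBy(new HashPartitioner(numParts))          // small bucket ids map to themselves
    .values                                              // drop the bucket id again
}

Note that this pays for an extra count() and a shuffle, and Spark does not guarantee the order of elements inside a partition after a shuffle, so treat it only as a starting point if strict ordering matters.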

It really boils down to the job you want to perform. I would recommend thinking about the outcome you want and seeing whether you can come up with a smart partition key with a reasonable trade-off (if that is really necessary). Maybe you could keep the values keyed by letter and then use operations like reduceByKey rather than groupByKey to get your final results?
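
As a rough illustration of the reduceByKey suggestion (the aggregation here, counting letters per key, and the SparkContext value sc are assumptions made up for the example):

val rdd = sc.parallelize(Seq(
  (1, "a"), (1, "b"), (1, "c"), (2, "d"),
  (3, "e"), (3, "f"), (3, "g"), (3, "h"), (3, "i")))

// reduceByKey combines values locally on each partition before shuffling,
// whereas groupByKey ships every single value for a key across the network.
val lettersPerKey = rdd.mapValues(_ => 1).reduceByKey(_ + _)
lettersPerKey.collect()   // e.g. Array((1,3), (2,1), (3,5))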
