How does Distinct() function work in Spark?


Problem description


I'm a newbie to Apache Spark and am learning its basic functionality. I have a small doubt: suppose I have an RDD of tuples (key, value) and want to obtain the unique ones out of it, so I use the distinct() function. On what basis does the function consider two tuples distinct? Is it based on the keys, the values, or both?

Recommended answer


.distinct() is definitely doing a shuffle across partitions. To see more of what's happening, run a .toDebugString on your RDD.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val hashPart = new HashPartitioner(<number of partitions>)
val myRDDPreStep = <load some RDD>

// Deduplicate, repartition by key, name the RDD, and keep it around for reuse.
val myRDD = myRDDPreStep
  .distinct
  .partitionBy(hashPart)
  .setName("myRDD")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
myRDD.checkpoint // truncate the lineage once the RDD is materialized
println(myRDD.toDebugString)


For an example RDD of mine (myRDDPreStep is already hash-partitioned by key, persisted with StorageLevel.MEMORY_AND_DISK_SER, and checkpointed), this returns:

(2568) myRDD ShuffledRDD[11] at partitionBy at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
+-(2568) MapPartitionsRDD[10] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
    |    ShuffledRDD[9] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
    +-(2568) MapPartitionsRDD[8] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
        |    myRDDPreStep ShuffledRDD[6] at partitionBy at mycode.scala:193 [Disk Memory Serialized 1x Replicated]
        |        CachedPartitions: 2568; MemorySize: 362.4 GB; TachyonSize: 0.0 B; DiskSize: 0.0 B
        |    myRDD[7] at count at mycode.scala:214 [Disk Memory Serialized 1x Replicated]
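
This also answers the original question. In the RDD API, distinct() is implemented essentially as map(x => (x, null)).reduceByKey((x, _) => x).map(_._1), so the entire element, here the whole (key, value) tuple, acts as the grouping key; that reduceByKey is also where the shuffle in the debug string above comes from. A minimal sketch in local mode (with made-up data) demonstrating the behavior:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("distinctDemo").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 1)))

// distinct() compares whole tuples: ("a", 1) and ("a", 2) are different
// elements, so both survive; only the repeated ("a", 1) collapses.
println(pairs.distinct().collect().sorted.mkString(", "))
// => (a,1), (a,2), (b,1)

sc.stop()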


Note that there may be more efficient ways to get a distinct that involve fewer shuffles, ESPECIALLY if your RDD is already partitioned in a smart way and the partitions are not overly skewed.

See Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?

and Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()? (http://stackoverflow.com/questions/31081563/apache-spark-what-is-the-equivalent-implementation-of-rdd-groupbykey-using-rd)
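
Along the lines of the first linked question, here is a minimal sketch of one such shuffle-free alternative (my own illustration, not from the answer above): when the RDD is already partitioned so that equal tuples are guaranteed to share a partition, as myRDDPreStep above is by being hash-partitioned by key, deduplicating within each partition via mapPartitions is globally correct:

// Hypothetical alternative to distinct(): correct only when equal
// elements always land in the same partition (e.g. hash-partitioned by key).
val noShuffleDistinct = myRDDPreStep.mapPartitions(
  iter => iter.toSet.iterator,     // dedupe within the partition
  preservesPartitioning = true)    // keep the existing partitioner

The trade-off is that each partition's unique elements are materialized in a Set in memory, so this favors many modestly sized partitions over a few huge or skewed ones.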
