How does Distinct() function work in Spark?


Problem description


I'm a newbie to Apache Spark and am learning its basic functionality. I have a small doubt: suppose I have an RDD of tuples (key, value) and want to obtain the unique ones out of it, so I use the distinct() function. On what basis does the function consider two tuples distinct? Is it based on the keys, the values, or both?

Recommended answer


.distinct() is definitely doing a shuffle across partitions. To see more of what's happening, run a .toDebugString on your RDD.

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val hashPart = new HashPartitioner(<number of partitions>)
val myRDDPreStep = <load some RDD>

// Deduplicate, repartition by key, name the RDD, and keep it around for reuse.
val myRDD = myRDDPreStep
  .distinct
  .partitionBy(hashPart)
  .setName("myRDD")
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
myRDD.checkpoint // truncate the lineage once the RDD is materialized
println(myRDD.toDebugString)


For an example RDD of mine (myRDDPreStep is already hash-partitioned by key, persisted with StorageLevel.MEMORY_AND_DISK_SER, and checkpointed), this returns:

(2568) myRDD ShuffledRDD[11] at partitionBy at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
+-(2568) MapPartitionsRDD[10] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
    |    ShuffledRDD[9] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
    +-(2568) MapPartitionsRDD[8] at distinct at mycode.scala:223 [Disk Memory Serialized 1x Replicated]
        |    myRDDPreStep ShuffledRDD[6] at partitionBy at mycode.scala:193 [Disk Memory Serialized 1x Replicated]
        |        CachedPartitions: 2568; MemorySize: 362.4 GB; TachyonSize: 0.0 B; DiskSize: 0.0 B
        |    myRDD[7] at count at mycode.scala:214 [Disk Memory Serialized 1x Replicated]
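
This also answers the original question. In the RDD API, distinct() is implemented essentially as map(x => (x, null)).reduceByKey((x, _) => x).map(_._1), so the entire element, here the whole (key, value) tuple, acts as the grouping key; that reduceByKey is also where the shuffle in the debug string above comes from. A minimal sketch in local mode (with made-up data) demonstrating the behavior:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("distinctDemo").setMaster("local[*]"))
val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 2), ("b", 1)))

// distinct() compares whole tuples: ("a", 1) and ("a", 2) are different
// elements, so both survive; only the repeated ("a", 1) collapses.
println(pairs.distinct().collect().sorted.mkString(", "))
// => (a,1), (a,2), (b,1)

sc.stop()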


Note that there may be more efficient ways to get a distinct that involve fewer shuffles, ESPECIALLY if your RDD is already partitioned in a smart way and the partitions are not overly skewed.

See Is there a way to rewrite Spark RDD distinct to use mapPartitions instead of distinct?

and Apache Spark: What is the equivalent implementation of RDD.groupByKey() using RDD.aggregateByKey()? (http://stackoverflow.com/questions/31081563/apache-spark-what-is-the-equivalent-implementation-of-rdd-groupbykey-using-rd)
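
Along the lines of the first linked question, here is a minimal sketch of one such shuffle-free alternative (my own illustration, not from the answer above): when the RDD is already partitioned so that equal tuples are guaranteed to share a partition, as myRDDPreStep above is by being hash-partitioned by key, deduplicating within each partition via mapPartitions is globally correct:

// Hypothetical alternative to distinct(): correct only when equal
// elements always land in the same partition (e.g. hash-partitioned by key).
val noShuffleDistinct = myRDDPreStep.mapPartitions(
  iter => iter.toSet.iterator,     // dedupe within the partition
  preservesPartitioning = true)    // keep the existing partitioner

The trade-off is that each partition's unique elements are materialized in a Set in memory, so this favors many modestly sized partitions over a few huge or skewed ones.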
