What are the Spark transformations that cause a shuffle?

Views: 126 Published: 2016/5/22 15:16:51 Tags: python scala apache-spark

Question

I am having trouble finding, in the Spark documentation, which operations cause a shuffle and which do not. In this list, which ones cause a shuffle and which ones do not?

map and filter do not. However, I am not sure about the others.

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks])
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)

Solution

It is actually very easy to find this out without the documentation. For any of these functions, just create an RDD and call toDebugString on it. Here is one example; you can do the rest on your own.

scala> val a = sc.parallelize(Array(1,2,3)).distinct
scala> a.toDebugString
MappedRDD[5] at distinct at <console>:12 (1 partitions)
  MapPartitionsRDD[4] at distinct at <console>:12 (1 partitions)
    **ShuffledRDD[3] at distinct at <console>:12 (1 partitions)**
      MapPartitionsRDD[2] at distinct at <console>:12 (1 partitions)
        MappedRDD[1] at distinct at <console>:12 (1 partitions)
          ParallelCollectionRDD[0] at parallelize at <console>:12 (1 partitions)

So, as you can see, distinct creates a shuffle: a ShuffledRDD appears in its lineage. It is also particularly important to find this out this way rather than from the docs, because there are situations where a shuffle will or will not be required for a given function. For example, join usually requires a shuffle, but if you join two RDDs that branch from the same RDD, Spark can sometimes elide the shuffle.
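
The same check can be done programmatically rather than by reading the debug string. The following is only a minimal sketch, assuming a local Spark installation: the object name ShuffleCheck, the helper shuffleIds, the sample data, and the partition counts are all made up for illustration. It walks an RDD's lineage and collects the ids of every ShuffleDependency it finds, then compares a join on plain RDDs with a join on two RDDs that branch from one pre-partitioned RDD.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleCheck {
  // Hypothetical helper: ids of all shuffle dependencies in the lineage of `rdd`.
  // An empty set means no shuffle is needed to compute the RDD.
  def shuffleIds(rdd: RDD[_]): Set[Int] =
    rdd.dependencies.flatMap {
      case s: ShuffleDependency[_, _, _] => shuffleIds(s.rdd) + s.shuffleId
      case d                             => shuffleIds(d.rdd)
    }.toSet

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two threads, purely for experimentation.
    val conf = new SparkConf().setAppName("shuffle-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val nums  = sc.parallelize(1 to 100, 4)
      val pairs = nums.map(n => (n % 10, n))

      // Number of shuffles in the lineage of a few transformations from the list.
      println(s"map:         ${shuffleIds(nums.map(_ + 1)).size}")
      println(s"filter:      ${shuffleIds(nums.filter(_ % 2 == 0)).size}")
      println(s"distinct:    ${shuffleIds(nums.distinct()).size}")
      println(s"reduceByKey: ${shuffleIds(pairs.reduceByKey(_ + _)).size}")

      // A join on RDDs that have no partitioner.
      val plainJoin = pairs.join(pairs.mapValues(_ * 2))
      println(s"plain join:  ${shuffleIds(plainJoin).size}")

      // Two RDDs branching from one RDD that is already hash-partitioned.
      // mapValues keeps the partitioner, so the join can reuse it.
      val base   = pairs.partitionBy(new HashPartitioner(4))
      val joined = base.mapValues(_ + 1).join(base.mapValues(_ * 2))
      // If the join adds no shuffle of its own, the only shuffle left in the
      // lineage is the one introduced by partitionBy.
      println(s"join adds a shuffle: ${shuffleIds(joined) != shuffleIds(base)}")
      println(joined.toDebugString)
    } finally {
      sc.stop()
    }
  }
}
```

The exact lineage printed by toDebugString differs between Spark versions, so treat the printed counts as something to verify on your own cluster rather than as a fixed rule.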
