Spark: Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()


Question

I know that some Spark Actions like collect() cause performance issues.

As quoted in the documentation:

"To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println)."
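The difference between the two patterns can be sketched without Spark at all. The snippet below is a minimal plain-Python stand-in (big_dataset is a hypothetical lazy source, not a real RDD): a collect()-style call materializes every element in driver memory, while a take(n)-style call materializes only n.

```python
from itertools import islice

# Hypothetical lazy "dataset" standing in for an RDD: nothing is
# materialized until elements are actually requested.
def big_dataset():
    for i in range(10_000_000):
        yield i

# collect()-style: pulls EVERY element into driver memory at once.
# collected = list(big_dataset())  # ~10M ints in one list - risky

# take(n)-style: materializes only the first n elements.
first_100 = list(islice(big_dataset(), 100))
print(first_100[:5])   # [0, 1, 2, 3, 4]
print(len(first_100))  # 100
```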

And from one more related SE question: Spark runs out of memory when grouping by key (http://stackoverflow.com/questions/22637518/spark-runs-out-of-memory-when-grouping-by-key)

I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
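Why groupByKey() is the riskier of the two can be illustrated with a plain-Python sketch of the shuffle (simulated partitions, not real Spark): groupByKey ships every (key, value) pair across the shuffle, while reduceByKey combines map-side first, so at most one record per key per partition crosses the network.

```python
from collections import defaultdict

# Two simulated partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: every pair crosses the shuffle unchanged.
shuffled_group = [kv for part in partitions for kv in part]

# reduceByKey-style: combine locally (map-side) first, so at most one
# record per key per partition crosses the shuffle.
def combine_locally(part):
    acc = defaultdict(int)
    for k, v in part:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = [kv for part in partitions for kv in combine_locally(part)]

print(len(shuffled_group))   # 7 records shuffled
print(len(shuffled_reduce))  # 4 records shuffled

# The final reduce-side merge gives identical totals either way.
totals = defaultdict(int)
for k, v in shuffled_reduce:
    totals[k] += v
print(dict(totals))  # {'a': 4, 'b': 3}
```

The gap widens with real data: with millions of values per key, groupByKey must buffer all of them for one key on one executor, which is exactly the out-of-memory scenario in the linked question.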

I did not get enough evidence on other Transformations and Actions which have to be used with caution.

Are these three the only commands to be tackled? I have doubts about the commands below too:

1. aggregateByKey()
2. sortByKey() 
3. persist() / cache()
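Of the three listed, aggregateByKey() behaves like reduceByKey() in the good sense: it also combines map-side before shuffling. A minimal plain-Python sketch of its semantics (simulated partitions, not real Spark; zero/seq_op/comb_op mirror the real API's parameters) shows how a small accumulator per key is all that crosses the shuffle:

```python
from collections import defaultdict

# aggregateByKey-style sketch: a zero value plus two functions.
# seq_op folds values into a partition-local accumulator;
# comb_op merges accumulators across partitions.
zero = (0, 0)  # (sum, count) - accumulator for a per-key mean

def seq_op(acc, value):          # within one partition
    return (acc[0] + value, acc[1] + 1)

def comb_op(a, b):               # across partitions
    return (a[0] + b[0], a[1] + b[1])

partitions = [
    [("a", 2), ("a", 4), ("b", 6)],
    [("a", 6), ("b", 2)],
]

# Map side: one small accumulator per key per partition.
local = []
for part in partitions:
    accs = defaultdict(lambda: zero)
    for k, v in part:
        accs[k] = seq_op(accs[k], v)
    local.append(dict(accs))

# Reduce side: merge the accumulators, then finish the computation.
merged = defaultdict(lambda: zero)
for accs in local:
    for k, acc in accs.items():
        merged[k] = comb_op(merged[k], acc)

means = {k: s / c for k, (s, c) in merged.items()}
print(means)  # {'a': 4.0, 'b': 4.0}
```

sortByKey() and persist()/cache() raise different concerns: sortByKey triggers a full shuffle (range partitioning), and persist/cache trade memory for recomputation, so an ill-chosen storage level can itself cause memory pressure.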

It would be great if you could provide information on intensive commands (global across partitions instead of a single partition, or low-performance commands) which have to be tackled with better guarding.

Accepted Answer

You have to consider three types of operations:
