Spark: Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()
Question
I know that some Spark actions like collect() cause performance issues.
As quoted in the docs:

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
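Since a Spark cluster may not be at hand, the collect()-versus-take() trade-off can be sketched in plain Python, with a lazy generator standing in for an RDD (all names here are hypothetical, not Spark API):

```python
from itertools import islice

def rdd_like():
    """Stand-in for a large RDD: a lazy stream of a million records."""
    return (i * i for i in range(1_000_000))

# collect()-style: materializes ALL elements in driver memory at once.
collected = list(rdd_like())

# take(100)-style: pulls only a bounded prefix -- safe however big the data is.
taken = list(islice(rdd_like(), 100))

print(len(collected))  # 1000000
print(len(taken))      # 100
```

The point of take(n) is that the driver's memory footprint is bounded by n, independent of the dataset size.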
And from one more related SE question: Spark runs out of memory when grouping by key
I have come to know that groupByKey() and reduceByKey() may cause out-of-memory errors if parallelism is not set properly.
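The memory difference between the two can be sketched without Spark: a plain-Python simulation (hypothetical data, not Spark API) of what each side buffers per key during a shuffle:

```python
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# groupByKey-style: every value crosses the shuffle and is buffered per key,
# so memory grows with the number of VALUES for the largest key.
grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)

# reduceByKey-style: values are combined map-side before the shuffle,
# so per-key state is one running total -- memory grows with the number of KEYS.
reduced = defaultdict(int)
for k, v in pairs:
    reduced[k] += v

print(dict(grouped))  # {'a': [1, 3, 4], 'b': [2, 5]}
print(dict(reduced))  # {'a': 8, 'b': 7}
```

This is why a hot key can blow up groupByKey while the same data reduces fine with reduceByKey.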
I did not find enough evidence about other transformations and actions that have to be used with caution.
Are these three the only commands to be tackled? I have doubts about the commands below too:
1. aggregateByKey()
2. sortByKey()
3. persist() / cache()
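For aggregateByKey in particular, its three-argument contract (a zero value, a seqOp merging a value into a per-partition accumulator, a combOp merging accumulators across partitions) can be simulated in plain Python; the partition layout below is invented for illustration:

```python
# Simulated aggregateByKey computing (sum, count) per key across partitions.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", 4), ("a", 5)],
]
zero = (0, 0)
seq_op = lambda acc, v: (acc[0] + v, acc[1] + 1)     # value into accumulator
comb_op = lambda x, y: (x[0] + y[0], x[1] + y[1])    # accumulator into accumulator

# Map side: each partition builds its own per-key accumulators.
per_partition = []
for part in partitions:
    acc = {}
    for k, v in part:
        acc[k] = seq_op(acc.get(k, zero), v)
    per_partition.append(acc)

# Reduce side: only the small accumulators are shuffled and merged.
merged = {}
for acc in per_partition:
    for k, a in acc.items():
        merged[k] = comb_op(merged.get(k, zero), a)

print(merged)  # {'a': (9, 3), 'b': (6, 2)}
```

Because only accumulators cross the shuffle, aggregateByKey behaves like reduceByKey rather than groupByKey with respect to memory.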
It would be great if you could provide information on intensive commands (global across partitions rather than within a single partition, or low-performance commands) that have to be tackled with better guarding.
Answer
You have to consider three types of operations:
- Transformations implemented using only mapPartitions(WithIndex), like filter, map, flatMap etc. Typically this will be the safest group. Probably the biggest issue you can encounter is extensive spilling to disk.
- Transformations which require a shuffle. This includes obvious suspects like the different variants of combineByKey (groupByKey, reduceByKey, aggregateByKey) or join, and less obvious ones like sortBy, distinct or repartition. Without context (data distribution, exact function for reduction, partitioner, resources) it is hard to tell whether a particular transformation will be problematic. There are two main factors:
  - network traffic and disk IO - any operation which is not performed in memory will be at least an order of magnitude slower
  - skewed data distribution - if the distribution is highly skewed, the shuffle can fail or subsequent operations may suffer from a suboptimal resource allocation
- Operations which require passing data to and from the driver. Typically this covers actions like collect or take, and creating a distributed data structure from a local one (parallelize). Other members of this category are broadcasts (including automatic broadcast joins) and accumulators. The total cost depends of course on the particular operation and the amount of data.

While some of these operations can be expensive, none is particularly bad by itself (including the demonized groupByKey). Obviously it is better to avoid network traffic or additional disk IO, but in practice you cannot avoid it in any complex application.

Regarding cache you may find Spark: Why do I have to explicitly tell what to cache? useful.
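The broadcast-join idea mentioned above (ship the small side whole to every worker and join map-side, avoiding any shuffle of the large side) can be sketched in plain Python; the tables here are invented for illustration:

```python
# Broadcast side: a small lookup table every worker receives in full.
small = {"US": "United States", "DE": "Germany"}

# Large side: records that stay in place and are joined map-side.
large = [("US", 10), ("DE", 20), ("US", 30), ("FR", 40)]

# Inner join: each record is matched locally; unmatched keys ("FR") drop out.
joined = [
    (code, amount, small[code])
    for code, amount in large
    if code in small
]
print(joined)
```

The cost moves from a full shuffle of the large table to one broadcast of the small one, which is exactly why it only pays off when one side is genuinely small.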