Should the DataFrame function groupBy be avoided?
Question
This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?
I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).
Essentially, what I want to do is:
# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)
However, the DataFrame API does not offer a "reduce" option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.
Answer
A DataFrame groupBy followed by an agg will not move the data around unnecessarily; see here for a good example. Hence, there is no need to avoid it.
When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and to use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group, which can be achieved with the max function:
from pyspark.sql import functions as F
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
In addition to max, there are a multitude of functions that can be used in combination with agg; see here for all operations.