Should the DataFrame function groupBy be avoided?


Problem Description

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?

I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).

Essentially, what I want to do is:

# Non-working pseudocode
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)

However, the DataFrame API does not offer a "reduce" option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.

Recommended Answer

A DataFrame groupBy followed by an agg will not move the data around unnecessarily, see here for a good example. Under the hood, Spark performs a partial aggregation on each partition before the shuffle, much like a map-side combine, so only partial results per key are exchanged. Hence, there is no need to avoid it.

When using the RDD API, the opposite is true. There it is preferable to avoid groupByKey and to use reduceByKey or combineByKey where possible, as sketched below. Some situations, however, do require one to use groupByKey.
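As a minimal sketch of the difference, assuming a hypothetical RDD of (key, value) pairs: reduceByKey combines values within each partition before shuffling, while groupByKey ships every individual value across the network.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical (key, value) pairs, just for illustration.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey merges values within each partition first (a map-side
# combine), so only one partial result per key crosses the network.
sums = pairs.reduceByKey(lambda x, y: x + y)
print(sums.collect())  # [('a', 4), ('b', 6)] (order may vary)

# groupByKey shuffles every individual value before any reduction
# can happen, which is why it is avoided when possible.
grouped = pairs.groupByKey().mapValues(sum)
print(grouped.collect())  # same result, larger shuffle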

The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group, which can be achieved with the max function:

from pyspark.sql import functions as F

# Aggregate within each group "A"; Spark computes partial maxima on
# each partition before shuffling, so full rows are never moved around.
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))
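One way to confirm that rows are reduced locally before the shuffle is to inspect the physical plan (a sketch, reusing the joined_df from above):

# The plan typically shows a partial_max HashAggregate before the
# Exchange (shuffle) and the final HashAggregate after it.
joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue")).explain()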

In addition to max, there are many other functions that can be used in combination with agg; see here for all operations.
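As a brief sketch of combining several of these in one groupBy pass (column names assumed from the example above):

from pyspark.sql import functions as F

# Several aggregations can be computed together in a single pass.
summary = joined_df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.min("TotalValue").alias("MinValue"),
    F.avg("TotalValue").alias("AvgValue"),
    F.count("TotalValue").alias("RowCount"),
)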

