Should the DataFrame function groupBy be avoided?

Problem description

This link and others tell me that the Spark groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well? Or is this something different?

I'm asking this because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this (frankly, I find the documentation quite lacking).

Basically, what I want to do is this:

# Non-working pseudocode: DataFrames have no reduce method
df.groupBy("A").reduce(lambda x, y: x if x.TotalValue > y.TotalValue else y)

However, the DataFrame API does not offer a "reduce" option. I'm probably misunderstanding what exactly the DataFrame API is trying to achieve.

Recommended answer

A DataFrame groupBy followed by an agg will not move the data around unnecessarily; see here for a good example. Hence, there is no need to avoid it.
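
You can check this yourself by inspecting the physical plan. The sketch below is illustrative only: the toy DataFrame and column names (A, TotalValue) are assumptions borrowed from the question, and the plan Spark prints typically shows a partial HashAggregate on each partition before the shuffle Exchange, followed by a final aggregate.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with the columns used in the question.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 3.0), ("b", 2.0)],
    ["A", "TotalValue"],
)

# The physical plan usually shows a partial HashAggregate per partition
# before the Exchange (shuffle), then a final HashAggregate after it.
df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue")).explain()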

When using the RDD API, the opposite is true. Here it is preferable to avoid groupByKey and to use reduceByKey or combineByKey where possible. Some situations, however, do require one to use groupByKey.
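
For completeness, here is a minimal RDD sketch of a per-key maximum (the data and variable names are made up for illustration): reduceByKey merges values locally on each partition before anything is shuffled, whereas groupByKey ships every value for a key across the network first.

from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext

# Hypothetical pair RDD of (key, value) tuples.
pairs = sc.parallelize([("a", 1.0), ("a", 3.0), ("b", 2.0)])

# Preferred: values are combined on each node before the shuffle.
max_per_key = pairs.reduceByKey(lambda x, y: x if x > y else y)

# Works, but moves every value for a key across the network first:
# max_per_key = pairs.groupByKey().mapValues(max)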

The normal way to do this type of operation with the DataFrame API is to use groupBy followed by an aggregation using agg. In your example case, you want to find the maximum value of a single column for each group; this can be achieved with the max function:

from pyspark.sql import functions as F

joined_df.groupBy("A").agg(F.max("TotalValue").alias("MaxValue"))

In addition to max, there are many other functions that can be used in combination with agg; see here for all operations.
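
As an illustrative sketch (column names again borrowed from the question), several of these functions can be combined in a single agg call, so all aggregates are computed in one pass over the groups:

from pyspark.sql import functions as F

# Hypothetical combination of aggregations computed together per group.
summary = joined_df.groupBy("A").agg(
    F.max("TotalValue").alias("MaxValue"),
    F.avg("TotalValue").alias("AvgValue"),
    F.count("*").alias("RowCount"),
)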
