How to get other columns when using Spark DataFrame groupby?

Question

When I use DataFrame groupby like this:

df.groupBy(df("age")).agg(Map("id"->"count"))

I will only get a DataFrame with the columns "age" and "count(id)", but in df there are many other columns, like "name".

In short, I want the same result as this query in MySQL:

"select name,age,count(id) from df group by age"

What should I do when using groupBy in Spark?

Answer

Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most of the major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
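
For example, a minimal sketch of the join approach in Scala, assuming df has the "name", "age" and "id" columns from the question:

import org.apache.spark.sql.functions.count

// Aggregate first, then join the per-age counts back onto the original rows.
val counts = df.groupBy("age").agg(count("id").alias("count_id"))
val result = df
  .join(counts, Seq("age"))
  .select("name", "age", "count_id")

This mirrors the query you would write by hand in those databases: the aggregate runs once, and the join fans the counts back out to every matching row.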

Since for aggregations like count the result is not well defined, and behavior tends to vary across the systems that do support this type of query, you can simply include the additional columns using an arbitrary aggregate such as first or last.
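
A sketch of that workaround with the same assumed columns; note that without an ordering, which "name" first returns per age group is not deterministic:

import org.apache.spark.sql.functions.{count, first}

// first("name") keeps one (arbitrary) name per group alongside the count.
val result = df
  .groupBy("age")
  .agg(first("name").alias("name"), count("id").alias("count_id"))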

In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
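
A sketch of the window-function variant, again assuming the same columns; every row keeps all of its columns and simply gains the per-age count, at the cost of a shuffle for the window partition:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// Compute count(id) over a window partitioned by age, without collapsing rows.
val byAge = Window.partitionBy("age")
val result = df
  .withColumn("count_id", count("id").over(byAge))
  .select("name", "age", "count_id")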
