How to get other columns when using Spark DataFrame groupby?


Problem description

When I use DataFrame groupby like this:

df.groupBy(df("age")).agg(Map("id"->"count"))

I will only get a DataFrame with the columns "age" and "count(id)", but in df there are many other columns, like "name".

Overall, I want to get the same result as this MySQL query:

select name, age, count(id) from df group by age

What should I do when using groupby in Spark?

Recommended answer

Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most of the major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
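For example, a minimal sketch in Scala, assuming (as in the question) a DataFrame df with columns "name", "age" and "id"; the alias id_count stands in for count(id), since parentheses in a column name would need quoting:

import org.apache.spark.sql.functions.{col, count}

// Aggregate per age, then join the counts back onto the original rows.
val counts = df.groupBy("age").agg(count(col("id")).as("id_count"))
val result = df.join(counts, Seq("age")).select("name", "age", "id_count")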

Since for aggregations like count the result is not well defined, and behavior tends to vary across systems that support this type of query, you can simply include the additional columns using an arbitrary aggregate such as first or last.
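For instance, a sketch against the same assumed df:

import org.apache.spark.sql.functions.{col, count, first}

// Keep one arbitrary name per age group; which row it comes from is not guaranteed.
val result = df.groupBy("age").agg(
  first(col("name")).as("name"),
  count(col("id")).as("id_count")
)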

In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
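A sketch of the window-function variant, again against the assumed df; for count no trailing where is needed, because the count over each partition is simply attached to every row:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

// Attach the per-age count to every row instead of collapsing the groups.
val w = Window.partitionBy("age")
val result = df
  .withColumn("id_count", count(col("id")).over(w))
  .select("name", "age", "id_count")

Note that this keeps every row of df rather than one row per group, which is why it can be more expensive than a plain groupBy.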
