How to get other columns when using Spark DataFrame groupBy?

Question
When I use DataFrame groupBy like this:
df.groupBy(df("age")).agg(Map("id"->"count"))
I only get a DataFrame with the columns "age" and "count(id)", but df has many other columns, such as "name".
In short, I want the same result as this MySQL query:
"select name, age, count(id) from df group by age"
What should I do when using groupBy in Spark?
Answer
Long story short, in general you have to join the aggregated results with the original table. Spark SQL follows the same pre-SQL:1999 convention as most major databases (PostgreSQL, Oracle, MS SQL Server), which doesn't allow additional columns in aggregation queries.
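The join-back approach can be sketched as follows; the toy df with "name", "age", and "id" columns is an assumption based on the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder()
  .appName("groupby-join-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data matching the question's column names.
val df = Seq(
  ("Alice", 20, 1),
  ("Bob",   20, 2),
  ("Carol", 30, 3)
).toDF("name", "age", "id")

// Aggregate per age, then join back on "age" to recover the other columns.
val counts = df.groupBy("age").agg(count("id").as("count_id"))
df.join(counts, Seq("age")).show()
```

Each original row reappears alongside its group's count, which mirrors what the MySQL query in the question would return on engines that allow it.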
Since for aggregations like count the extra columns are not well defined, and behavior tends to vary across systems that do support this type of query, you can simply include the additional columns using an arbitrary aggregate such as first or last.
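A minimal sketch of the first/last variant, again assuming a toy df with "name", "age", and "id" columns; note that which "name" survives per group is arbitrary:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, first}

val spark = SparkSession.builder()
  .appName("groupby-first-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("Alice", 20, 1),
  ("Bob",   20, 2),
  ("Carol", 30, 3)
).toDF("name", "age", "id")

// Carry "name" through the aggregation with first; no join needed,
// but the chosen name per group is not deterministic.
df.groupBy("age")
  .agg(count("id").as("count_id"), first("name").as("name"))
  .show()
```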
In some cases you can replace agg with a select using window functions and a subsequent where, but depending on the context it can be quite expensive.
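The window-function variant might look like this (same assumed toy df); partitioning by "age" computes the count per group while keeping every row:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder()
  .appName("groupby-window-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("Alice", 20, 1),
  ("Bob",   20, 2),
  ("Carol", 30, 3)
).toDF("name", "age", "id")

// A window partitioned by "age": count("id") is evaluated per partition,
// and every input row is preserved in the output.
val w = Window.partitionBy("age")
df.select($"name", $"age", count("id").over(w).as("count_id")).show()
```

Window aggregates avoid the explicit join, but without a partition-pruning predicate they can trigger a full shuffle, which is why the answer warns they may be expensive.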