Spark Dataset groupBy and sum
Question
I am using Spark 1.6.1 with Java as the programming language. The following code works fine with DataFrames:
simpleProf.groupBy(col("col1"), col("col2"))
    .agg(
        sum("CURRENT_MONTH"),
        sum("PREVIOUS_MONTH")
    );
However, it does not work with Datasets. Any idea how to do the same with a Dataset in Java/Spark?
Cheers
Answer
It does not work in the sense that after the groupBy I get a GroupedDataset object, and when I try to apply the agg function it requires a TypedColumn instead of a Column.
Ah, there was some confusion here because of the merging of Dataset and DataFrame in Spark 2.x, where there is a groupBy that works with relational columns and a groupByKey that works with typed columns. So, given that you are using an explicit Dataset in 1.6, the solution is to typify your columns via the .as method:

sum("CURRENT_MONTH").as[Int]

That is the Scala syntax; in the Java API the same typification is done by passing an explicit Encoder to .as, along the lines of sum("CURRENT_MONTH").as(Encoders.INT()).
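For reference, the result this aggregation should produce can be sketched without a Spark cluster. The plain-Java sketch below (the Row class and its sample values are hypothetical, not from the question) groups rows by the (col1, col2) pair and sums the CURRENT_MONTH and PREVIOUS_MONTH columns, which is what groupBy(...).agg(sum(...), sum(...)) computes:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupBySumSketch {
    // Hypothetical row type standing in for a Spark Row with the four
    // columns used in the question: col1, col2, CURRENT_MONTH, PREVIOUS_MONTH.
    public static class Row {
        final String col1;
        final String col2;
        final long currentMonth;
        final long previousMonth;

        public Row(String col1, String col2, long currentMonth, long previousMonth) {
            this.col1 = col1;
            this.col2 = col2;
            this.currentMonth = currentMonth;
            this.previousMonth = previousMonth;
        }
    }

    // Group rows by the (col1, col2) key pair and sum both numeric columns,
    // mirroring groupBy(col("col1"), col("col2"))
    //              .agg(sum("CURRENT_MONTH"), sum("PREVIOUS_MONTH")).
    // Each key maps to a long[2]: {sum of CURRENT_MONTH, sum of PREVIOUS_MONTH}.
    public static Map<List<String>, long[]> groupAndSum(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                r -> Arrays.asList(r.col1, r.col2),
                Collectors.reducing(
                        new long[] {0L, 0L},
                        r -> new long[] {r.currentMonth, r.previousMonth},
                        (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]})));
    }
}
```

This is only meant to pin down the expected semantics of the aggregation; in Spark the same shape is produced distributedly, with the typed-column caveat discussed above.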