Spark Dataframe groupBy with sequence as keys arguments

Published: 2021/11/14 22:28:04 | Tags: scala, apache-spark, apache-spark-sql

I have a Spark DataFrame and I want to aggregate values by multiple keys.

As the Spark documentation suggests:

def groupBy(col1: String, cols: String*): GroupedData

Groups the DataFrame using the specified columns, so we can run aggregation on them.

So I do the following:

 val keys = Seq("a", "b", "c")
 dataframe.groupBy(keys:_*).agg(...)

IntelliJ IDEA reports the following errors:

expansion for non repeated parameters

Type mismatch: expected Seq[Column], actual Seq[String]

However, I can pass multiple arguments manually without errors:

dataframe.groupBy("a", "b", "c").agg(...)

So, my question is: How can I do this programmatically?

Solution

Either use columns with groupBy(cols: Column*):

import org.apache.spark.sql.functions.col

val keys = Seq("a", "b", "c").map(col(_))
dataframe.groupBy(keys:_*).agg(...)
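For reference, here is a minimal end-to-end sketch of the Column variant. The sample rows, the value column v, the alias total, and the local SparkSession settings are all assumptions made for illustration. None of them come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Local session for illustration only; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupBy-keys-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: three key columns (a, b, c) and one value column (v).
val dataframe = Seq(
  ("x", "y", "z", 1),
  ("x", "y", "z", 2),
  ("p", "q", "r", 5)
).toDF("a", "b", "c", "v")

// Turn the column names into Column objects so that groupBy(cols: Column*) applies.
val keys = Seq("a", "b", "c").map(col(_))

// keys: _* expands the Seq[Column] into the repeated Column parameter.
val result = dataframe.groupBy(keys: _*).agg(sum("v").as("total"))

result.show()
```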

or head / tail with groupBy(col1: String, cols: String*):

val keys = Seq("a", "b", "c") 
dataframe.groupBy(keys.head, keys.tail: _*).agg(...)
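The original call is rejected because `: _*` can only fill a repeated parameter. In groupBy(col1: String, cols: String*), col1 is an ordinary parameter, so the expansion cannot cover it. The Column* overload does not fit either, because the sequence holds strings. Below is a Spark-free sketch of the same situation. groupByNames is a hypothetical stand-in that only copies the parameter shape and is not a Spark API.

```scala
// Hypothetical stand-in with the same parameter shape as groupBy(col1: String, cols: String*).
// It is not part of Spark; it simply returns the names it received.
def groupByNames(col1: String, cols: String*): Seq[String] = col1 +: cols

val keys = Seq("a", "b", "c")

// groupByNames(keys: _*)  // does not compile: col1 is not a repeated parameter

// The first name fills col1; the remaining names expand into the repeated parameter.
val viaHeadTail = groupByNames(keys.head, keys.tail: _*)

// keys.head throws on an empty Seq, so a pattern match makes that case explicit.
val viaMatch = keys match {
  case first +: rest => Some(groupByNames(first, rest: _*))
  case _             => None // no keys supplied: the caller decides what to do
}
```

If the list of keys is built at runtime and may be empty, the pattern-match form avoids the exception that keys.head would throw.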
