How to use the agg method of Spark KeyValueGroupedDataset?


Problem Description

We have some code like this:

// think of class A as a table with two columns
case class A(property1: String, property2: Long)

// class B adds a column to class A
case class B(property1: String, property2: Long, property3: String)

df.as[A].map[B](a => {
      val my_udf = // some code here which creates a user defined function
      new B(a.property1, a.property2, my_udf(a))
    })

where df is a DataFrame. Next we want to create a Dataset of type C:

// we want to group objects of type B by properties 1 and 3 and compute the average of property2 and also want to store a count
case class C(property1: String, property3: String, average: Long, count: Long)

which we would have created in SQL like this:

select property1, property3, avg(property2), count(*) from B group by property1, property3

How can we do this in Spark? We are trying to use groupByKey, which gives a KeyValueGroupedDataset, together with agg, but we are unable to get it to work and can't figure out how to use agg.

Answer

If you have a Dataset of type B called ds_b, then you can do this with groupBy.agg:

// needs import spark.implicits._ (for $) and import org.apache.spark.sql.functions._
ds_b.groupBy("property1", "property3").agg(count($"property2").as("count"),
                                           avg($"property2").as("average"))
