How to calculate sum and count in a single groupBy?
Question
Given the following DataFrame:
val client = Seq((1,"A",10),(2,"A",5),(3,"B",56)).toDF("ID","Categ","Amnt")
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
|  1|    A|  10|
|  2|    A|   5|
|  3|    B|  56|
+---+-----+----+
I would like to obtain the number of IDs and the total amount per category:
+-----+-----+---------+
|Categ|count|sum(Amnt)|
+-----+-----+---------+
|    B|    1|       56|
|    A|    2|       15|
+-----+-----+---------+
Is it possible to do the count and the sum without having to do a join, as I do here?
client.groupBy("Categ").count
  .join(client.withColumnRenamed("Categ","cat")
    .groupBy("cat")
    .sum("Amnt"), 'Categ === 'cat)
  .drop("cat")
Maybe something like this:
client.createOrReplaceTempView("client")
spark.sql("SELECT Categ, count(Categ), sum(Amnt) FROM client GROUP BY Categ").show()
Recommended answer
I'm giving a different example from yours:
// In 1.3.x, in order for the grouping column "department" to show up,
// it must be included explicitly as part of the agg function call.
df.groupBy("department").agg($"department", max("age"), sum("expense"))
// In 1.4+, grouping column "department" is included automatically.
df.groupBy("department").agg(max("age"), sum("expense"))
Applied to your DataFrame:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
val spark: SparkSession = SparkSession
  .builder.master("local")
  .appName("MyGroup")
  .getOrCreate()
import spark.implicits._
val client: DataFrame = spark.sparkContext.parallelize(
  Seq((1,"A",10),(2,"A",5),(3,"B",56))
).toDF("ID","Categ","Amnt")
client.groupBy("Categ").agg(sum("Amnt"),count("ID")).show()
Output:
+-----+---------+---------+
|Categ|sum(Amnt)|count(ID)|
+-----+---------+---------+
|    B|       56|        1|
|    A|       15|        2|
+-----+---------+---------+
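If the column names and order should match the layout asked for in the question (count first, then sum(Amnt)), the same agg call can take aliased columns. The following is only a minimal sketch, assuming Spark 2.x or later is on the classpath; the object name GroupBySumCount and the helper sumAndCount are hypothetical names chosen for illustration and do not come from the question or the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, sum}

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object GroupBySumCount {

  // Hypothetical helper: one groupBy, two aggregates, no join.
  // Assumes the input has the columns "ID", "Categ" and "Amnt".
  def sumAndCount(df: DataFrame): DataFrame =
    df.groupBy("Categ")
      .agg(
        count("ID").as("count"), // rename count(ID) to "count"
        sum("Amnt")              // default column name is "sum(Amnt)"
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[1]")
      .appName("GroupBySumCount")
      .getOrCreate()
    import spark.implicits._

    val client = Seq((1, "A", 10), (2, "A", 5), (3, "B", 56))
      .toDF("ID", "Categ", "Amnt")

    // Row order of the grouped result is not guaranteed.
    sumAndCount(client).show()

    spark.stop()
  }
}
```

Putting both aggregates in a single agg call means the data is grouped only once, which is what makes the join in the question unnecessary.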