How to slice and sum elements of array column?
Question
I would like to sum (or perform other aggregate functions too) on the array column using SparkSQL.

I have a table:
+-------+-------+---------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+---------------------------------+
| 10|Finance| [100, 200, 300, 400, 500]|
| 20| IT| [10, 20, 50, 100]|
+-------+-------+---------------------------------+
I would like to sum the values of this emp_details column.

Expected query:
sqlContext.sql("select sum(emp_details) from mytable").show
Expected result:
1500
180
Also I should be able to sum on the range elements too, like:
sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show
Result:
600
80
When doing sum on the Array type, as expected it shows that sum expects the argument to be of numeric type, not array type.

I think we need to create a UDF for this. But how?

Will I be facing any performance hits with UDFs? And is there any other solution apart from the UDF one?
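For context, the UDF route the question asks about could be sketched as follows. The per-row logic is plain Scala; the Spark registration is shown only as a hedged comment, and `sum_array` is an assumed name, not a built-in function:

```scala
// Sketch of the UDF approach the question asks about.
// The per-row logic is just a plain Scala function:
val sumArray: Seq[Int] => Int = _.sum

// Registering and using it in Spark (1.6/2.x) would look roughly like
// (sum_array is an assumed name):
//   sqlContext.udf.register("sum_array", sumArray)
//   sqlContext.sql("select sum_array(emp_details) from mytable").show
```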
Answer
Spark 2.4.0

As of Spark 2.4, Spark SQL supports higher-order functions to manipulate complex data structures, including arrays.

The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
You can find a good read on higher-order functions in the following articles and video:

- Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
- Working with Nested Data Using Higher Order Functions in SQL on Databricks
- An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Spark 2.3.2 and earlier

DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto the JVM (from memory regions that are managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence make performance worse.
One solution would be to use a Dataset-based approach, where the combination of Spark SQL and Scala could show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
toDF("dept_id", "dept_nm", "sum").
show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
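If you would rather not be "too lazy for a case class", the same map can be sketched with one; `Emp` is an assumed name, and the Spark-specific call is shown only as a comment:

```scala
// Case-class variant of the map above (Emp is an assumed name):
case class Emp(dept_id: Long, dept_nm: String, emp_details: Seq[Int])

// In Spark this would be:
//   inventory.as[Emp].map(e => (e.dept_id, e.dept_nm, e.emp_details.sum))
// The per-row logic itself needs no Spark:
def rowSum(e: Emp): Int = e.emp_details.sum
```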
I'm leaving the slice part as an exercise as it's equally simple.
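For the Spark 2.4+ route, that exercise could be sketched by combining slice with aggregate. Note that slice in Spark SQL is 1-based, so the question's slice(emp_details,0,3) becomes slice(emp_details, 1, 3) for the first three elements; the plain-Scala check below verifies the same arithmetic without Spark:

```scala
// Spark 2.4+ SQL sketch (slice is 1-based in Spark SQL):
val sliceSumSql =
  """select dept_id, dept_nm,
    |       aggregate(slice(emp_details, 1, 3), 0, (acc, x) -> acc + x) as sum
    |from mytable""".stripMargin
// sql(sliceSumSql).show would be expected to yield 600 and 80.

// The same arithmetic on the sample rows, without Spark:
val sums = Seq(
  Seq(100, 200, 300, 400, 500),
  Seq(10, 20, 50, 100)
).map(_.take(3).sum)
```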