How to slice and sum elements of array column?


Question

I would like to sum (or perform other aggregate functions too) on the array column using SparkSQL.

I have a table:

+-------+-------+---------------------------------+
|dept_id|dept_nm|                      emp_details|
+-------+-------+---------------------------------+
|     10|Finance|        [100, 200, 300, 400, 500]|
|     20|     IT|                [10, 20, 50, 100]|
+-------+-------+---------------------------------+

I would like to sum the values of this emp_details column.

Expected query:

sqlContext.sql("select sum(emp_details) from mytable").show

Expected result:

1500
180

Also, I should be able to sum over a range of elements too, like:

sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show

Result:

600
80

When doing sum on the Array type, it shows, as expected, that sum expects its argument to be of numeric type, not array type.

I think we need to create a UDF for this. But how?

Will I be facing any performance hits with UDFs? And is there any other solution apart from the UDF one?

Solution

Spark 2.4.0

As of Spark 2.4, Spark SQL supports higher-order functions that are designed to manipulate complex data structures, including arrays.

The "modern" solution would be as follows:

scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details              |
+-------+-------+-------------------------+
|10     |Finance|[100, 200, 300, 400, 500]|
|20     |IT     |[10, 20, 50, 100]        |
+-------+-------+-------------------------+

input.createOrReplaceTempView("mytable")

val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
|     10|Finance|1500|
|     20|     IT| 180|
+-------+-------+----+
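
The range sum from the question can be expressed the same way by combining the built-in slice function with aggregate. A minimal sketch, not part of the original answer; note that slice in Spark SQL uses a 1-based start index, so the first three elements are slice(emp_details, 1, 3), and it should yield the 600 and 80 expected in the question:

val sliceSumText = "select dept_id, dept_nm, aggregate(slice(emp_details, 1, 3), 0, (acc, x) -> acc + x) as slice_sum from mytable"
scala> sql(sliceSumText).show
+-------+-------+---------+
|dept_id|dept_nm|slice_sum|
+-------+-------+---------+
|     10|Finance|      600|
|     20|     IT|       80|
+-------+-------+---------+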

You can find a good read on higher-order functions in the following articles and video:

  1. Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
  2. Working with Nested Data Using Higher Order Functions in SQL on Databricks
  3. An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)

Spark 2.3.2 and earlier

DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto the JVM (from memory regions that are managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence make performance worse.

One solution would be to use a Dataset-based approach, where the combination of Spark SQL and Scala can show its power.

scala> val inventory = Seq(
     |   (10, "Finance", Seq(100, 200, 300, 400, 500)),
     |   (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]

// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
  map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
  toDF("dept_id", "dept_nm", "sum").
  show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
|     10|Finance|1500|
|     20|     IT| 180|
+-------+-------+----+

I'm leaving the slice part as an exercise as it's equally simple.
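
For completeness, a minimal sketch of that slice part (not from the original answer): the deserialized value is a plain Scala Seq, so its 0-based slice(from, until) method can take the first three elements before summing.

scala> inventory.as[(Long, String, Seq[Int])].
  map { case (deptId, deptName, details) => (deptId, deptName, details.slice(0, 3).sum) }.
  toDF("dept_id", "dept_nm", "slice_sum").
  show
+-------+-------+---------+
|dept_id|dept_nm|slice_sum|
+-------+-------+---------+
|     10|Finance|      600|
|     20|     IT|       80|
+-------+-------+---------+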
