How to slice and sum elements of array column?


Problem description

I would like to sum (or perform other aggregate functions too) on the array column using SparkSQL.

I have a table

+-------+-------+---------------------------------+
|dept_id|dept_nm|                      emp_details|
+-------+-------+---------------------------------+
|     10|Finance|        [100, 200, 300, 400, 500]|
|     20|     IT|                [10, 20, 50, 100]|
+-------+-------+---------------------------------+

I want to sum the values of the emp_details column.

Expected query:

sqlContext.sql("select sum(emp_details) from mytable").show

Expected result:

1500
180

I should also be able to sum over a range of the elements, like:

sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show

Result:

600
80

When doing sum on the array type, as expected, it reports that sum expects the argument to be of numeric type, not array type.

I think we need to create a UDF for this. But how?

Will I be facing any performance hits with UDFs? And is there any other solution apart from the UDF one?
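
For reference, a minimal UDF-based sketch is shown below (assumptions: a SparkSession named spark, the table registered as mytable, and the function names sum_array and sum_slice, which are made up for illustration); the recommended answer explains why built-in alternatives are usually preferable:

// Register plain Scala functions as UDFs (hypothetical names, for illustration only).
// UDFs are opaque to the Catalyst optimizer, so built-in functions are generally preferred.
spark.udf.register("sum_array", (xs: Seq[Int]) => xs.sum)
spark.udf.register("sum_slice", (xs: Seq[Int], from: Int, len: Int) => xs.slice(from, from + len).sum)

spark.sql("select dept_id, sum_array(emp_details) as total from mytable").show
spark.sql("select dept_id, sum_slice(emp_details, 0, 3) as partial from mytable").show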

Recommended answer

Spark 2.4.0

As of Spark 2.4, Spark SQL supports higher-order functions for manipulating complex data structures, including arrays.

现代"解决方案如下:

scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details              |
+-------+-------+-------------------------+
|10     |Finance|[100, 200, 300, 400, 500]|
|20     |IT     |[10, 20, 50, 100]        |
+-------+-------+-------------------------+

input.createOrReplaceTempView("mytable")

val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
|     10|Finance|1500|
|     20|     IT| 180|
+-------+-------+----+
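
The ranged sum from the question can be written the same way by combining aggregate with the built-in slice function (also available since 2.4); note that slice takes a 1-based start index, unlike the 0-based indexing used in the question. A sketch (the sum3 alias is made up for illustration):

val sliceSqlText = "select dept_id, dept_nm, aggregate(slice(emp_details, 1, 3), 0, (acc, value) -> acc + value) as sum3 from mytable"
scala> sql(sliceSqlText).show

With the sample data this should give 600 for Finance and 80 for IT, matching the expected output in the question.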

You can find a good read on higher-order functions in the following articles and video:

  1. Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
  2. Working with Nested Data Using Higher Order Functions in SQL on Databricks
  3. An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)

Spark 2.3.2 and earlier

DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto the JVM heap (from memory regions managed by Spark outside the JVM). That will inevitably lead to more frequent GCs and hence make performance worse.

One solution would be a Dataset-based approach, where the combination of Spark SQL and Scala can show its power.

scala> val inventory = Seq(
     |   (10, "Finance", Seq(100, 200, 300, 400, 500)),
     |   (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]

// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
  map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
  toDF("dept_id", "dept_nm", "sum").
  show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
|     10|Finance|1500|
|     20|     IT| 180|
+-------+-------+----+

I'm leaving the slice part as an exercise as it's equally simple.
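
For completeness, a sketch of that exercise follows, using the same approach of slicing the Scala collection before summing (take(3) grabs the first three elements):

// Sketch: sum only the first three elements of emp_details per row.
inventory.as[(Long, String, Seq[Int])].
  map { case (deptId, deptName, details) => (deptId, deptName, details.take(3).sum) }.
  toDF("dept_id", "dept_nm", "sum").
  show

With the sample data this should show 600 for Finance and 80 for IT.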
