Spark SQL - Generate array of arrays from the sql function

Problem description

I want to create an array of arrays. This is my data table:

// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)

// Create an RDD with some data
val x = sc.parallelize(Array(
    Testing(null, 21, 905),
    Testing("Noelia", 26, 1130),
    Testing("Pilar", 52,  1890),
    Testing("Roberto", 31, 1450)
 ))

// Convert RDD to a DataFrame 
val df = sqlContext.createDataFrame(x) 

// For SQL usage we need to register the table
df.registerTempTable("df")

I want to collect the integer column "age" into an array. For that I use "collect_list":

sqlContext.sql("SELECT collect_list(age) as age from df").show

But now I want to generate an array that contains several arrays like the one created above:

 sqlContext.sql("SELECT collect_list(collect_list(age), collect_list(salary)) as arrayInt from df").show

But this does not work, and neither did my attempt with the function org.apache.spark.sql.functions.array. Any ideas?

Recommended answer

OK, it doesn't get much simpler than this. Let's take the same data you are working with and go step by step from there:

// A case class for our sample table
case class Testing(name: String, age: Int, salary: Int)

// Create an RDD with some data
val x = sc.parallelize(Array(
  Testing(null, 21, 905),
  Testing("Noelia", 26, 1130),
  Testing("Pilar", 52, 1890),
  Testing("Roberto", 31, 1450)
))

// Convert RDD to a DataFrame
val df = sqlContext.createDataFrame(x)

// For SQL usage we need to register the table
df.registerTempTable("df")
sqlContext.sql("select collect_list(age) as age from df").show

// +----------------+
// |             age|
// +----------------+
// |[21, 26, 52, 31]|
// +----------------+

sqlContext.sql("select collect_list(collect_list(age),     collect_list(salary)) as arrayInt from df").show

This fails, and the error message says why:

org.apache.spark.sql.AnalysisException: No handler for Hive udf class
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Exactly one argument is expected..; line 1 pos 52 [...]

collect_list takes just one argument. Let's check the documentation.

It does indeed take a single argument! But let's go further in the documentation of the functions object. You seem to have noticed that the array function lets you create a new array column from one or more Column arguments. So let's use that:

sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df").show(false)

The array function does indeed build a single column from the columns that collect_list created beforehand for age and salary:

// +-------------------------------------------------------------------+
// |arrayInt                                                           |
// +-------------------------------------------------------------------+
// |[WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450)]|
// +-------------------------------------------------------------------+
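The question also mentioned org.apache.spark.sql.functions.array, so here is a minimal sketch of the same aggregation written with the DataFrame API instead of a SQL string. It assumes the `df` built above is in scope in a spark-shell session, and that `collect_list` is available in your context (the error message above suggests a Hive-backed one). The name `nested` is only illustrative.

```scala
import org.apache.spark.sql.functions.{array, collect_list}

// Assumes the DataFrame `df` built above (columns: name, age, salary).
// Each collect_list aggregates one column into an array; array(...) then
// wraps the two resulting array columns into a single array-of-arrays column.
val nested = df.agg(
  array(collect_list("age"), collect_list("salary")).as("arrayInt")
)

nested.printSchema()
nested.show(false)
```

The order of the elements that collect_list returns is not guaranteed in general, so treat the ordering in the printed output as incidental.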

Where do we go from here?

Remember that a Row from a DataFrame is just another collection, wrapped in a Row.

The first thing I'll do is work on that collection. So how do we flatten a WrappedArray[WrappedArray[Int]]?

Scala is kind of magical: you just need to use .flatten

import scala.collection.mutable.WrappedArray

// Take the single result row and cast its first column to the nested array type
val firstRow: WrappedArray[WrappedArray[Int]] =
  sqlContext.sql("select array(collect_list(age), collect_list(salary)) as arrayInt from df")
    .first.get(0).asInstanceOf[WrappedArray[WrappedArray[Int]]]
// firstRow: scala.collection.mutable.WrappedArray[scala.collection.mutable.WrappedArray[Int]] =
// WrappedArray(WrappedArray(21, 26, 52, 31), WrappedArray(905, 1130, 1890, 1450))

// Concatenate the inner arrays into one flat sequence
firstRow.flatten
// res27: scala.collection.mutable.IndexedSeq[Int] = ArrayBuffer(21, 26, 52, 31, 905, 1130, 1890, 1450)
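To see what .flatten does without Spark in the picture, here is a small plain-Scala sketch on the same nested shape. The values mirror the sample data; the names `nested`, `flat` and `sameThing` are only illustrative.

```scala
// Plain Scala, no Spark required: the same nested shape as the arrayInt column.
val nested: Seq[Seq[Int]] = Seq(Seq(21, 26, 52, 31), Seq(905, 1130, 1890, 1450))

// flatten concatenates the inner sequences in order
val flat: Seq[Int] = nested.flatten

// flatMap(identity) is an equivalent spelling of the same operation
val sameThing: Seq[Int] = nested.flatMap(identity)

println(flat) // List(21, 26, 52, 31, 905, 1130, 1890, 1450)
```

Because WrappedArray is itself a Seq, the same call works on the value pulled out of the Row.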

Now let's wrap it in a UDF so we can use it on the DataFrame:

def flatten(array: WrappedArray[WrappedArray[Int]]) = array.flatten
sqlContext.udf.register("flatten", flatten(_: WrappedArray[WrappedArray[Int]]))
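If you prefer the DataFrame API over SQL strings, the same idea can be sketched with org.apache.spark.sql.functions.udf. This assumes the `df` from above is in scope; `flattenInts` is a hypothetical name, and the parameter is declared as Seq[Seq[Int]] rather than WrappedArray on the assumption that you would rather not depend on the concrete collection class Spark hands to the function.

```scala
import org.apache.spark.sql.functions.{array, collect_list, udf}

// Hypothetical helper name. Spark passes array columns to Scala UDFs as Seq,
// so Seq[Seq[Int]] accepts the nested array produced by array(collect_list(...)).
val flattenInts = udf((xs: Seq[Seq[Int]]) => xs.flatten)

// Assumes the DataFrame `df` built above (columns: name, age, salary).
val flat = df.agg(
  flattenInts(array(collect_list("age"), collect_list("salary"))).as("arrayInt")
)

flat.show(false)
```

Unlike sqlContext.udf.register, a UDF created this way is a plain value and is not visible by name inside SQL queries.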

Since we registered the UDF, we can now use it in queries run through the sqlContext:

sqlContext.sql("select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from df").show(false)

// +---------------------------------------+
// |arrayInt                               |
// +---------------------------------------+
// |[21, 26, 52, 31, 905, 1130, 1890, 1450]|
// +---------------------------------------+
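On newer versions the UDF may not be needed at all: Spark 2.4 and later ship a built-in `flatten` SQL function for arrays of arrays. The sketch below assumes such a version and uses a local SparkSession. The session settings, the view name `people` and the variable names are illustrative choices, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.4 or later, where `flatten` is a built-in SQL function.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("flatten-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample rows as above, built from tuples instead of the case class
val people = Seq(
  (null: String, 21, 905),
  ("Noelia", 26, 1130),
  ("Pilar", 52, 1890),
  ("Roberto", 31, 1450)
).toDF("name", "age", "salary")

// createOrReplaceTempView is the Spark 2.x replacement for registerTempTable
people.createOrReplaceTempView("people")

val flattened = spark.sql(
  "select flatten(array(collect_list(age), collect_list(salary))) as arrayInt from people"
)
flattened.show(false)
```

No UDF is registered here, so the call resolves to the built-in function.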

I hope this helps!
