How to unnest array with keys to join on afterwards?

Problem description

I have two tables, table1 and table2. table1 is big, whereas table2 is small. I also have a UDF whose interface is defined below:

--table1--
id
1
2
3

--table2--
category
a
b
c
d
e
f
g

UDF: foo(id: Int): List[String]

I intend to call the UDF first to get the corresponding categories: foo(table1.id), which returns a WrappedArray. Then I want to join every category with table2 for further manipulation. The expected result should look like this:

--view--

id,category
1,a
1,c
1,d
2,b
2,c
3,e
3,f
3,g

I tried to find an unnest method in Hive, but with no luck. Could anyone help me out? Thanks!

Recommended answer

I believe that you want to use the explode function or Dataset's flatMap operator.

The explode function creates a new row for each element in the given array or map column.

The flatMap operator returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
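The flattening behaviour is easy to see on plain Scala collections, which is a rough analogue of what Dataset's flatMap does per row. The sketch below does not use Spark at all; `categoriesById` and the body of `foo` are hypothetical stand-ins for the UDF from the question, chosen so that the result matches the expected view.

```scala
object FlatMapSketch {
  // Hypothetical stand-in for the question's UDF foo(id: Int): List[String].
  // The mapping is invented to reproduce the expected view from the question.
  val categoriesById: Map[Int, List[String]] = Map(
    1 -> List("a", "c", "d"),
    2 -> List("b", "c"),
    3 -> List("e", "f", "g")
  )

  def foo(id: Int): List[String] = categoriesById.getOrElse(id, Nil)

  val ids: List[Int] = List(1, 2, 3)

  // flatMap applies the function to every id and flattens the resulting lists,
  // yielding one (id, category) pair per element of each list.
  val pairs: List[(Int, String)] =
    ids.flatMap(id => foo(id).map(category => (id, category)))

  def main(args: Array[String]): Unit =
    pairs.foreach { case (id, category) => println(s"$id,$category") }
}
```

Keeping the id next to each category inside the mapped function is what preserves the key for the later join.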

After you execute your UDF foo(id: Int): List[String], you'll end up with a Dataset with a column of type array.

// Example UDF standing in for foo: returns the letters from 'a' through ('a' + id)
val fooUDF = udf { id: Int => ('a' to ('a'.toInt + id).toChar).map(_.toString) }

// table1 with fooUDF applied
val table1 = spark.range(3).withColumn("foo", fooUDF('id))

scala> table1.show
+---+---------+
| id|      foo|
+---+---------+
|  0|      [a]|
|  1|   [a, b]|
|  2|[a, b, c]|
+---+---------+

scala> table1.printSchema
root
 |-- id: long (nullable = false)
 |-- foo: array (nullable = true)
 |    |-- element: string (containsNull = true)

scala> table1.withColumn("fooExploded", explode($"foo")).show
+---+---------+-----------+
| id|      foo|fooExploded|
+---+---------+-----------+
|  0|      [a]|          a|
|  1|   [a, b]|          a|
|  1|   [a, b]|          b|
|  2|[a, b, c]|          a|
|  2|[a, b, c]|          b|
|  2|[a, b, c]|          c|
+---+---------+-----------+

With that, join should be quite easy.
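As a minimal sketch of that last step, assuming Spark is on the classpath and a local session is acceptable, the exploded column can be joined against table2 directly. The lookup map inside the UDF is a hypothetical stand-in for the real foo, and the `broadcast` hint is an optional choice based only on the question's remark that table2 is small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, explode, udf}

object ExplodeThenJoin {
  // Hypothetical stand-in for the question's foo(id: Int): List[String].
  private val categoriesById: Map[Int, Seq[String]] = Map(
    1 -> Seq("a", "c", "d"),
    2 -> Seq("b", "c"),
    3 -> Seq("e", "f", "g")
  )

  def explodeAndJoin(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Small in-memory versions of the tables from the question.
    val table1 = Seq(1, 2, 3).toDF("id")
    val table2 = Seq("a", "b", "c", "d", "e", "f", "g").toDF("category")

    // Local copy so the UDF closure captures only the Map.
    val lookup = categoriesById
    val fooUDF = udf((id: Int) => lookup.getOrElse(id, Seq.empty[String]))

    // One row per (id, category): explode turns each array element into a row.
    val exploded = table1.select($"id", explode(fooUDF($"id")).as("category"))

    // Inner join on the exploded key; broadcasting the small table is optional.
    exploded.join(broadcast(table2), Seq("category")).select("id", "category")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-then-join")
      .getOrCreate()
    try explodeAndJoin(spark).orderBy("id", "category").show()
    finally spark.stop()
  }
}
```

With the stand-in mapping above, every exploded category exists in table2, so the inner join keeps all eight (id, category) rows from the expected view. If some categories could be missing from table2, a left join would keep those rows instead of dropping them.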