Get data from array columns in Databricks without cross joining


Question

Suppose I have a table:

id array_col
101 [{"system":"x","value":"1"},{"system":"y","value":"2"},{"system":"z","value":"3"}]

where array_col contains an array of structs:

0: {"system": "x", "value": "1"}
1: {"system": "y", "value": "2"}
2: {"system": "z", "value": "3"}

I need output like the following table:

id system value
101 x 1
101 y 2
101 z 3

Right now I'm using explode in subqueries (since a single select statement can't contain multiple explode calls) and then joining the results on id. But that produces output where every system is paired with every value, so I get 9 rows instead of 3:

id system value
101 x 1
101 x 2
101 x 3
101 y 1
101 y 2
101 y 3
101 z 1
101 z 2
101 z 3

How can I get the output with 3 rows instead of 9?

Recommended answer

Try inline:

df.selectExpr('id', 'inline(array_col)').show()
+---+------+-----+
| id|system|value|
+---+------+-----+
|101|     x|    1|
|101|     y|    2|
|101|     z|    3|
+---+------+-----+

The above assumes that the array contains structs, not structs serialized as strings. If your structs are strings, you need to parse them with from_json first:

df2 = df.selectExpr(
    'id', 'explode(array_col) array_col'
).selectExpr(
    'id', "inline(array(from_json(array_col, 'struct<system:string, value:string>')))"
)

df2.show()
+---+------+-----+
| id|system|value|
+---+------+-----+
|101|     x|    1|
|101|     y|    2|
|101|     z|    3|
+---+------+-----+
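
To see why the join in the question yields 9 rows while inline yields 3, here is a rough plain-Python model of the two approaches. It is only a sketch of the row counts, not Spark code; the variable names (row, systems, values, joined, inlined) are made up for illustration.

```python
# Plain-Python model of the row counts; this is not Spark code.
row = {
    "id": 101,
    "array_col": [
        {"system": "x", "value": "1"},
        {"system": "y", "value": "2"},
        {"system": "z", "value": "3"},
    ],
}

# Approach from the question: explode each field separately, then join on id.
systems = [(row["id"], s["system"]) for s in row["array_col"]]
values = [(row["id"], s["value"]) for s in row["array_col"]]

# Every exploded row carries the same id, so the join pairs each system
# with each value: 3 x 3 combinations.
joined = [
    (sid, system, value)
    for sid, system in systems
    for vid, value in values
    if sid == vid
]

# What inline does conceptually: one output row per struct,
# with the fields of that struct kept together.
inlined = [(row["id"], s["system"], s["value"]) for s in row["array_col"]]

print(len(joined))   # 9
print(len(inlined))  # 3
```

Because id is the only join key and it is identical on every exploded row, the join has no way to tell which system belongs to which value. Expanding each struct in a single step keeps the pairing intact.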
