连接同一表中以定界符分隔的列 [英] join columns separated by delimiter in same table
问题描述
我有以下数据集
color_code fav_color_code color_code_name fav_color_name
1|2 5 blue|white black
3|4 7|9 green|red pink|yellow
我需要将
的第一值 color_code
加入到 color_code_name
的第一值和 color_code
到 color_code_name
等的第二个值。
I need to join
first value of color_code
to first value of color_code_name
and second value of color_code
to second value of color_code_name
etc..
code color
1 blue
2 white
5 black
3 green
4 red
7 pink
9 yellow
我正在使用以下代码,但是由于没有<$,所以它正在交叉连接c $ c> id 到 join
之后。如果我映射2列而不是多列,则此代码有效。
I am using the below code, but it is doing cross join since I dont have an id
to join
upon. This code work if I am mapping 2 columns but not multiple columns.
有人可以帮助我获得预期的结果吗?
Could someone help me to get the expected result?
SELECT
t1.code AS code,
t2.color AS color,
FROM
(
SELECT
c.value :: varchar AS code,
row_number() over(
order by
code
) AS rownum
FROM
table,
lateral flatten (
input => split(color_code, '|')
) c
UNION
SELECT
d.value :: varchar AS code,
row_number() OVER(
ORDER BY
code
) AS rownum
FROM
table,
lateral flatten (
input => split(fav_color_code, '|')
) d
) t1
JOIN (
SELECT
f.value :: varchar AS color,
row_number() OVER(
ORDER BY
color
) AS rownum
FROM
table,
lateral flatten (
input => split(color_code_name, '|')
) f
UNION
SELECT
g.value :: varchar AS color,
row_number() OVER(
ORDER BY
color
) AS rownum
FROM
table,
lateral flatten (
input => split(fav_color_name, '|')
) g
) t2 ON (t1.rownum = t2.rownum)
ORDER BY
t1.color
推荐答案
您可以按照以下步骤进行此方法以进行解释,因为我认为一步之遥就是一团糟。
You can follow this approach in several steps for explanation purposes, because I think in one step is a mess.
注意:
代码解决方案是在 Hive
(帖子
没有指定任何 sql-query-engine
),但是在 Hive
中,几乎所有内容都 sql-standard
。
NOTE:
The code solution is done in Hive
(the post
doesn't specify any sql-query-engine
), but in Hive
almost everything is sql-standard
.
原始数据
+--------------------+------------------------+-------------------------+------------------------+--+
| colors.color_code | colors.fav_color_code | colors.color_code_name | colors.fav_color_name |
+--------------------+------------------------+-------------------------+------------------------+--+
| 1|2 | 5 | blue|white | black |
| 3|4 | 7|9 | green|red | pink|yellow |
+--------------------+------------------------+-------------------------+------------------------+--
首先,我们创建一个 temp表
,其中颜色ID
,其中我们将 code列
, split
列连接在一起放入数组
,然后爆炸
数组
并用行号
First we create a temp table
with color ids
where we concatenate the code columns
, split
the column into an array
and then explode
the array
with a rownumber
CREATE TABLE tc1 AS
SELECT ROW_NUMBER() OVER() AS rownum, CAST(color_id AS INT) as color_id
FROM colors
LATERAL VIEW EXPLODE(SPLIT(CONCAT(color_code,'|', fav_color_code),'\\|')) a1 AS color_id;
我们用 temp表
>颜色名称,我们按照以前的方法进行操作,但是现在我们将 color_name
列,分割
将列放入数组
,然后展开
数组
使用行数
We create a second temp table
with color names
and we follow the approach as before but now we concatenate the color_name
columns, split
the column into an array
and then explode
the array
with a rownumber
CREATE TABLE tc2 AS
SELECT ROW_NUMBER() OVER() AS rownum, color_name
FROM colors
LATERAL VIEW EXPLODE(SPLIT(CONCAT(color_code_name,'|', fav_color_name),'\\|')) a1 AS color_name;
我们通过 rownum
join
临时表
we join
temp tables by rownum
SELECT color_id, color_name
FROM tc1
JOIN tc2 ON(tc1.rownum = tc2.rownum)
ORDER BY color_id;
预期输出
+-----------+-------------+--+
| color_id | color_name |
+-----------+-------------+--+
| 1 | blue |
| 2 | white |
| 3 | green |
| 4 | red |
| 5 | black |
| 7 | pink |
| 9 | yellow |
+-----------+-------------+--+
虽然不是轻量级查询,但一次执行相同的操作
Doing the same thing in one shot, though is not a light-weight query
SELECT tc1.color_id, tc2.color_name
FROM (SELECT ROW_NUMBER() OVER() AS rownum, CAST(color_id AS INT) as color_id
FROM colors
LATERAL VIEW EXPLODE(SPLIT(CONCAT(color_code,'|', fav_color_code),'\\|')) a1 AS color_id) AS tc1
JOIN (SELECT ROW_NUMBER() OVER() AS rownum, color_name
FROM colors
LATERAL VIEW EXPLODE(SPLIT(CONCAT(color_code_name,'|', fav_color_name),'\\|')) a1 AS color_name) AS tc2
ON(tc1.rownum = tc2.rownum)
ORDER BY tc1.color_id;
预期输出
+---------------+-----------------+--+
| tc1.color_id | tc2.color_name |
+---------------+-----------------+--+
| 1 | blue |
| 2 | white |
| 3 | green |
| 4 | red |
| 5 | black |
| 7 | pink |
| 9 | yellow |
+---------------+-----------------+--+
这篇关于连接同一表中以定界符分隔的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!