使用HiveQL(Hadoop)在Hive中加入两个表 [英] Joining two Tables in Hive using HiveQL(Hadoop)
问题描述
可能存在重复:
SQL查询JOIN与表格
CREATE EXTERNAL TABLE IF NOT EXESTTS TestingTable1(这是通过它进行比较的MAIN表)
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)
这就是上面第一张表中的数据
** BUYER_ID ** | ** ITEM_ID ** | ** CREATED_TIME **
-------------- + ------------------ + ------- ------------------
1015826235 220003038067 * 2001-11-03 19:40:21 *
1015826235 300003861266 2001-11-08 18:19 :59
1015826235 140002997245 2003-08-22 09:23:17
1015826235 * 210002448035 * 2001-11-11 22:21:11
这是Hive中的第二张表 - 它还包含有关我们正在购买的物品的信息。
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY< STRUCT< PRODUCT_ID:BIGINT,TIMESTAMPS:STRING>>
)
这就是上面第二个表中的数据( TestingTable2 $
** USER_ID ** ** PURCHASED_ITEM **
1015826235 [{product_id :220003038067,timestamps:1004941621},{product_id:300003861266,timestamps:1005268799},{product_id:140002997245times tamps:1061569397},{product_id:200002448035,timestamps:1005542471}]
将 TestingTable2
与 TestingTable1
进行比较,以满足以下情况。
从找到
与TestingTable1中的 PRODUCT_ID
AND TIMESTAMPS
TestingTable2 ITEM_ID
AND CREATED_TIME
不匹配 TestingTable1
进行比较后。
TestingTable2
数据this(last) ITEM_ID 210002448035 from
TestingTable1
is不符合 TestingTable2
PRODUCT_ID- 200002448035
数据,并且与时间戳相似。所以我想用HiveQL查询来显示下面的结果。
** BUYER_ID ** | ** ITEM_ID ** | ** CREATED_TIME ** | ** PRODUCT_ID ** | ** TIMESTAMPS **
-------------- + ------------------ + ------- ------------------------- + ------------------------ + ----------------------
1015826235 * 210002448035 * 2001-11-11 22:21:11 200002448035 1005542471
1015826235 220003038067 * 2001-11-03 19:40:21 * 220003038067 1004941621
任何人都可以帮助我解决这个问题。因为我是HiveQL的新手,所以有很多问题。
更新: -
<我已经写了这个查询,但它不按我想要的方式工作。
select * from
(选择* from
(选择user_id,prod_and_ts.product_id作为product_id,prod_and_ts.timestamps作为时间戳
from testingtable2 LATERAL VIEW
explode(purchased_item)exploded_table作为prod_and_ts)
prod_and_ts
LEFT OUTER JOIN testingtable1
ON(prod_and_ts.user_id = testingtable1.buyer_id AND testingtable1.item_id = prod_and_ts.product_id
AND prod_and_ts.timestamps = UNIX_TIMESTAMP(testingtable1.created_time)
)
where testingtable1.buyer_id IS NULL)
set_a LEFT OUTER JOIN testingtable1
ON(set_a.user_id = testingtable1.buyer_id AND
(set_a.product_id = testingtable1.item_id或set_a.timestamps = UNIX_TIMESTAMP (testingtable1.created_time))
) ;
更多更新
根据 user1166147
评论。我按照他的查询编写了我的查询。
在配置单元中,我猜< JOIN
。
$ b
select * from(select t2.buyer_id,t2.item_id, t2.created_time as created_time,subq.user_id,subq.product_id,subq.timestamps as timestamps
from
(选择user_id,prod_and_ts.product_id作为product_id,prod_and_ts.timestamps作为来自testingtable2横向视图爆炸的时间戳(purchased_item )exploded_table as prod_and_ts)subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id
AND subq.timestamps = unix_timestamp(t2.created_time)
WHERE(subq.product_id<> t2.item_id)
union all
select t2.buyer_id,t2.item_id as item_id,t2.created_time,subq.user_id,subq.product_id as product_id,subq.timestamps
from
(select user_id, prod_and_ts.product_id作为product_id,prod_and_ts.timestamps作为来自testingtable2侧视图的时间戳explode(purchased_item)exploded_table作为prod_and_ts)subq JO IN testingtable1 t2 on t2.buyer_id = subq.user_id
and subq.product_id = t2.item_id
WHERE(subq.timestamps<> unix_timestamp(t2.created_time)))unionall;
运行上述查询后,我得到零结果。
最后一次更新: -
我不好,我没有准确的数据这是我没有得到结果的原因。是的,它正在实际上面的查询。
编辑 - 第1部分
好的 - 出于某种原因,我要解释我自己 - 所以首先我因为SQL标签而偶然发现了这个问题,并且看到了Hive,并开始不去查看,只是跳过它。但是,我发现它已经过了一天,你没有得到答案。我看了一下 - 我发现原来的查询中有一个SQL逻辑更正,我知道这是需要的,会有所帮助,所以我只发布了,因为没有人回答。我会尽力解决这个最后的问题 - 但之后,我会向自己保留我的建议,因为我可能会提出不好的建议。祝你好运!我试过了!现在你似乎正在得到答案,所以......
在TSQL中,我可以通过下面的单个查询来解决整个问题:
pre $ SELECT *
FROM SO_Table1HIVE A
全外连接SO_Table2HIVE B ON A.BUYER_ID = B [ USER_ID] AND(B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
它将返回所有内容,包括您的匹配buyer_id / user_id。它不会与另一个表中的时间或产品中没有匹配的buyer_id / user_id行相匹配,但它会将其作为单独的行在其他表的字段中与NULLS一起返回。我不会以任何方式匹配 - 没有提供准确的信息来做到这一点,如下所述。
结束编辑第1部分
如果您不能使用 OR
执行 FULL OUTER JOIN
在Hive中,满足原始标准的最简单方法是使用 UNION ALL
2 INNER JOIN
s。在其中一个查询中,除了加入匹配的user_ids之外,加入PRODUCT_ID并在 WHERE
中查找与CREATED_TIME不匹配的TIMESTAMPS。在第二个查询中,除了加入匹配的user_ids之外,加入时间AND在 WHERE
中寻找不匹配的产品。
编辑第二部分 - 更新评论问题补充标准
最后一个标准是任何表中的任何记录都有匹配的user_id = buyer_id,但没有其他匹配的记录。带有 OR
条件的 FULL OUTER JOIN
会返回它们,但没有足够的提供信息来描述关联记录彼此。我们可以很容易地识别它们,但无法将它们绑回到对方。如果你这样做,并且你在两个表中都有一个以上的记录没有匹配,那么每个记录都会有多个条目。 我试图在没有更多信息的情况下尝试绑定它们的任何查询都可能是猜测并且不准确。
例如,在第一个表中,如果有这两个(示例假记录)记录在第二个表中除user_id外没有匹配:
1015826235 420003038067 2011-11-03 19:40:21.000
1015826235 720003038067 2004-11-03 19:40:21.000
AND在table2中 - 这些不匹配:
1015826235 {product_id:520003038067,timestamps:10 ...
1015826235 {product_id:620003038067,timestamps:10 ...
您可以识别它们,但是如果您在没有更多条件的情况下将它们匹配,您将得到4而不是2:
1015826235 420003038067 2011-11-03 19:40:21.000 1015826235 520003038067
1015826235 420003038067 2011-11-03 19:40:21.000 1015826235 620003038067
1015826235 720003038067 2004 -11-03 19:40:2 1.000 1015826235 520003038067
1015826235 720003038067 2004-11-03 19:40:21.000 1015826235 620003038067
BUYER_ID ITEM_ID CREATED_TIME USER_ID PRODUCTID时间戳
- -------------------------------------------------- -------------------
NULL NULL NULL 1015826235 520003038067 2009-11-11 22:21:11.000
NULL NULL NULL 1015826235 620003038067 2008-11 -11 22:21:11.000
1015826235 420003038067 2011-11-03 19:40:21.000 NULL NULL NULL
1015826235 720003038067 2004-11-03 19:40:21.000 NULL NULL NULL
结束编辑第2部分 - 更新评论问题附加标准 - 第1部分
我正在使用TSQL,所以我无法为您测试一个确切的曲线你的语法,但连接的概念是相同的,这将返回你想要的。我确实接受了您的查询并尝试您的语法,并根据需要进行修改。我使用TSQL进行测试。您可以通过HiveQL中的功能来完善它并进行改进。还有其他方法可以做到这一点 - 但这是最直接的,这将转化为HiveQL。
删除后,您已加入此部分,并在此后包含
$ b (根据需要再次修改语法)**
SELECT *
FROM(
)SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS $ b $ FROM FROM testingtable2横向视图
explode(purchased_item)exploded_table作为prod_and_ts )
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts。[USER_ID] AND prod_and_ts.timestamps = UNIX_TIMESTAMP(table2.created_time)
WHERE prod_and_ts.product_id<> A.ITEM_ID
UNION ALL
SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS
FROM testingtable2 LATERAL VIEW
explode(purchased_item)exploded_table作为prod_and_ts)
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts。[USER_ID] AND prod_and_ts.product_id = A.ITEM_ID
WHERE prod_and_ts.timestamps<> UNIX_TIMESTAMP(table2.created_time)
)X
这是我的测试过的TSQL版本与我的表名称作为参考:
SELECT *
FROM(
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B。[USER_ID] AND B.t1time = A.Created_TIME
WHERE B.PRODUCTID<> A.ITEM_ID
UNION ALL
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B. [USER_ID] AND B.PRODUCTID = A.ITEM_ID
WHERE B. t1time<> A.Created_TIME
)X
* 编辑第3部分在TSQL中,整个查询(没有联合)可以使用 FULL来运行。 OUTER JOIN
加入 OR
条件
SELECT *
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE BO N A.BUYER_ID = B. [USER_ID] AND(B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
如果不能简单地执行上述操作,对于新条件的SQL逻辑 - 从两个表中获取不匹配的SQL逻辑,并将它们显示为 TSQL示例 - 修改HIVE 或者,如果你想要抓住它们并将它们作为重复项添加到 TSQL 再次祝您好运! NULL code>在另一个表中使用
RIGHT JOIN
和 LEFT JOIN
。
RIGHT JOIN
将抓取第一个表中的所有内容,并匹配第二个表中的所有内容,并且 LEFT
相反。将新查询添加到 UNION
中。
SELECT *
FROM SO_Table1HIVE A
RIGHT JOIN SO_Table2HIVE B ON A.BUYER_ID = B. [USER_ID] AND(B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
WHERE A.BUYER_ID是NULL
UNION ALL
SELECT *
FROM SO_Table1HIVE A
LEFT JOIN SO_Table2HIVE B ON A.BUYER_ID = B。[USER_ID] AND(B.t1time = A.Created_TIME or B.PRODUCTID = A.ITEM_ID)
WHERE B. [USER_ID] IS NULL
UNION
:
SELECT *
FROM SO_Table1HIVE A
JOIN SO_Table2HIVE B ON A.BUYER_ID = B。[USER_ID]
WHERE B.t1time NOT IN(SELECT Created_TIME FROM SO_Table1HIVE)
AND A.Created_TIME NOT IN(SELECT t1time FROM SO_Table2HIVE)
AND B.PRODUCTID NOT IN(SELECT ITEM_ID FROM SO_Table1HIVE)
AND A.ITEM_ID NOT IN (从SO_Table2HIVE中选择产品)
Possible Duplicate:
SQL Query JOIN with Table
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable1 (This is the MAIN table through which comparisons need to be made)
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)
And this is the data in the above first table
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME**
--------------+------------------+-------------------------
1015826235 220003038067 *2001-11-03 19:40:21*
1015826235 300003861266 2001-11-08 18:19:59
1015826235 140002997245 2003-08-22 09:23:17
1015826235 *210002448035* 2001-11-11 22:21:11
This is Second table in Hive- It also contains information about the items we are purchasing.
CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
And this is the data in the above second table (TestingTable2
)-
**USER_ID** **PURCHASED_ITEM**
1015826235 [{"product_id":220003038067,"timestamps":"1004941621"}, {"product_id":300003861266,"timestamps":"1005268799"}, {"product_id":140002997245,"timestamps":"1061569397"},{"product_id":200002448035,"timestamps":"1005542471"}]
Compare TestingTable2
with TestingTable1
so that below scenario is fulfilled.
Find the PRODUCT_ID
AND TIMESTAMPS
from TestingTable2
WHICH IS NOT MATCHING WITH ITEM_ID
AND CREATED_TIME
from TestingTable1 CORRESPONDING TO BUYER_ID(USER_ID)
after comparing from TestingTable1
.
So If you look TestingTable2
data this(last) ITEM_ID 210002448035
from TestingTable1
is not matching with TestingTable2
PRODUCT_ID- 200002448035
data and similarly with timestamps. So I want to show the below result using the HiveQL query.
**BUYER_ID** | **ITEM_ID** | **CREATED_TIME** | **PRODUCT_ID** | **TIMESTAMPS**
--------------+------------------+--------------------------------+------------------------+----------------------
1015826235 *210002448035* 2001-11-11 22:21:11 200002448035 1005542471
1015826235 220003038067 *2001-11-03 19:40:21* 220003038067 1004941621
Can anyone help me with this. As I am new to HiveQL so having lot of Problem.
Updated:-
I have written this query, but it doesn't working the way I wanted to.
select * from
(select * from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps
from testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
LEFT OUTER JOIN testingtable1
ON ( prod_and_ts.user_id = testingtable1.buyer_id AND testingtable1.item_id = prod_and_ts.product_id
AND prod_and_ts.timestamps = UNIX_TIMESTAMP (testingtable1.created_time)
)
where testingtable1.buyer_id IS NULL)
set_a LEFT OUTER JOIN testingtable1
ON (set_a.user_id = testingtable1.buyer_id AND
( set_a.product_id = testingtable1.item_id OR set_a.timestamps = UNIX_TIMESTAMP(testingtable1.created_time) )
);
One More UPDATE
As per user1166147
comments. I wrote my query as per his query.
In hive I guess INNER JOIN
are written by simply JOIN
.
This is my below query.
select * from (select t2.buyer_id, t2.item_id, t2.created_time as created_time, subq.user_id, subq.product_id, subq.timestamps as timestamps
from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id
AND subq.timestamps = unix_timestamp(t2.created_time)
WHERE (subq.product_id <> t2.item_id)
union all
select t2.buyer_id, t2.item_id as item_id, t2.created_time, subq.user_id, subq.product_id as product_id, subq.timestamps
from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id
and subq.product_id = t2.item_id
WHERE (subq.timestamps <> unix_timestamp(t2.created_time))) unionall;
And after running the above query, I am getting zero result back.
One Final UPDATE:-
My Bad, I didn't have the accurate data in the tables so that is the reason I was not getting the result back. Yes it is working the actual above query.
EDIT - PART 1 Okay - For some reason I am going to explain myself - so to start with I stumbled upon this question because of the SQL tag, and saw Hive, and started to not look and just skip it. BUT then I noticed it had been over a day and you had gotten no answers. I looked - I saw a SQL logic correction in the original query posted that I knew would be needed and would help, so I posted ONLY because no one had answered. I will try to address this last question - but after that I am keeping my advice to myself, as I may be giving bad advice. Good luck! I tried! And you seem to be getting answers now, so...
In TSQL, I could solve this entire problem with the below single query:
SELECT *
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
It would return everything, including your match buyer_id/user_id only. It won't match a buyer_id/user_id row with no matches in either time or product in the other table, but it will return it as a separate row with NULLS in the other table's fields. I would not match these any way - there is no accurate information provided to do it with as explained below.
END EDIT PART 1
If you can't do FULL OUTER JOIN
with OR
in Hive, the simplest way to meet the original criteria is to UNION ALL
2 INNER JOIN
s. On one of the queries, in addition to joining the matching user_ids, join on the PRODUCT_ID AND in your WHERE
look for TIMESTAMPS that don't match CREATED_TIME. On the second query, in addition to joining the matching user_ids, join on the times AND in your WHERE
look for products that don't match.
EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA
If I understand the last criteria it is any record in either table that has a matching user_id = buyer_id, but nothing else matches. The FULL OUTER JOIN
with OR
condition will return them, but there isn't enough provided info for a way to relate the records to each other. We can easily identify them, but have no way to tie them back to each other. If you do so and you have more than one record without a match in either OR both tables, there are going to be multiple entries for each.
Any query I wrote to try to tie them without more info (and probably with) would be a guess and inaccurate.
For example, in the first table if there were these 2 (sample fake) records with nothing matching in the second except user_id:
1015826235 420003038067 2011-11-03 19:40:21.000
1015826235 720003038067 2004-11-03 19:40:21.000
AND in table2 - these non matching:
1015826235 {"product_id":520003038067,"timestamps":"10...
1015826235 {"product_id":620003038067,"timestamps":"10...
You can identify them, but if you match them without more criteria you get 4 instead of 2:
1015826235 420003038067 2011-11-03 19:40:21.000 1015826235 520003038067
1015826235 420003038067 2011-11-03 19:40:21.000 1015826235 620003038067
1015826235 720003038067 2004-11-03 19:40:21.000 1015826235 520003038067
1015826235 720003038067 2004-11-03 19:40:21.000 1015826235 620003038067
My suggestion would be simply to identify them and show them, as below.
BUYER_ID ITEM_ID CREATED_TIME USER_ID PRODUCTID timestamps
----------------------------------------------------------------------
NULL NULL NULL 1015826235 520003038067 2009-11-11 22:21:11.000
NULL NULL NULL 1015826235 620003038067 2008-11-11 22:21:11.000
1015826235 420003038067 2011-11-03 19:40:21.000 NULL NULL NULL
1015826235 720003038067 2004-11-03 19:40:21.000 NULL NULL NULL
END EDIT PART 2 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA - PART 1
I am working with TSQL, so I can't test for you an exact query with your syntax, but the concepts of the joins are the same, and this will return what you want. I did take your query and attempt your syntax, modify as needed. I tested in TSQL. You may be able to take this and improve upon it with functionality in HiveQL. There are other ways to do this - but this is the most straightforward and this will translate to HiveQL.
REMOVED, YOU GOT THIS PART AND IT IS INCLUDED LATER
(Again modify syntax as needed)**
SELECT *
FROM (
SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS
FROM testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.timestamps = UNIX_TIMESTAMP (table2.created_time)
WHERE prod_and_ts.product_id <> A.ITEM_ID
UNION ALL
SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS
FROM testingtable2 LATERAL VIEW
explode(purchased_item) exploded_table as prod_and_ts)
prod_and_ts
INNER JOIN table2 A ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.product_id = A.ITEM_ID
WHERE prod_and_ts.timestamps <> UNIX_TIMESTAMP (table2.created_time)
) X
And here is my tested TSQL version with my table names for reference:
SELECT *
FROM(
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.t1time = A.Created_TIME
WHERE B.PRODUCTID <> A.ITEM_ID
UNION ALL
SELECT *
FROM SO_Table1HIVE A
INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.PRODUCTID = A.ITEM_ID
WHERE B.t1time <> A.Created_TIME
) X
*EDIT PART 3 - UPDATE FOR COMMENT QUESTION ADDITIONAL CRITERIA -PART 2
In TSQL the entire query (no unions) can be run using a FULL OUTER JOIN
with an OR
condition on the join
SELECT *
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
If you can't simply do the above, For the SQL logic for the new criteria - to grab those that don't match from both tables and display them as NULL
in the other table use RIGHT JOIN
and LEFT JOIN
.
RIGHT JOIN
will grab anything in the first table the matches the second and everything in the second, and LEFT
does the opposite. Add the new queries to your UNION
.
TSQL EXAMPLE - MODIFY FOR HIVE
SELECT *
FROM SO_Table1HIVE A
RIGHT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
WHERE A.BUYER_ID IS NULL
UNION ALL
SELECT *
FROM SO_Table1HIVE A
LEFT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)
WHERE B.[USER_ID] IS NULL
Or, If you wanted to grab them and match them as duplicates add to UNION
:
TSQL
SELECT *
FROM SO_Table1HIVE A
JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID]
WHERE B.t1time NOT IN(SELECT Created_TIME FROM SO_Table1HIVE)
AND A.Created_TIME NOT IN(SELECT t1time FROM SO_Table2HIVE)
AND B.PRODUCTID NOT IN(SELECT ITEM_ID FROM SO_Table1HIVE)
AND A.ITEM_ID NOT IN(SELECT PRODUCTID FROM SO_Table2HIVE)
Again, Good luck!
这篇关于使用HiveQL(Hadoop)在Hive中加入两个表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!