使用 HiveQL(Hadoop) 在 Hive 中连接两个表 [英] Joining two Tables in Hive using HiveQL(Hadoop)

查看:50
本文介绍了使用 HiveQL(Hadoop) 在 Hive 中连接两个表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

可能的重复:
SQL Query JOIN with Table

CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable1   (This is the MAIN table through which comparisons need to be made)
(
BUYER_ID BIGINT,
ITEM_ID BIGINT,
CREATED_TIME STRING
)

这是上面第一个表中的数据

And this is the data in the above first table

**BUYER_ID**  |  **ITEM_ID**     |      **CREATED_TIME**   
--------------+------------------+-------------------------
 1015826235      220003038067       *2001-11-03 19:40:21*
 1015826235      300003861266        2001-11-08 18:19:59
 1015826235      140002997245        2003-08-22 09:23:17
 1015826235     *210002448035*       2001-11-11 22:21:11

这是 Hive 中的第二个表 - 它还包含有关我们正在购买的物品的信息.

This is Second table in Hive- It also contains information about the items we are purchasing.

CREATE EXTERNAL TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)

这是上面第二个表中的数据(TestingTable2)-

And this is the data in the above second table (TestingTable2)-

**USER_ID**    **PURCHASED_ITEM**
1015826235     [{"product_id":220003038067,"timestamps":"1004941621"},    {"product_id":300003861266,"timestamps":"1005268799"},    {"product_id":140002997245,"timestamps":"1061569397"},{"product_id":200002448035,"timestamps":"1005542471"}]

比较 TestingTable2TestingTable1 以便满足下面的场景.

Compare TestingTable2 with TestingTable1 so that below scenario is fulfilled.

TestingTable2 中找到与 ITEM_IDCREATED_TIME 不匹配的 PRODUCT_IDTIMESTAMPS<与 TestingTable1 比较后,来自 TestingTable1 的/code> 对应于 BUYER_ID(USER_ID).

Find the PRODUCT_ID AND TIMESTAMPS from TestingTable2 WHICH IS NOT MATCHING WITH ITEM_ID AND CREATED_TIME from TestingTable1 CORRESPONDING TO BUYER_ID(USER_ID) after comparing from TestingTable1.

因此,如果您查看 TestingTable2 数据,TestingTable1 中的此(最后)ITEM_ID 210002448035TestingTable2 不匹配PRODUCT_ID- 200002448035 数据以及类似的时间戳.所以我想使用 HiveQL 查询显示以下结果.

So If you look TestingTable2 data this(last) ITEM_ID 210002448035 from TestingTable1 is not matching with TestingTable2 PRODUCT_ID- 200002448035 data and similarly with timestamps. So I want to show the below result using the HiveQL query.

**BUYER_ID**  |  **ITEM_ID**     |      **CREATED_TIME**          |  **PRODUCT_ID**    |     **TIMESTAMPS** 
--------------+------------------+--------------------------------+------------------------+----------------------
1015826235          *210002448035*           2001-11-11 22:21:11            200002448035           1005542471 
1015826235       220003038067           *2001-11-03 19:40:21*           220003038067          1004941621

谁能帮我解决这个问题.因为我是 HiveQL 的新手,所以有很多问题.

Can anyone help me with this. As I am new to HiveQL so having lot of Problem.

更新:-

我已经编写了这个查询,但它没有按照我想要的方式工作.

I have written this query, but it doesn't working the way I wanted to.

select * from 
  (select * from 
     (select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps 
      from testingtable2 LATERAL VIEW
      explode(purchased_item) exploded_table as prod_and_ts)
      prod_and_ts 
      LEFT OUTER JOIN testingtable1 
  ON ( prod_and_ts.user_id = testingtable1.buyer_id AND testingtable1.item_id =    prod_and_ts.product_id
     AND prod_and_ts.timestamps = UNIX_TIMESTAMP (testingtable1.created_time)
  )
  where testingtable1.buyer_id IS NULL) 
  set_a LEFT OUTER JOIN testingtable1 
  ON (set_a.user_id = testingtable1.buyer_id AND  
  ( set_a.product_id = testingtable1.item_id OR set_a.timestamps = UNIX_TIMESTAMP(testingtable1.created_time) )
 );

又一更新

根据 user1166147 评论.我按照他的查询写了我的查询.在蜂巢中,我猜 INNER JOIN 是由简单的 JOIN 编写的.

As per user1166147 comments. I wrote my query as per his query. In hive I guess INNER JOIN are written by simply JOIN.

这是我的以下查询.

select * from (select t2.buyer_id, t2.item_id, t2.created_time as created_time, subq.user_id, subq.product_id, subq.timestamps as timestamps 
from
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id 
AND subq.timestamps = unix_timestamp(t2.created_time)
WHERE (subq.product_id <> t2.item_id)
union all
select t2.buyer_id, t2.item_id as item_id, t2.created_time, subq.user_id, subq.product_id as product_id, subq.timestamps
from 
(select user_id, prod_and_ts.product_id as product_id, prod_and_ts.timestamps as timestamps from testingtable2 lateral view explode(purchased_item) exploded_table as prod_and_ts) subq JOIN testingtable1 t2 on t2.buyer_id = subq.user_id 
    and subq.product_id = t2.item_id 
    WHERE (subq.timestamps <> unix_timestamp(t2.created_time))) unionall;

在运行上述查询后,我得到的结果为零.

And after running the above query, I am getting zero result back.

最后一次更新:-

糟糕,表格中没有准确的数据,这就是我没有得到结果的原因.是的,它正在实际执行上述查询.

My Bad, I didn't have the accurate data in the tables so that is the reason I was not getting the result back. Yes it is working the actual above query.

推荐答案

编辑 - 第 1 部分好的 - 出于某种原因,我要解释自己 - 所以首先我因为 SQL 标签偶然发现了这个问题,看到了 Hive,然后开始不看,只是跳过它.但后来我注意到已经过了一天,你没有得到任何答复.我看了看 - 我在发布的原始查询中看到了 SQL 逻辑更正,我知道我知道需要并且会提供帮助,所以我发布只是因为没有人回答.我将尝试解决最后一个问题 - 但在那之后我会保留我的建议,因为我可能会给出不好的建议.祝你好运!我试过!你现在似乎得到了答案,所以...

EDIT - PART 1 Okay - For some reason I am going to explain myself - so to start with I stumbled upon this question because of the SQL tag, and saw Hive, and started to not look and just skip it. BUT then I noticed it had been over a day and you had gotten no answers. I looked - I saw a SQL logic correction in the original query posted that I knew would be needed and would help, so I posted ONLY because no one had answered. I will try to address this last question - but after that I am keeping my advice to myself, as I may be giving bad advice. Good luck! I tried! And you seem to be getting answers now, so...

在 TSQL 中,我可以通过以下单个查询解决整个问题:

SELECT * 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)

它会返回所有内容,仅包括匹配的buyer_id/user_id.它不会匹配另一个表中的时间或产品中没有匹配项的buyer_id/user_id 行,但它会将其作为一个单独的行返回,在另一个表的字段中为 NULLS.我不会以任何方式匹配这些 - 没有提供准确的信息来做,如下所述.

It would return everything, including your match buyer_id/user_id only. It won't match a buyer_id/user_id row with no matches in either time or product in the other table, but it will return it as a separate row with NULLS in the other table's fields. I would not match these any way - there is no accurate information provided to do it with as explained below.

结束编辑第 1 部分

如果你不能在Hive中用ORFULL OUTER JOIN,满足原始条件的最简单方法是UNION ALL2 INNER JOINs.在其中一个查询中,除了加入匹配的 user_ids 之外,还加入 PRODUCT_ID 并在您的 WHERE 中查找与 CREATED_TIME 不匹配的 TIMESTAMPS.在第二个查询中,除了加入匹配的 user_ids 之外,在 WHERE 中加入时间 AND 查找不匹配的产品.

If you can't do FULL OUTER JOIN with OR in Hive, the simplest way to meet the original criteria is to UNION ALL 2 INNER JOINs. On one of the queries, in addition to joining the matching user_ids, join on the PRODUCT_ID AND in your WHERE look for TIMESTAMPS that don't match CREATED_TIME. On the second query, in addition to joining the matching user_ids, join on the times AND in your WHERE look for products that don't match.

编辑第 2 部分 - 评论问题附加标准的更新

如果我理解最后一个标准,它是任一表中具有匹配 user_id = purchase_id 的任何记录,但没有其他匹配.带有 OR 条件的 FULL OUTER JOIN 将返回它们,但没有提供足够的信息来将记录相互关联.我们可以很容易地识别它们,但无法将它们相互联系起来.如果您这样做并且您有多个记录在任一或两个表中没有匹配项,则每个表将有多个条目.

If I understand the last criteria it is any record in either table that has a matching user_id = buyer_id, but nothing else matches. The FULL OUTER JOIN with OR condition will return them, but there isn't enough provided info for a way to relate the records to each other. We can easily identify them, but have no way to tie them back to each other. If you do so and you have more than one record without a match in either OR both tables, there are going to be multiple entries for each.

我写的任何试图在没有更多信息(可能有)的情况下将它们联系起来的查询都是猜测和不准确的.

例如,在第一个表中,如果有这 2 个(样本假)记录,第二个中除了 user_id 外没有任何匹配:

For example, in the first table if there were these 2 (sample fake) records with nothing matching in the second except user_id:

1015826235  420003038067    2011-11-03 19:40:21.000
1015826235  720003038067    2004-11-03 19:40:21.000

表 2 中的 AND - 这些不匹配:

AND in table2 - these non matching:

1015826235  {"product_id":520003038067,"timestamps":"10...
1015826235  {"product_id":620003038067,"timestamps":"10...

您可以识别它们,但如果您在没有更多条件的情况下匹配它们,则会得到 4 而不是 2:

You can identify them, but if you match them without more criteria you get 4 instead of 2:

1015826235  420003038067    2011-11-03 19:40:21.000 1015826235 520003038067
1015826235  420003038067    2011-11-03 19:40:21.000 1015826235 620003038067
1015826235  720003038067    2004-11-03 19:40:21.000 1015826235 520003038067
1015826235  720003038067    2004-11-03 19:40:21.000 1015826235 620003038067

我的建议是简单地识别它们并显示它们,如下所示.

My suggestion would be simply to identify them and show them, as below.

BUYER_ID        ITEM_ID      CREATED_TIME           USER_ID PRODUCTID   timestamps  
----------------------------------------------------------------------
NULL            NULL         NULL                   1015826235  520003038067    2009-11-11 22:21:11.000
NULL            NULL         NULL                   1015826235  620003038067    2008-11-11 22:21:11.000
1015826235      420003038067 2011-11-03 19:40:21.000    NULL    NULL    NULL    
1015826235      720003038067 2004-11-03 19:40:21.000    NULL    NULL    NULL    

结束编辑第 2 部分 - 评论问题附加标准的更新 - 第 1 部分

我正在使用 TSQL,因此我无法使用您的语法为您测试准确的查询,但是连接的概念是相同的,这将返回您想要的内容.我确实接受了您的查询并尝试了您的语法,并根据需要进行了修改.我在 TSQL 中测试过.您可以使用 HiveQL 中的功能来改进它.还有其他方法可以做到这一点 - 但这是最​​直接的,这将转化为 HiveQL.

I am working with TSQL, so I can't test for you an exact query with your syntax, but the concepts of the joins are the same, and this will return what you want. I did take your query and attempt your syntax, modify as needed. I tested in TSQL. You may be able to take this and improve upon it with functionality in HiveQL. There are other ways to do this - but this is the most straightforward and this will translate to HiveQL.

已移除,您已获得此部分,稍后再包含

(再次根据需要修改语法)**

SELECT *
FROM (
    SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS 
    FROM testingtable2 LATERAL VIEW
        explode(purchased_item) exploded_table as prod_and_ts)
        prod_and_ts 
    INNER JOIN table2 A  ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.timestamps = UNIX_TIMESTAMP (table2.created_time) 
    WHERE prod_and_ts.product_id <> A.ITEM_ID
    UNION ALL
    SELECT BUYER_ID,ITEM_ID,CREATED_TIME,PRODUCT_ID,TIMESTAMPS 
    FROM testingtable2 LATERAL VIEW
            explode(purchased_item) exploded_table as prod_and_ts)
            prod_and_ts 
    INNER JOIN table2 A  ON A.BUYER_ID = prod_and_ts.[USER_ID] AND prod_and_ts.product_id = A.ITEM_ID
    WHERE  prod_and_ts.timestamps <> UNIX_TIMESTAMP (table2.created_time) 
) X

这是我的测试过的 TSQL 版本,其中包含我的表名以供参考:

SELECT * 
FROM(
    SELECT *
    FROM SO_Table1HIVE A
    INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.t1time = A.Created_TIME 
    WHERE B.PRODUCTID <> A.ITEM_ID
    UNION ALL
    SELECT * 
    FROM SO_Table1HIVE A
    INNER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND B.PRODUCTID = A.ITEM_ID  
    WHERE B.t1time <> A.Created_TIME  
 ) X

*编辑第 3 部分 - 评论问题附加标准的更新 - 第 2 部分

在 TSQL 中,可以使用 FULL OUTER JOINOR 条件来运行整个查询(无联合)

In TSQL the entire query (no unions) can be run using a FULL OUTER JOIN with an OR condition on the join

SELECT * 
FROM SO_Table1HIVE A
FULL OUTER JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR B.PRODUCTID = A.ITEM_ID)

如果您不能简单地执行上述操作,对于新条件的 SQL 逻辑 - 从两个表中获取不匹配的内容并将它们显示为 NULL 在另一个表中使用RIGHT JOINLEFT JOIN.RIGHT JOIN 将抓取第一个表中的任何内容,匹配第二个表中的所有内容,而 LEFT 则相反.将新查询添加到您的 UNION.

If you can't simply do the above, For the SQL logic for the new criteria - to grab those that don't match from both tables and display them as NULL in the other table use RIGHT JOIN and LEFT JOIN. RIGHT JOIN will grab anything in the first table the matches the second and everything in the second, and LEFT does the opposite. Add the new queries to your UNION.

TSQL 示例 - 修改 HIVE

TSQL EXAMPLE - MODIFY FOR HIVE

SELECT * 
FROM SO_Table1HIVE A
RIGHT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME    OR B.PRODUCTID = A.ITEM_ID)
WHERE A.BUYER_ID IS NULL 
UNION ALL
SELECT * 
FROM SO_Table1HIVE A
LEFT JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] AND (B.t1time = A.Created_TIME OR    B.PRODUCTID = A.ITEM_ID)
WHERE B.[USER_ID] IS NULL

或者,如果您想抓取它们并将它们匹配为重复项添加到 UNION:

Or, If you wanted to grab them and match them as duplicates add to UNION:

TSQL

SELECT * 
FROM SO_Table1HIVE A
JOIN SO_Table2HIVE B ON A.BUYER_ID = B.[USER_ID] 
WHERE B.t1time NOT IN(SELECT Created_TIME FROM SO_Table1HIVE)
AND A.Created_TIME  NOT IN(SELECT t1time FROM SO_Table2HIVE) 
AND B.PRODUCTID NOT IN(SELECT ITEM_ID FROM SO_Table1HIVE)
AND A.ITEM_ID NOT IN(SELECT PRODUCTID FROM SO_Table2HIVE) 

再次祝你好运!

这篇关于使用 HiveQL(Hadoop) 在 Hive 中连接两个表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆