比较Hive中表的两个分区 [英] Compare two partitions of table in Hive

查看:62
本文介绍了比较Hive中表的两个分区的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要比较Hive中表的两个分区中的数据更改.具体来说,我有两个分区(ptn_dt = '01 -31-2019'和ptn_dt = '02 -28-2019').每个分区包含以下列-num_key和active_indicator(是或否).

I need to compare data changes in two partitions of table in Hive. Specifically, I have two partitions (ptn_dt='01-31-2019' and ptn_dt='02-28-2019'). Each partition contains the following columns - num_key and active_indicator (true or false).

我需要确定以下内容:

到目前为止,我已成功获取每个分区的计数,并使用下面的Hive SQL查询按active_indicator进行了分组.但是,尝试在分区之间进行比较时遇到麻烦.

So far, I was successful in getting the count per partition, grouped by active_indicator using the Hive SQL query below. However, I am having trouble when trying to compare in between partitions.

"""SELECT active_indicator,count(*) from table_name WHERE ptn_dt='2019-01-31' GROUP BY active_indicator"""

使用示例数据编辑并获得所需结果:

EDITED WITH SAMPLE DATA AND DESIRED RESULTS:

ptn_dt ='2019-01-31'

ptn_dt = '2019-01-31'

<身体>
num_key active_indicator
111 true
112 false
113 false
114 false
115 true
116 true

ptn_dt ='2019-02-28'

ptn_dt = '2019-02-28'

<身体>
num_key active_indicator
111 true
112 false
113 true
114 true
115 true
116 true
117 true
118 false
119 true

输出:

推荐答案

您可以使用条件聚合.这会将比较结果放在同一行:

You can use conditional aggregation. This puts the comparisons in the same row:

SELECT active_indicator,
       SUM(CASE WHEN ptn_dt = '2019-01-31' THEN 1 ELSE 0 END),
       SUM(CASE WHEN ptn_dt = '2019-02-28' THEN 1 ELSE 0 END)
FROM table_name
WHERE ptn_dt IN ('2019-01-31', '2019-02-28')
GROUP BY active_indicator;

或者,在不同的行中,您可以使用:

Or, in different rows, you could use:

SELECT active_indicator, ptn_dt, COUNT(*)
FROM table_name
WHERE ptn_dt IN ('2019-01-31', '2019-02-28')
GROUP BY active_indicator, ptn_dt;

根据您的评论,使用 lag().对于所有组合:

Based on your comment, use lag(). For all combinations:

select prev_active_indicator, active_indicator, count(*)
from (select t.*,
             lag(active_indicator) over (partition by num_key order by ptn_dt) as prev_active_indicator
      from table_name t
      where ptn_dt IN ('2019-01-31', '2019-02-28')
     ) t
where ptn_dt = '2019-02-28'
group by prev_active_indicator, active_indicator;

这篇关于比较Hive中表的两个分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆