De-duplicating rows in a table with respect to certain columns and retaining the corresponding values in the other columns in HIVE

Problem description

I need to create a temporary table in HIVE from an existing table that has 7 columns. I want to remove duplicates with respect to the first three columns while retaining the corresponding values in the other 4 columns. I don't care which row is actually dropped, as long as de-duplication is based on the first three columns alone.

Solution

You could use something like the following if you are not concerned about ordering:

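create table table2 as 
select col1, col2, col3
      -- split() takes a regular expression, so the pipe has to be escaped
      ,split(agg_col,"\\|")[0] as col4
      ,split(agg_col,"\\|")[1] as col5
      ,split(agg_col,"\\|")[2] as col6
      ,split(agg_col,"\\|")[3] as col7
from (select col1, col2, col3,
             -- max() keeps one concatenated col4..col7 combination per group;
             -- concat() yields NULL if any input is NULL, and the values come back as strings
             max(concat(cast(col4 as string),"|", 
                        cast(col5 as string),"|",
                        cast(col6 as string),"|",
                        cast(col7 as string))) as agg_col
      from table1
      group by col1, col2, col3 ) A;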
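
Hive's split() treats its second argument as a regular expression, so a literal pipe needs escaping before it works as a delimiter. A quick way to check the behaviour is to run it on a literal string; the values below are made up for illustration and are not taken from table1:

```sql
-- Illustrative literals only (assumed values, not data from table1).
-- "\\|" escapes the pipe so split() treats it as a literal delimiter.
select split('10|x|2.5|y', "\\|")[0] as col4,
       split('10|x|2.5|y', "\\|")[3] as col7;
```

Note that this approach also assumes none of col4 to col7 contains a pipe character itself.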

Below is another approach, which gives more control over ordering but is slower than the approach above:

create table table2 as 
select col1, col2, col3,
       max(col4) as col4, max(col5) as col5, max(col6) as col6, max(col7) as col7
from (select col1, col2, col3, col4, col5, col6, col7,
             rank() over ( partition by col1, col2, col3 
                           order by col4 desc, col5 desc, col6 desc, col7 desc ) as col_rank
      from table1 ) A
where A.col_rank = 1
group by col1, col2, col3;

The rank() over(..) function returns more than one row with rank 1 if the order by columns are all equal. In our case, if two rows have exactly the same values in all seven columns, there will be duplicates after filtering on col_rank = 1. These duplicates are eliminated by the max and group by clauses in the query above.
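
As a possible variation that is not part of the original answer, row_number() gives tied rows distinct numbers, so filtering on the first row should leave exactly one row per (col1, col2, col3) without the outer max and group by. A minimal sketch, assuming the same table1 and column names; table2_rn and rn are hypothetical names chosen for illustration:

```sql
-- Sketch only: table2_rn and rn are hypothetical names; table1 and col1..col7 follow the answer.
create table table2_rn as
select col1, col2, col3, col4, col5, col6, col7
from (select col1, col2, col3, col4, col5, col6, col7,
             -- row_number() never repeats a number within a partition, even for ties
             row_number() over ( partition by col1, col2, col3
                                 order by col4 desc, col5 desc, col6 desc, col7 desc ) as rn
      from table1 ) A
where A.rn = 1;
```

Because the question does not care which duplicate survives, the order by columns here mainly make the choice repeatable.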
