Removing DUPLICATE rows in hive based on columns


Problem description


I have a Hive table with 10 columns. The first 9 columns can contain duplicate rows, while the 10th column, CREATE_DATE, does not, because it holds the date the row was created.

Example:

If I insert 10 rows into the table today, they will have today's date as CREATE_DATE. If I insert the same 10 rows again tomorrow, they will have a different CREATE_DATE, which prevents me from using DISTINCT to remove them.

Is there a way to delete the duplicate records based on the 9 columns while ignoring the 10th?

Example: Let's say I have 5 columns in the table. This is an EXTERNAL Hive table partitioned by DAYID and MARKETID. Whenever the columns other than CREATEDATE are the same (as in rows 1 and 2), or the rows are complete duplicates (as in rows 3 and 4), it should retain any one of those rows. It doesn't matter which one it retains.

COL1 COL2 CREATEDATE   DAYID    MARKETID  
A     1    20131206   20131207 1234  
A     1    20131207   20131207 1234  
A     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
B     1    20131206   20131207 1234  
C     2    20131206   20131207 1234  
C     2    20131207   20131207 5678 

Expected output:

COL1 COL2 CREATEDATE   DAYID    MARKETID
A     1    20131206   20131207   1234
B     1    20131206   20131207   1234
C     2    20131206   20131207   1234
C     2    20131207   20131207   5678

Thanks,
Nates

Solution

You can do the following:

select col1,col2,dayid,marketid,max(createdate) as createdate
from tablename
group by col1,col2,dayid,marketid

This way you are grouping the data by all the columns except the date, so rows with the same values in those columns fall into the same group. You then "choose" the createdate you want for each group with an aggregate function such as max or min.
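To check the grouping logic against the sample rows from the question, here is a minimal sketch that runs the same query in an in-memory SQLite database through Python. SQLite stands in for Hive here only as an assumption for local experimentation; the `dedupe` helper and the `ROWS` list are hypothetical names introduced for this illustration. With `max` the group for `A` keeps 20131207, while `min` reproduces the exact output listed in the question; since the asker does not care which row survives, either choice works.

```python
import sqlite3

# Sample rows from the question: (col1, col2, createdate, dayid, marketid)
ROWS = [
    ("A", 1, 20131206, 20131207, 1234),
    ("A", 1, 20131207, 20131207, 1234),
    ("A", 1, 20131206, 20131207, 1234),
    ("B", 1, 20131206, 20131207, 1234),
    ("B", 1, 20131206, 20131207, 1234),
    ("C", 2, 20131206, 20131207, 1234),
    ("C", 2, 20131207, 20131207, 5678),
]


def dedupe(rows, aggregate="max"):
    """Return one row per (col1, col2, dayid, marketid) group.

    `aggregate` picks which createdate survives in each group.
    SQLite is used as a local stand-in for Hive (an assumption).
    """
    if aggregate not in ("max", "min"):
        raise ValueError("aggregate must be 'max' or 'min'")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tablename ("
            "col1 TEXT, col2 INTEGER, createdate INTEGER, "
            "dayid INTEGER, marketid INTEGER)"
        )
        conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?, ?)", rows)
        # Same shape as the answer's query; ORDER BY only makes the
        # result order deterministic for inspection.
        query = f"""
            SELECT col1, col2, {aggregate}(createdate) AS createdate,
                   dayid, marketid
            FROM tablename
            GROUP BY col1, col2, dayid, marketid
            ORDER BY col1, col2, dayid, marketid
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in dedupe(ROWS, "min"):
        print(row)
```

Note that this only selects the deduplicated rows. Writing them back to the Hive table is a separate step that the answer does not cover.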
