Removing DUPLICATE rows in hive based on columns
I have a Hive table with 10 columns. The first 9 columns can contain duplicate rows, while the 10th column, CREATE_DATE, holds the date the row was created and so can differ between otherwise identical rows.
Example:
If I insert 10 rows into the table today, they will have today's date as CREATE_DATE. If I insert the same 10 rows again tomorrow, they will have a different CREATE_DATE, which prevents me from using DISTINCT.
Is there a way to delete the duplicate records based on the 9 columns while ignoring the 10th?
Example: let's say the table has 5 columns. This is an EXTERNAL Hive table partitioned by DAYID and MARKETID. Whenever the columns other than CREATEDATE are the same (as in rows 1 and 2), or the rows are exact duplicates (as in rows 3 and 4), only one of those rows should be retained. It doesn't matter which one.
COL1  COL2  CREATEDATE  DAYID     MARKETID
A     1     20131206    20131207  1234
A     1     20131207    20131207  1234
A     1     20131206    20131207  1234
B     1     20131206    20131207  1234
B     1     20131206    20131207  1234
C     2     20131206    20131207  1234
C     2     20131207    20131207  5678
Output:
COL1  COL2  CREATEDATE  DAYID     MARKETID
A     1     20131206    20131207  1234
B     1     20131206    20131207  1234
C     2     20131206    20131207  1234
C     2     20131207    20131207  5678
Thanks, Nates
You can do the following:
select col1,col2,dayid,marketid,max(createdate) as createdate
from tablename
group by col1,col2,dayid,marketid
This groups the data by every column except createdate, so rows with the same values in those columns fall into the same group. You then choose which createdate to keep with an aggregate function such as max or min.
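To check how the GROUP BY approach behaves on the sample rows from the question, here is a minimal sketch that runs the same query against an in-memory SQLite database from Python. SQLite stands in for Hive only so the example is runnable locally, and the helper name dedupe and the agg parameter are hypothetical, not part of the original answer. With min the result matches the output listed in the question; with max the A row keeps 20131207 instead, which is still acceptable since any one row per group may be kept.

```python
import sqlite3

# Sample rows from the question: (col1, col2, createdate, dayid, marketid)
ROWS = [
    ("A", 1, 20131206, 20131207, 1234),
    ("A", 1, 20131207, 20131207, 1234),
    ("A", 1, 20131206, 20131207, 1234),
    ("B", 1, 20131206, 20131207, 1234),
    ("B", 1, 20131206, 20131207, 1234),
    ("C", 2, 20131206, 20131207, 1234),
    ("C", 2, 20131207, 20131207, 5678),
]


def dedupe(rows, agg="max"):
    """Return one row per (col1, col2, dayid, marketid) group.

    Hypothetical helper: SQLite is used here as a local stand-in for Hive.
    agg selects which createdate survives within each group.
    """
    if agg not in ("min", "max"):
        raise ValueError("agg must be 'min' or 'max'")
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tablename ("
            "col1 TEXT, col2 INTEGER, createdate INTEGER, "
            "dayid INTEGER, marketid INTEGER)"
        )
        conn.executemany("INSERT INTO tablename VALUES (?, ?, ?, ?, ?)", rows)
        # Same shape as the query in the answer; ORDER BY is added only to
        # make the result order deterministic for this demonstration.
        query = (
            f"SELECT col1, col2, {agg}(createdate) AS createdate, dayid, marketid "
            "FROM tablename "
            "GROUP BY col1, col2, dayid, marketid "
            "ORDER BY col1, col2, dayid, marketid"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in dedupe(ROWS, agg="min"):
        print(row)
```

Note that the query only selects the deduplicated rows. Actually replacing the table contents in Hive would need a separate step, such as writing the result back over the affected partitions, which is not shown here.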