Grep only one of partial duplicates

Views: 71 Published: 2016/7/28 16:45:17 Tags: awk grep duplicate-removal

Problem description

I have the following file:


20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA

This is a semicolon-separated file with 4 columns. The combination of columns 2 and 3, however, must be unique. Since this dataset has millions of rows, I'm looking for an efficient way to keep only the first occurrence of every duplicate. I therefore need to match on the combination of columns 2 and 3 only (a partial match of the line) and then select the first line for each combination.

The expected result should be:


20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9 #REMOVED
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2 #REMOVED
20130308;343380;7591;NA #REMOVED

I have made a few attempts myself. The first one is:

grep -oE "\;(.*);" orders_20130304to20140219_v3.txt | uniq

However, this selects only columns 2 and 3 and removes all other data. Furthermore, it does not take into account a match that occurs later in the file, because uniq only compares adjacent lines. I can fix that by adding sort, but I prefer not to sort.

Another attempt is:

awk '!x[$0]++' test.txt

This does not require any sorting, but it matches on the complete line.

I think the second attempt is close, but it needs to be changed so that it only looks at the second and third columns instead of the whole line. Does anyone know how to incorporate this?
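One possible way to adapt the second attempt, offered as a minimal sketch rather than a definitive answer: set the field separator to `;` and key the array on columns 2 and 3 instead of `$0`. The array name `seen` and the temporary file are illustrative assumptions; the sample rows are the ones from the question.

```shell
# Hypothetical temporary input file holding the sample rows from the question.
input=$(mktemp)
cat > "$input" <<'EOF'
20130304;114137911;8051;somevalue1
20130304;343268;7591;NA
20130304;379612;7501;somevalue2
20130304;343380;7591;somevalue8
20130304;343380;7591;somevalue9
20130304;343212;7591;NA
20130304;183278;7851;somevalue3
20130304;114141486;8051;somevalue5
20130304;114143219;8051;somevalue6
20130304;343247;7591;NA
20130304;379612;7501;somevalue2
20130308;343380;7591;NA
EOF

# -F';' splits each line on semicolons.
# seen[$2,$3] counts how often the (column 2, column 3) pair has appeared.
# The pattern !seen[$2,$3]++ is true only the first time a pair is seen,
# so awk prints that line and skips every later line with the same pair.
result=$(awk -F';' '!seen[$2,$3]++' "$input")
printf '%s\n' "$result"

rm -f "$input"
```

Like the original `awk '!x[$0]++'`, this reads the file once and needs no sorting, but it holds one array entry per distinct pair in memory, which is worth keeping in mind for a file with millions of rows.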
