Remove redundancy in a file based on two fields, using awk

Question

I'm trying to remove duplicate lines in a very large file (~100,000 records) according to the values of the first two columns, without taking their order into account, and then print those fields plus the other columns.

So, from this input:

A B XX XX
A C XX XX
B A XX XX
B D XX XX
B E XX XX
C A XX XX

I would like to get:

A B XX XX
A C XX XX
B D XX XX
B E XX XX

(That is, I want to remove 'B A' and 'C A' because they already appear in the opposite order. I don't care what's in the remaining columns, but I want to print them too.)

I have the impression that this should be easy to do with awk + arrays, but I can't come up with a solution.

So far, I'm tinkering with this:

awk '
NR == FNR {
h[$1] = $2   
next
}
$1 in h {
print h[$1],$2}' input.txt

I'm storing the second column in an array (h) indexed by the first, and then checking whether the first field occurs in the stored array. If it does, I print the line. But something's wrong and I get no output.

I'm sorry that my code is not much help, but I'm kind of stuck with this.

Do you have any ideas?

Thanks a lot!

Recommended answer

Just keep track of the pairs that have appeared, in either of the two orders:

$ awk '!seen[$1,$2]++ && !seen[$2,$1]++' file
A B XX XX
A C XX XX
B D XX XX
B E XX XX

This is equivalent to awk '!(seen[$1,$2]++ || seen[$2,$1]++)' file.

Note that it is also equivalent to dropping the ++ from the second expression (see comments):

awk '!seen[$1,$2]++ && !seen[$2,$1]' file
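
As a quick sanity check, the three forms can be run side by side on the sample input from the question. This is only a sketch: the helper function `sample` and the variable names `out1`, `out2`, `out3` are made up for illustration and are not part of the original answer.

```shell
# Hypothetical helper that emits the sample input from the question.
sample() {
    printf '%s\n' 'A B XX XX' 'A C XX XX' 'B A XX XX' \
                  'B D XX XX' 'B E XX XX' 'C A XX XX'
}

# Form 1: increment both index orders.
out1=$(sample | awk '!seen[$1,$2]++ && !seen[$2,$1]++')
# Form 2: the same condition rewritten with a single negation.
out2=$(sample | awk '!(seen[$1,$2]++ || seen[$2,$1]++)')
# Form 3: only test (do not increment) the reversed index.
out3=$(sample | awk '!seen[$1,$2]++ && !seen[$2,$1]')

printf '%s\n' "$out1"
```

On this input all three variables should hold the same four lines (A B, A C, B D, B E).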

Explanation

The typical approach to print unique lines is:

awk '!seen[$0]++' file

This creates an array seen[] whose indexes are the lines that have appeared so far. If a line is new, seen[$0] is 0 and gets incremented to 1. The line is still printed, because ++ here is a post-increment: the expression !seen[$0]++ negates the old value (0), which gives true, and in awk a true pattern with no action prints the current line. When the line has been seen already, seen[$0] holds a positive value, so !seen[$0] is false and nothing is printed.
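
To make the post-increment behaviour visible, the following sketch prints, for each input line, the value that seen[$0]++ returns and the value stored afterwards. The input lines (a, b, a, a) and the variable names `trace` and `uniq_lines` are invented for this illustration.

```shell
# For each line: the line itself, the value returned by seen[$0]++
# (the count before this line), and the count stored afterwards.
trace=$(printf '%s\n' a b a a |
    awk '{ old = seen[$0]++; print $0, old + 0, seen[$0] }')

# The usual idiom: print a line only when the returned (old) value is 0.
uniq_lines=$(printf '%s\n' a b a a | awk '!seen[$0]++')

printf '%s\n' "$trace"
printf '%s\n' "$uniq_lines"
```

Here `trace` comes out as "a 0 1", "b 0 1", "a 1 2", "a 2 3", and `uniq_lines` contains only a and b: a line is printed exactly when the returned value is 0.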

In your case you want to keep track of what has appeared regardless of the order, so what I am doing is storing the pair under both possible index orders, seen[$1,$2] and seen[$2,$1].
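
One edge case worth checking: when both fields are equal (for example a line starting with A A), seen[$1,$2] and seen[$2,$1] are the same array element, so the one-liner with two indexes never prints such a line. If that matters for your data, a possible alternative, which is not part of the original answer, is to build a single order-independent key. The function name `dedup_pairs` and the test input below are assumptions made for this sketch.

```shell
# Hypothetical wrapper: keep the first line seen for each unordered pair ($1, $2).
dedup_pairs() {
    awk '{
        # Put the two fields in a fixed order (string comparison is forced
        # by concatenating ""), so "A B" and "B A" share one key.
        key = (($1 "") < ($2 "")) ? $1 SUBSEP $2 : $2 SUBSEP $1
    }
    !seen[key]++' "$@"
}

# Invented test input that includes a pair with equal fields.
pairs_input() {
    printf '%s\n' 'A B x' 'B A y' 'A A z' 'A A w' 'B D v'
}

out_key=$(pairs_input | dedup_pairs)
out_orig=$(pairs_input | awk '!seen[$1,$2]++ && !seen[$2,$1]++')

printf '%s\n' "$out_key"
```

With this input `out_key` keeps "A B x", "A A z" and "B D v", while `out_orig` keeps only "A B x" and "B D v". For the sample data in the question, which has no equal-field pairs, both approaches give the same result.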
