How to delete duplicated rows on one file based on a common field between two files with AWK?


Problem description


I have two files

  1. File 1 contains 3 fields

  2. File 2 contains 4 fields

File 1 has far fewer rows than File 2.

I would like to compare the two files on their first field and apply the following rule:

If the first field of any row in File 1 also appears as the first field of a row in File 2, do not print that row of File 2.

Any advice would be appreciated.

Input File 1

 S13109 3739 31082 
 S45002 3800 31873 
 S43722 3313 26638 

Input File 2

 S13109 3738 31081 0 
 S13109 3737 31080 0 
 S00033 3008 29985 0 
 S00033 3007 29984 0 
 S00022 4130 31838 0 
 S00022 4129 31837 0 
 S00188 3317 27372 0 
 S45002 3759 31832 0 
 S45002 3758 31831 0 
 S45002 3757 31830 0 
 S43722 3020 26345 0 
 S43722 3019 26344 0 
 S00371 3737 33636 0 
 S00371 3736 33635 0 

Desired Output

 S00033 3008 29985 0 
 S00033 3007 29984 0
 S00022 4130 31838 0 
 S00022 4129 31837 0 
 S00188 3317 27372 0
 S00371 3737 33636 0 
 S00371 3736 33635 0 

Solution

awk 'FNR==NR{a[$1]++;next}!a[$1]' file1 file2

How it works:

FNR==NR

When awk is given two (or more) input files, FNR resets to 1 on the first line of each new file, whereas NR keeps incrementing from where it left off. By checking FNR==NR we are therefore checking whether we are currently parsing the first file.
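To see the two counters side by side, here is a small sketch. The file names first.txt and second.txt and their contents are made up for illustration; they are not part of the question.

```shell
# Two tiny sample files (hypothetical names) in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\ne\n' > "$tmpdir/second.txt"

# For every record, print NR, FNR, and whether FNR==NR holds.
counters=$(awk '{ print NR, FNR, (FNR == NR ? "first" : "later") }' \
    "$tmpdir/first.txt" "$tmpdir/second.txt")
printf '%s\n' "$counters"

rm -rf "$tmpdir"
```

NR runs from 1 to 5 across both files, while FNR restarts at 1 on the second file, so the two are equal only while the first file is being read. Note that this test relies on the first file being non-empty; with an empty first file, FNR==NR would also hold for the second file.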

a[$1]++

If we are parsing the first file (see above), use the first field $1 as a key into the associative array a and post-increment its value by 1. This essentially lets us build a 'seen' list.
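As a rough illustration of what the 'seen' list holds, the sketch below builds the array from the question's file1 and dumps it in an END block. The END block and the sort are added here purely for inspection and are not part of the answer's one-liner.

```shell
# Recreate file1 from the question in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'S13109 3739 31082' 'S45002 3800 31873' 'S43722 3313 26638' \
    > "$tmpdir/file1"

# Count occurrences of field 1, then print each key with its count.
# The order of a for-in loop is unspecified in awk, so sort for stable output.
seen=$(awk '{ a[$1]++ } END { for (k in a) print k, a[k] }' "$tmpdir/file1" | sort)
printf '%s\n' "$seen"

rm -rf "$tmpdir"
```

Each key from file1 ends up with a count of 1 here, because no key repeats within file1. Only the fact that the count is non-zero matters for the later lookup.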

next

This statement tells awk to skip the remaining rules, read the next record, and start over. We do this because file1 is only meant to populate the associative array.

!a[$1]

This pattern is only evaluated when FNR==NR is false, i.e. we are not parsing file1 and thus must be parsing file2. We then use the first field $1 of file2 as the key to index into the 'seen' list created earlier. If the value is unset (which evaluates to 0), we did not see the key in file1 and therefore print the line. Conversely, if the value is non-zero, we did see it in file1 and thus do not print the line. Note that !a[$1] is equivalent to !a[$1]{print} because the default action, when none is given, is to print the entire line.
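Putting the pieces together, the following sketch runs the one-liner against the sample data from the question. The temporary directory is an assumption made so the example is self-contained. The second command is a variant not given in the answer: it tests key membership with `in` instead of reading the counter, which avoids creating empty array entries for file2 keys.

```shell
# Recreate the question's sample files in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'S13109 3739 31082' 'S45002 3800 31873' 'S43722 3313 26638' \
    > "$tmpdir/file1"
printf '%s\n' \
    'S13109 3738 31081 0' 'S13109 3737 31080 0' \
    'S00033 3008 29985 0' 'S00033 3007 29984 0' \
    'S00022 4130 31838 0' 'S00022 4129 31837 0' \
    'S00188 3317 27372 0' \
    'S45002 3759 31832 0' 'S45002 3758 31831 0' 'S45002 3757 31830 0' \
    'S43722 3020 26345 0' 'S43722 3019 26344 0' \
    'S00371 3737 33636 0' 'S00371 3736 33635 0' \
    > "$tmpdir/file2"

# The answer's command: remember file1 keys, then print unseen file2 rows.
result=$(awk 'FNR==NR { a[$1]++; next } !a[$1]' "$tmpdir/file1" "$tmpdir/file2")

# Variant (an alternative, not from the answer): membership test with "in".
variant=$(awk 'FNR==NR { a[$1]; next } !($1 in a)' "$tmpdir/file1" "$tmpdir/file2")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Both commands should print the seven rows listed under Desired Output. File order on the command line matters: file1 must come first so that its keys are loaded before file2 is filtered.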
