How to compare one file with a bunch of files in Linux


Problem description


I have a fileA as shown below:

file A

chr1   123 aa b c d
chr1   234 a  b c d
chr1   345 aa b c d
chr1   456 a  b c d
....

I also have a bunch of files with similar columns in a directory, dirB, against which I have to compare fileA.

To do this, I concatenated all the files in dirB with cat into a single file called fileB, and then compared the two files on key columns 1 and 2 as shown below:

awk 'FNR==NR{a[$1,$2]++;next}!a[$1,$2]' fileB fileA

This command uses columns 1 and 2 as the key and prints the rows of fileA whose key does not appear in fileB.
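To make the behaviour of that one-liner concrete, here is a small sketch that runs it in expanded form. The sample rows, the temporary directory, and the array name `seen` are made up for illustration and are not part of the original question.

```shell
#!/bin/sh
# Hypothetical sample data, made up for illustration only.
tmp=$(mktemp -d)
printf '%s\n' 'chr1 123 aa b c d' 'chr1 234 a b c d' > "$tmp/fileA"
printf '%s\n' 'chr1 123 aa b c d xxxx abcd' > "$tmp/fileB"

# Same logic as the one-liner above, spread over several lines.
only_in_a=$(awk '
    FNR == NR { seen[$1,$2]++; next }   # first file (fileB): count each key
    !seen[$1,$2]                        # second file (fileA): print rows with unseen keys
' "$tmp/fileB" "$tmp/fileA")

printf '%s\n' "$only_in_a"
rm -rf "$tmp"
```

Note that this form keeps every key of fileB in memory, which is the part that grows with the number of files.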

However, the issue is that fileB becomes too large to handle, in terms of both disk space and memory, when there are a large number of files.

Could someone suggest an alternative that skips the step of concatenating all the files into fileB? Instead, fileA would be compared directly with all the files in dirB.

chr1   123    aa    b    c    d    xxxx    abcd
chr1   234    a     b    c    d
chr1   345    aa    b    c    d    yyyy    defg
chr1   456    a    b    c    d

Solution

Perhaps something along these lines:

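 awk 'NR == FNR { a[$1,$2] = $0; next }      # first file: store each row under its key
                { delete a[$1,$2] }          # later files: drop keys that also occur there
            END { for (i in a) print a[i] }  # print rows whose keys were never deleted
 ' a.txt b1.txt b2.txt ...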

Starting with file A, store each row in an array under its key (columns 1 and 2). Then, for every B file, delete any array element with a matching key. Whatever remains at the end are the rows of A whose keys did not appear in any of the B files, so we can loop through the array and print them. Only the rows of A are held in memory, and no concatenated file is needed.
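As a rough end-to-end sketch of the approach described above, the script below builds a tiny fileA and a dirB with two files, then passes the directory contents to awk with a shell glob. The file names, sample rows, and the final `sort` are assumptions added for illustration; the sort is there only because awk does not guarantee any particular order for `for (i in a)`.

```shell
#!/bin/sh
# Hypothetical sample data; names and contents are made up for illustration.
tmp=$(mktemp -d)
mkdir "$tmp/dirB"
printf '%s\n' 'chr1 123 aa b c d' 'chr1 234 a b c d' \
              'chr1 345 aa b c d' 'chr1 456 a b c d' > "$tmp/fileA"
printf '%s\n' 'chr1 123 aa b c d xxxx abcd' > "$tmp/dirB/b1.txt"
printf '%s\n' 'chr1 345 aa b c d yyyy defg' > "$tmp/dirB/b2.txt"

# fileA must come first so that NR == FNR selects it;
# every file after it only removes keys from the array.
result=$(awk 'NR == FNR { a[$1,$2] = $0; next }
              { delete a[$1,$2] }
              END { for (i in a) print a[i] }
' "$tmp/fileA" "$tmp/dirB"/* | sort -k2,2n)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data, the rows for keys 234 and 456 remain, since neither key occurs in any file under dirB. One caveat: if fileA is empty, `NR == FNR` also holds for the first B file, so that case would need separate handling.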
