How to compare one file with a bunch of files in Linux
Problem description
I have a fileA as shown below:
file A
chr1 123 aa b c d
chr1 234 a b c d
chr1 345 aa b c d
chr1 456 a b c d
....
I also have a bunch of similar files, with similar columns, in a directory dirB, and I have to compare file A against them.
To do this I concatenated all the files in dirB using cat into a single file called fileB and then compared both the files based on key columns 1 and 2 as shown below:
awk 'FNR==NR{a[$1,$2]++;next}!a[$1,$2]' fileB fileA
This command uses columns 1 and 2 as the key and prints the rows whose key appears only in fileA.
However, the issue is that when there are a large number of files, fileB becomes too huge to handle in terms of disk space and memory.
Could someone suggest an alternative that skips the step of concatenating all the files to create fileB? Instead, fileA would be compared directly with all the files in dirB.
chr1 123 aa b c d xxxx abcd
chr1 234 a b c d
chr1 345 aa b c d yyyy defg
chr1 456 a b c d
Perhaps something along these lines:
awk 'NR == FNR { a[$1,$2] = $0; next }
{ delete a[$1, $2] }
END { for (i in a) print a[i] }
' a.txt b1.txt b2.txt ...
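Starting with file A, store each row in an array, indexed by its key (columns 1 and 2). Then, for every B file, delete any array element whose key matches. Whatever remains at the end are the rows of A whose keys were not in any of the B files, so we can just loop through the array and print them. Only file A's rows are held in memory, no matter how many B files there are. Note that `for (i in a)` visits elements in an unspecified order, so the output may not follow the order of file A.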
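If the output needs to keep the row order of file A, one possible variation is to record the keys in the order they are first seen and walk that list in the END block. The following is only a sketch under some assumptions: the sample data, the temporary directory, and the names `row`, `order`, `b1.txt` and `b2.txt` are made up for illustration, file A is assumed to be non-empty (otherwise `NR == FNR` would also match the first B file), and, as in the command above, only the last row is kept when a key repeats in file A.

```shell
#!/bin/sh
# Hypothetical sample data, created in a temporary directory for illustration only.
tmp=$(mktemp -d)
mkdir "$tmp/dirB"
printf '%s\n' 'chr1 123 aa b c d' 'chr1 234 a b c d' \
              'chr1 345 aa b c d' 'chr1 456 a b c d' > "$tmp/fileA"
printf '%s\n' 'chr1 123 aa b c d xxxx abcd' > "$tmp/dirB/b1.txt"
printf '%s\n' 'chr1 345 aa b c d yyyy defg' > "$tmp/dirB/b2.txt"

# First file (fileA): remember each row by key, plus the order keys first appear in.
# Remaining files (everything in dirB): delete any key that shows up.
# END: print the surviving rows in fileA order.
result=$(awk '
    NR == FNR {
        key = $1 SUBSEP $2              # same index that row[$1, $2] would use
        if (!(key in row)) order[++n] = key
        row[key] = $0
        next
    }
    { delete row[$1, $2] }
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in row) print row[order[i]]
    }
' "$tmp/fileA" "$tmp/dirB"/*)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data the script prints the rows for keys `chr1 234` and `chr1 456`, in that order. Memory use still depends only on the size of file A, since the B files are streamed and never stored.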