Find common elements in >2 files

Problem description

I have three files, as shown below.

file1.txt

"aba" 0 0 
"aba" 0 0 1
"abc" 0 1
"abd" 1 1 
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1

file3.txt

"xyx" 0 0
"aba" 0 0 
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1

I want to find the rows that all three files have in common, based on the first two columns. To find common rows in two files, I have used something like:

awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

But how can we find the common rows across all the files when there are more than two input files? Can anyone help?

With the current awk solution, rows with duplicate key columns are dropped, and the output is:

"xxx" 0 0

If we assume the output comes from file1.txt, the expected output is:

"aba" 0 0 
"aba" 0 0 1
"xxx" 0 0 

That is, rows with duplicate key columns should be included as well.

Recommended answer

Try the following solution, generalized for N files. It saves the keys of the first file in a hash with a value of 1, and for each hit in the subsequent files it increments that value. At the end, I compare each key's value with the number of files processed and print only the keys that match.

awk '
    FNR == NR { arr[$1,$2] = 1; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}.txt

It produces:

"xxx" 0
"aba" 0


EDIT: added a version that prints the whole line (see comments). I added a second array, indexed by the same key, in which I save the line, and I use it in the printf call. I've left the old code commented out.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}.txt


NEW EDIT (see comments): added a version that handles multiple lines with the same key. Basically, I join all entries instead of saving only one, replacing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. When printing, I do the reverse: I split on the separator (the SUBSEP variable) and print each entry.

awk '
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    FNR == 1 { delete found }
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC -1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            # split() returns the number of entries, which is more portable than length( array )
            n = split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= n; i++ ) {
                printf "%s\n", line_arr[ i ]
            }
        } 
    }
' file{1..3}.txt

With the new data edited into the question, it yields:

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1
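
Note that awk does not specify the traversal order of for ( key in arr ), so the rows above may come out in a different order than they appear in file1.txt. If input order matters, one possible alternative (a sketch, not part of the original answer) is to read the file whose rows you want to print last, and count key hits in the other files first. The function name common_rows is hypothetical, and the sketch assumes that no input file is empty and that no awk variable assignments are mixed into the file arguments, since it relies on FNR == 1 and ARGC to count files.

```shell
# Hypothetical helper: print the rows of the LAST file whose first two
# columns also occur in every preceding file. Assumes no input file is empty.
common_rows() {
    awk '
        FNR == 1 { file_no++ }                  # a new input file starts
        file_no < ARGC - 1 {                    # any file except the last
            if ( seen[$1,$2] != file_no ) {     # count each key once per file
                seen[$1,$2] = file_no
                hits[$1,$2]++
            }
            next
        }
        hits[$1,$2] == ARGC - 2                 # last file: key found in all others
    ' "$@"
}

# Sample data from the question (trailing spaces omitted)
printf '%s\n' '"aba" 0 0' '"aba" 0 0 1' '"abc" 0 1' '"abd" 1 1' '"xxx" 0 0' > file1.txt
printf '%s\n' '"xyz" 0 0' '"aba" 0 0 0 0' '"aba" 0 0 0 1' '"xxx" 0 0' '"abc" 1 1' > file2.txt
printf '%s\n' '"xyx" 0 0' '"aba" 0 0' '"aba" 0 1 0' '"xxx" 0 0 0 1' '"abc" 1 1' > file3.txt

# file1.txt goes last because its rows are the ones to print
common_rows file2.txt file3.txt file1.txt
```

With the sample data this prints the two "aba" 0 rows and the "xxx" 0 row in the same order as in file1.txt. Because each key is counted at most once per file, duplicate keys within a file do not inflate the count.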
