Find common elements in >2 files

Problem description

I have three files as shown below

file1.txt

"aba" 0 0 
"aba" 0 0 1
"abc" 0 1
"abd" 1 1 
"xxx" 0 0

file2.txt

"xyz" 0 0
"aba" 0 0 0 0
"aba" 0 0 0 1
"xxx" 0 0
"abc" 1 1

file3.txt

"xyx" 0 0
"aba" 0 0 
"aba" 0 1 0
"xxx" 0 0 0 1
"abc" 1 1

I want to find the elements common to all three files, based on the first two columns. To find common elements in two files I have used something like

awk 'FNR==NR{a[$1,$2]++;next}a[$1,$2]' file1.txt file2.txt 

But how can we find the common elements across all the files when there are more than two input files? Can anyone help?

With the current awk solution, rows with duplicate key columns are ignored and the output is

"xxx" 0 0

If we assume the output comes from file1.txt, the expected output is:

"aba" 0 0 
"aba" 0 0 1
"xxx" 0 0 

i.e., it should also return the rows with duplicate key columns.

Solution

Try the following solution, generalized for N files. It saves the keys of the first file in a hash with a value of 1, and for each hit in the following files that value is incremented. At the end I check whether the value of each key equals the number of files processed and print only the keys that match.

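awk '
    # First file: remember each key (columns 1 and 2) with a count of 1.
    FNR == NR { arr[$1,$2] = 1; next }
    # Later files: increment the count of keys already seen in the first file.
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            # Keep only keys whose count equals the number of files (ARGC - 1).
            if ( arr[key] != ARGC - 1 ) { continue }
            split( key, key_arr, SUBSEP )
            printf "%s %s\n", key_arr[1], key_arr[2] 
        } 
    }
' file{1..3}.txt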

It yields:

"xxx" 0
"aba" 0

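One caveat with counting every hit: when a key occurs more than once in one of the later files, its counter ends up larger than the number of files, so the != test discards it. A minimal sketch with made-up data (the count_hits function name and the a.txt/b.txt files are only for illustration) that prints the counter for each key:

```shell
# Hypothetical helper: same counting rules as above, but prints every
# key together with its final counter instead of filtering.
count_hits() {
    awk '
        FNR == NR { arr[$1,$2] = 1; next }
        { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
        END {
            for ( key in arr ) {
                split( key, key_arr, SUBSEP )
                print key_arr[1], key_arr[2], arr[key]
            }
        }
    ' "$@"
}

# Made-up data: the key ("aba", 0) appears twice in the second file.
dir=$(mktemp -d)
printf '%s\n' '"aba" 0 0' '"xxx" 0 0' > "$dir/a.txt"
printf '%s\n' '"aba" 0 1' '"aba" 0 2' '"xxx" 0 0' > "$dir/b.txt"

# sort only makes the output order predictable.
count_hits "$dir/a.txt" "$dir/b.txt" | sort
# "aba" 0 3
# "xxx" 0 2
```

With two input files the counter for ("aba", 0) is 3 rather than 2, so counting a key at most once per file is the safer rule when keys can repeat.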

EDIT to add a version that prints the whole line (see comments). I've added another array with the same key where I save the line, and I use it in the printf call. I've left the old code commented out.

awk '
    ##FNR == NR { arr[$1,$2] = 1; next }
    # First file: also keep the whole line for each key.
    FNR == NR { arr[$1,$2] = 1; line[$1,$2] = $0; next }
    { if ( arr[$1,$2] ) { arr[$1,$2]++ } }
    END { 
        for ( key in arr ) {
            if ( arr[key] != ARGC - 1 ) { continue }
            ##split( key, key_arr, SUBSEP )
            ##printf "%s %s\n", key_arr[1], key_arr[2] 
            printf "%s\n", line[ key ] 
        } 
    }
' file{1..3}.txt


NEW EDIT (see comments) to add a version that handles multiple lines with the same key. Basically I join all entries instead of saving only one, replacing line[$1,$2] = $0 with line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0. When printing, I do the reverse: split on the separator (the SUBSEP variable) and print each entry. The found array, cleared at the first line of each later file, makes sure a key is counted at most once per file.

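awk '
    # First file: count each key once and join all of its lines with SUBSEP.
    FNR == NR { 
        arr[$1,$2] = 1
        line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
        next
    }
    # Start of each later file: forget which keys were already counted.
    FNR == 1 { delete found }
    # Count a key at most once per file.
    { if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
    END { 
        num_files = ARGC - 1 
        for ( key in arr ) {
            if ( arr[key] < num_files ) { continue }
            # split() returns the number of pieces, so length() of an array is not needed.
            n = split( line[ key ], line_arr, SUBSEP )
            for ( i = 1; i <= n; i++ ) { 
                printf "%s\n", line_arr[ i ]
            } 
        } 
    }
' file{1..3}.txt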

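With the updated data from the question, it yields (keys come out in whatever order for ( key in arr ) visits them, which awk does not guarantee):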

"xxx" 0 0
"aba" 0 0 
"aba" 0 0 1

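If the output should keep the lines in the order they have in file1.txt, as in the expected output of the question, one possible variation is to pass that file last and print its lines directly instead of collecting them in an array. The following is only a sketch under that assumption: the function name common_rows is hypothetical, and it relies on each file name appearing only once on the command line.

```shell
# Hypothetical helper: prints the lines of the LAST file whose first two
# columns also occur in every other file given before it.
common_rows() {
    awk '
        # Every file except the last: count each key once per file.
        FILENAME != ARGV[ARGC - 1] {
            if ( ! seen[FILENAME, $1, $2]++ ) { count[$1, $2]++ }
            next
        }
        # Last file: print the line if its key was found in all other files.
        count[$1, $2] == ARGC - 2
    ' "$@"
}

# Usage with the files from the question (file1.txt goes last):
# common_rows file2.txt file3.txt file1.txt
```

Because the lines are printed while the last file is being read, they come out in that file's own order and rows with duplicate keys are kept without the SUBSEP join. For the data in the question this should give the two "aba" 0 rows followed by the "xxx" 0 row.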