How to find common rows in multiple files using awk


Problem description

I have tab delimited text files in which common rows between them are to be found based on columns 1 and 2 as key columns. Sample files:

file1.txt 

aba 0 0 
aba 0 0 1
abc 0 1
abd 1 1 
xxx 0 0

file2.txt

xyz 0 0
aba 0 0 0 0
aba 0 0 0 1
xxx 0 0
abc 1 1

file3.txt

xyx 0 0
aba 0 0 
aba 0 1 0
xxx 0 0 0 1
abc 1 1

The below code does the same and returns the rows only if the key column is found in all the N files (3 files in this case).

awk '
FNR == NR { 
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] ( line[$1,$2] ? SUBSEP : "" ) $0
    next
}
FNR == 1 { delete found }
{ if ( arr[$1,$2] && ! found[$1,$2] ) { arr[$1,$2]++; found[$1,$2] = 1 } }
END { 
    num_files = ARGC -1 
    for ( key in arr ) {
        if ( arr[key] < num_files ) { continue }
        split( line[ key ], line_arr, SUBSEP )
        for ( i = 1; i <= length( line_arr ); i++ ) { 
            printf "%s\n", line_arr[ i ]
        } 
    } 
}
 ' *.txt  > commoninall.txt

Output:

 xxx 0 0
 aba 0 0 
 aba 0 0 1

However, now I would like to get the output if 'x' files have the key columns. For example x=2 i.e. rows which are common in two files based on key columns 1 and 2. The output in this case would be:

xyz 0 0
abc 1 1

In real scenario I do have to specify different values for x. Can anybody suggest an edit to this or a new solution.

Recommended answer

First attempt

I think you just need to modify the END block a little, and the command invocation:

awk -v num_files="${x:-0}" '
…
…script as before…
…
END { 
    if (num_files == 0) num_files = ARGC - 1
    for (key in arr) {
        if (arr[key] == num_files) {
            n = split(line[key], line_arr, SUBSEP)  # split() returns the element count
            for (i = 1; i <= n; i++) {
                printf "%s\n", line_arr[i]
            }
        }
    }
}
'

Basically, this takes a command line parameter based on $x, defaulting to 0, and assigning it to the awk variable num_files. In the END block, the code checks for num_files being zero, and resets it to the number of files passed on the command line. (Interestingly, the value in ARGC discounts any -v var=value options and either a command line script or -f script.awk, so the ARGC-1 term remains correct. The array ARGV contains awk (or whatever name you invoked it with) in ARGV[0] and the files to be processed in ARGV[1] through ARGV[ARGC-1].) The loop then checks for the required number of matches and prints as before. You can change == to >= if you want the 'or more' option.
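The claim about ARGC can be checked with a minimal sketch. The helper name count_file_args is hypothetical and used only for illustration; because the awk program has only a BEGIN block, it never opens its operands, so the named files do not need to exist.

```shell
#!/bin/bash
# Hypothetical helper: report how many file operands awk sees.
# The -v assignment and the program text are not counted in ARGC;
# ARGV[0] holds the command name, so the operand count is ARGC - 1.
count_file_args() {
    awk -v num_files=0 'BEGIN { print ARGC - 1 }' "$@"
}

count_file_args file1.txt file2.txt file3.txt   # prints 3
```

Since the count excludes the -v option, the END block can safely fall back to ARGC - 1 when num_files is left at 0.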

I noted in a comment:

I'm not clear what you are asking. I took it that your code was working for the example with three files and producing the right answer. I simply suggested how to modify the working code to handle N files and at least M of them sharing an entry. I have just realized, while typing this, that there is a bit more work to do. An entry could be missing from the first file but present in the others and will need to be processed, therefore. It is easy to report all occurrences in every file, or the first occurrence in any file. It is harder to report all occurrences only in the first file with a key.

The response was:

It is perfectly fine to report first occurrence in any file and need not be only from the first file. However, the issue with the suggested modification is, it is producing the same output for different values of x.

That's curious: I was able to get sane output from the amended code with different values for the number of files where the key must appear. I used this shell script. The code in the awk program up to the END block is the same as in the question; the only change is in the END processing block.

#!/bin/bash

while getopts n: opt
do
    case "$opt" in
    (n) num_files=$OPTARG;;
    (*) echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
        exit 1;;
    esac
done

shift $(($OPTIND - 1))

awk -v num_files="${num_files:-$#}" '
FNR == NR { 
    arr[$1,$2] = 1
    line[$1,$2] = line[$1,$2] (line[$1,$2] ? SUBSEP : "") $0
    next
}
FNR == 1 { delete found }
{ if (arr[$1,$2] && ! found[$1,$2]) { arr[$1,$2]++; found[$1,$2] = 1 } }
END { 
    if (num_files == 0) num_files = ARGC - 1
    for (key in arr) {
        if (arr[key] == num_files) {
            n = split(line[key], line_arr, SUBSEP)  # split() returns the element count
            for (i = 1; i <= n; i++) {
                printf "%s\n", line_arr[i]
            }
        }
    }
}
' "$@"

Sample runs (data files from question):

$ bash common.sh file?.txt
xxx 0 0
aba 0 0 
aba 0 0 1
$ bash common.sh -n 3 file?.txt
xxx 0 0
aba 0 0 
aba 0 0 1
$ bash common.sh -n 2 file?.txt
$ bash common.sh -n 1 file?.txt
abc 0 1
abd 1 1 
$

That shows different answers depending on the value specified via -n. Note that this only shows lines that appear in the first file and appear in exactly N files in total. The only key that appears in two files (abc/1) does not appear in the first file, so it is not listed by this code which stops paying attention to new keys after the first file is processed.

However, here's a rewrite, using some of the same ideas, but working more thoroughly.

#!/bin/bash
# SO 30428099

# Given that the key for a line is the first two columns, this script
# lists all appearances in all files of a given key if that key appears
# in N different files (where N defaults to the number of files). For
# the benefit of debugging, it includes the file name and line number
# with each line.

usage()
{
    echo "Usage: $(basename "$0" .sh) [-n number] file [...]" >&2
    exit 1
}

while getopts n: opt
do
    case "$opt" in
    (n) num_files=$OPTARG;;
    (*) usage;;
    esac
done

shift $(($OPTIND - 1))

if [ "$#" = 0 ]
then usage
fi

# Record count of each key, regardless of file: keys
# Record count of each key in each file: key_file
# Count of different files containing each key: files
# Accumulate line number, filename, line for each key: lines

awk -v num_files="${num_files:-$#}" '
{ 
    keys[$1,$2]++;
    if (++key_file[$1,$2,FILENAME] == 1)
        files[$1,$2]++
    #printf "%s:%d: Key (%s,%s); keys = %d; key_file = %d; files = %d\n",
    #        FILENAME, FNR, $1, $2, keys[$1,$2], key_file[$1,$2,FILENAME], files[$1,$2];
    sep = lines[$1,$2] ? RS : ""
    #printf "B: [[\n%s\n]]\n", lines[$1,$2]
    lines[$1,$2] = lines[$1,$2] sep FILENAME OFS FNR OFS $0
    #printf "A: [[\n%s\n]]\n", lines[$1,$2]
}
END {
    #print "END"
    for (key in files)
    {
        #print "Key =", key, "; files =", files[key]
        if (files[key] == num_files)
        {
            #printf "TAG\n%s\nEND\n", lines[key]
            print lines[key]
        }
    }
}
' "$@"

Sample output (given the data files from the question):

$ bash common.sh file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0 
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0 
file3.txt 3 aba 0 1 0
$ bash common.sh -n 2 file?.txt
file2.txt 5 abc 1 1
file3.txt 5 abc 1 1
$ bash common.sh -n 1 file?.txt
file1.txt 3 abc 0 1
file3.txt 1 xyx 0 0
file1.txt 4 abd 1 1 
file2.txt 1 xyz 0 0
$ bash common.sh -n 3 file?.txt
file1.txt 5 xxx 0 0
file2.txt 4 xxx 0 0
file3.txt 4 xxx 0 0 0 1
file1.txt 1 aba 0 0 
file1.txt 2 aba 0 0 1
file2.txt 2 aba 0 0 0 0
file2.txt 3 aba 0 0 0 1
file3.txt 2 aba 0 0 
file3.txt 3 aba 0 1 0
$ bash common.sh -n 4 file?.txt
$

You can tweak this to give the output you want (probably without the file name and line number). If you only want the lines from the first file containing a given key, add the information to lines only when files[$1,$2] == 1. You can separate the recorded information with SUBSEP instead of RS and OFS if you prefer.
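As a rough sketch of the "first file containing a given key" variant, assuming the same two-column key and dropping the file name and line number from the output: the function name lines_from_first_file and its argument order are hypothetical, chosen only for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: $1 is the required number of files, the remaining
# arguments are the files. For each key (columns 1 and 2) found in exactly
# that many files, print its lines from the first file that contains it.
lines_from_first_file() {
    n=$1
    shift
    awk -v num_files="$n" '
    {
        # Count each key once per file
        if (++key_file[$1,$2,FILENAME] == 1)
            files[$1,$2]++
        # files[key] stays at 1 only while reading the first file with the key
        if (files[$1,$2] == 1) {
            sep = lines[$1,$2] ? RS : ""
            lines[$1,$2] = lines[$1,$2] sep $0
        }
    }
    END {
        for (key in files)
            if (files[key] == num_files)
                print lines[key]
    }' "$@"
}

# Usage (with the data files from the question):
#   lines_from_first_file 2 file1.txt file2.txt file3.txt
```

Changing == num_files to >= num_files in the END block would give the "or more" behaviour. The order in which keys are printed depends on awk's array traversal, so pipe the output through sort if a stable order matters.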
