How to compare and merge multiple files?


Question

Reference file

chr1    288598  288656
chr1    779518  779576
chr2    2569592 2569660
chr3    5018399 5018464
chr4    5182842 5182882

File 1

chr1    288598  288656 12
chr1    779518  779576 14
chr2    2569592 2569660 26
chr3    5018399 5018464 27
chr4    5182842 5182882 37

File 2

chr1    288598  288656 35
chr2    2569592 2569660 348
chr3    5018399 5018464 4326
chr4    5182842 5182882 68

Besides the reference file, I have six similar files.

The first three fields of each file match those of the reference file. I would like to take only the 4th column from each of the six files and append it to the reference file to produce a new output with the same rows as the reference file. Where a reference row has no match in a file, put a zero.

Desired output

chr1    288598  288656 23 35 57 68 769 68
chr1    779518  779576 23 0 57 68 768 0
chr2    2569592 2569660 23 35 0 68 79 0
chr3    5018399 5018464 0 36 0 68 769 0
chr4    5182842 5182882 23 0 0 0 0 0

Note: the reference file has about 2000 lines, and the other files are not always the same length (about 500, 400, 200, 100, etc.). That is why zeros need to be added.

I tried the answer to this question:

paste ref.file file1 file2 file3 file4 file5 file6 |  awk '{OFS="\t";print $1,$2,$3,$7,$11,$15,$19,$23,$27}' > final.common.out

but it does not seem to work: some values are missing. I also don't understand how to add a zero where there is no match.

Recommended answer

I think something like this should do what you want. We read the 'reference' file into a hash, turning each line into a key (the first three fields) that maps to an empty array.

Then we iterate over the other files, using the first three fields of each line as the key and the last field as the value.

Finally we compare the two, appending either the matching value or a zero to each key's array in the 'reference' hash. One caveat: any lines that are not in your reference file, and any duplicate lines, will simply disappear.

#!/usr/bin/perl

use strict;
use warnings;
use autodie;

# Read the 'reference file' into a hash, remembering the original row order.
my %ref;
my @order;
open( my $ref_fh, "<", "reference_file" );
while (<$ref_fh>) {
    my ( $first, $second, $third ) = split;
    next unless defined $third;    # skip blank or short lines

    # Turn the first three fields into a space-delimited key.
    my $key = "$first $second $third";
    next if exists $ref{$key};     # duplicate reference rows are dropped
    push @order, $key;
    $ref{$key} = [];
}
close($ref_fh);

# Process each of the other files in turn.
my @files = qw( file1 file2 file3 file4 file5 file6 );
foreach my $input (@files) {
    open( my $input_fh, "<", $input );
    my %current;
    while (<$input_fh>) {

        # Line by line, use the first three fields as the key
        # and store the fourth field as the value.
        my ( $first, $second, $third, $value ) = split;
        next unless defined $value;
        $current{"$first $second $third"} = $value;
    }
    close($input_fh);

    # For every reference row, append the matching value, or zero if
    # this file has no such row.
    foreach my $key (@order) {
        push @{ $ref{$key} }, exists $current{$key} ? $current{$key} : 0;
    }
}

# Print the rows in reference-file order, one per line.
foreach my $key (@order) {
    print join( " ", $key, @{ $ref{$key} } ), "\n";
}
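
Since the question started from paste and awk, the same lookup-by-key idea can also be sketched in awk. The paste attempt loses values because paste joins files by line position, not by the first three fields, so rows drift out of alignment as soon as one file is missing a row. The following is only a sketch, not part of the original answer. The sample data is a shortened, made-up subset of the question's files, and it assumes whitespace-separated fields and that every input file contains at least one line (an empty file would not get its own column).

```shell
#!/bin/sh
# Demo data in a scratch directory; file names follow the question,
# the contents are a shortened illustrative subset.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'chr1 288598 288656\nchr1 779518 779576\nchr2 2569592 2569660\n' > ref.file
printf 'chr1 288598 288656 12\nchr1 779518 779576 14\n' > file1
printf 'chr1 288598 288656 35\nchr2 2569592 2569660 348\n' > file2

merged=$(awk '
    # First line of each input file: advance the file counter.
    FNR == 1 { nfile++ }

    # File 1 is the reference: remember each row and its key, in order.
    nfile == 1 {
        if (NF >= 3) {
            n++
            keys[n] = $1 SUBSEP $2 SUBSEP $3
            row[n]  = $1 "\t" $2 "\t" $3
        }
        next
    }

    # Other files: store column 4 under (key, file index).
    NF >= 4 { val[$1 SUBSEP $2 SUBSEP $3, nfile] = $4 }

    END {
        # Emit reference rows in order, one column per data file,
        # with 0 where that file has no matching row.
        for (i = 1; i <= n; i++) {
            line = row[i]
            for (f = 2; f <= nfile; f++) {
                k = keys[i] SUBSEP f
                line = line "\t" ((k in val) ? val[k] : 0)
            }
            print line
        }
    }
' ref.file file1 file2)

printf '%s\n' "$merged"
```

With this sample data the row chr1 779518 779576 gets the values 14 and 0, because file2 has no such row. For the real data, the file list at the end of the awk command would be ref.file followed by file1 through file6.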
