Merge files based on a mapping in another file

Question

I have written a script in Perl that merges files based on a mapping in a third file; the reason I am not using join is because lines won't always match. The code works, but gives an error that doesn't appear to affect output: Use of uninitialized value in join or string at join.pl line 43, <$fh> line 21. As I am relatively new to Perl I have been unable to understand what is causing this error. Any help resolving this error or advice about my code would be greatly appreciated. I have provided example input and output below.

join.pl

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use Tie::File;
use Scalar::Util qw(looks_like_number);

chomp( my $infile  = $ARGV[0] );
chomp( my $infile1 = $ARGV[1] );
chomp( my $infile2 = $ARGV[2] );
chomp( my $outfile = $ARGV[3] );

open my $mapfile,   '<', $infile  or die "Could not open $infile: $!";
open my $file1,   '<', $infile1  or die "Could not open $infile1: $!";
open my $file2,   '<', $infile2  or die "Could not open $infile2: $!";
tie my @tieFile1, 'Tie::File', $infile1 or die "Could not open $infile1: $!";
tie my @tieFile2, 'Tie::File', $infile2 or die "Could not open $infile2: $!";
open my $output, '>', $outfile or die "Could not open $outfile: $!";

my %map1;
my %map2;
# This loop will read two input files and populate two hashes
# using the coordinates (field 2) and the current line number
while ( my $line1 = <$file1>, my $line2 = <$file2> ) {
    my @row1 = split( "\t", $line1 );
    my @row2 = split( "\t", $line2 );
    # $. holds the line number
    $map1{$row1[1]} = $.;
    $map2{$row2[1]} = $.;
}
close($file1);
close($file2);

while ( my $line = <$mapfile> ) {
    chomp $line;
    my @row = split( "\t", $line );
    my $species1 = $row[1];
    my $reference1 = $map1{$species1};
    my $species2 = $row[3];
    my $reference2 = $map2{$species2};
    my @nomatch  = ("NA", "", "NA", "", "", "", "", "NA", "NA");
    # test numeric
    if ( looks_like_number($reference1) && looks_like_number($reference2) ) {
        # do the do using the maps
        print $output join("\t", $tieFile1[$reference1], $tieFile2[$reference2]), "\n";
    }
    elsif ( looks_like_number($reference1) )
    {
        print $output join("\t", $tieFile1[$reference1], @nomatch), "\n";
    }
    elsif ( looks_like_number($reference2) )
    {
        print $output join("\t", @nomatch, $tieFile2[$reference2]), "\n";
    }
}
close($output);
untie @tieFile1;
untie @tieFile2;

input_1:

Scf_3L  12798910    T   0   41  0   0   NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA

input_2:

3L  12559896    T   0   31  0   0   NA  NA
3L  12559897    C   0   0   33  0   NA  NA
3L  12559898    A   34  0   0   0   NA  NA
3L  12559899    G   0   0   0   33  NA  NA
3L  12559900    T   0   34  0   0   NA  NA
3L  12559901    G   0   0   0   33  NA  NA
3L  12559902    T   0   33  0   0   NA  NA
3L  12559903    A   33  0   0   0   NA  NA
3L  12559904    G   0   0   0   33  NA  NA
3L  12559905    T   0   34  0   0   NA  NA
3L  12559906    T   0   33  0   0   NA  NA

map:

3L  12798910    T   12559896    T
3L  12798911    C   12559897    C
3L  12798912    A   12559898    A
3L  12798913    G   12559899    G
3L  12798914    T   12559900    T
3L  12798915    G   12559901    G
3L  12798916    T   12559902    T
3L  12798917    A   12559903    A
3L  12798918    G   12559904    G
3L  12798919    T   12559905    T
3L  12798920    T   12559906    T

output:

Scf_3L  12798910    T   0   41  0   0   NA  NA    3L    12559896    T   0   31  0   0   NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA    3L    12559897    C   0   0   33  0   NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA    3L    12559898    A   34  0   0   0   NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA    3L    12559899    G   0   0   0   33  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA    3L    12559900    T   0   34  0   0   NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA    3L    12559901    G   0   0   0   33  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA    3L    12559902    T   0   33  0   0   NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA    3L    12559903    A   33  0   0   0   NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA    3L    12559904    G   0   0   0   33  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA    3L    12559905    T   0   34  0   0   NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA    3L    12559906    T   0   33  0   0   NA  NA

Recommended answer

The immediate problem is that the indices of the tied arrays start at zero, while the line numbers in $. start at 1. That means you need to subtract one from $. or from the $reference variables before using them as indices. So your resulting data was never correct in the first place, and you may have overlooked that if it weren't for the warning!
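
To see the off-by-one in isolation, here is a minimal sketch. It uses an in-memory filehandle and a plain array in place of the tied array; the sample records and the names `%by_line_number`, `%by_index`, `lookup_wrong` and `lookup_right` are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Three tab-separated records, read through an in-memory filehandle.
my $data  = "chr\t100\tA\nchr\t101\tC\nchr\t102\tG\n";
my @lines = split /\n/, $data;    # stands in for the tied array (0-based)

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my ( %by_line_number, %by_index );
while ( my $line = <$fh> ) {
    my $coord = ( split /\t/, $line )[1];
    $by_line_number{$coord} = $.;        # 1-based: 1, 2, 3
    $by_index{$coord}       = $. - 1;    # 0-based: 0, 1, 2
}
close $fh;

# Using the raw line number as an index returns the *next* record,
# and runs off the end of the array for the last one.
sub lookup_wrong { my ($coord) = @_; return $lines[ $by_line_number{$coord} ] }

# Subtracting one gives the record that actually holds the coordinate.
sub lookup_right { my ($coord) = @_; return $lines[ $by_index{$coord} ] }
```

With these definitions, `lookup_wrong(102)` is undef, which is the kind of value that triggers "Use of uninitialized value in join or string" when it is passed to `join`.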

I fixed that and also tidied up your code a little. Mainly, I added use autodie so that there's no need to check the status of IO operations (except for Tie::File), changed to list assignments, built the maps with map instead of a separate read loop, and wrapped the code in a block so that the lexical file handles are closed automatically.

I also used the tied arrays to build the %map hashes instead of opening the files separately, which means their values are already zero-based, as they need to be.
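
As a small illustration of that idea, the sketch below builds a coordinate-to-index hash directly from an array, so the stored values are array indices from the start. A plain array with three sample records stands in for the tied array, and the names `@records` and `%index_of` are hypothetical.

```perl
use strict;
use warnings;

# A plain array standing in for a Tie::File array: one record per element.
my @records = (
    "Scf_3L\t12798910\tT",
    "Scf_3L\t12798911\tC",
    "Scf_3L\t12798912\tA",
);

# Map field 2 (the coordinate) of each record to that record's array index.
# The limit of 3 stops split from breaking up fields that are never used.
my %index_of = map { ( split /\t/, $records[$_], 3 )[1] => $_ } 0 .. $#records;
```

Because the values come from `0 .. $#records`, they can be used as subscripts without any adjustment, for example `$records[ $index_of{12798911} ]`.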

Oh, and I removed looks_like_number, because the $reference variables must be either numeric or undef: that's all we put into the hash. The correct way to check that a value isn't undef is with the defined operator.
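
One reason to test with defined rather than plain truthiness is that index 0 is a valid but false value. The sketch below shows this with hypothetical sample data and a helper named `fields_for`, which is not part of the original code; it is meant to be called in list context.

```perl
use strict;
use warnings;

my @records  = ( "3L\t12559896\tT", "3L\t12559897\tC" );
my %index_of = ( 12559896 => 0, 12559897 => 1 );
my @nomatch  = ( "NA", "", "NA" );

# Return the matching record, or the placeholder fields when the
# coordinate is not in the hash. Call this in list context.
sub fields_for {
    my ($coord) = @_;
    my $index = $index_of{$coord};

    # "defined" accepts index 0; a bare truth test would reject it.
    return defined $index ? $records[$index] : @nomatch;
}
```

Here `join("\t", fields_for(12559896))` yields the first record even though its index is 0, while an unknown coordinate yields the placeholder fields.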

#!/usr/bin/perl

use strict;
use warnings 'all';
use autodie;

use Fcntl 'O_RDONLY';
use Tie::File;

my ( $mapfile, $infile1, $infile2, $outfile ) = @ARGV;

{
    tie my @file1, 'Tie::File' => $infile1, mode => O_RDONLY
        or die "Could not open $infile1: $!";

    tie my @file2, 'Tie::File' => $infile2, mode => O_RDONLY
        or die "Could not open $infile2: $!";

    my %map1 = map { (split /\t/, $file1[$_], 3)[1] => $_ } 0 .. $#file1;
    my %map2 = map { (split /\t/, $file2[$_], 3)[1] => $_ } 0 .. $#file2;

    open my $map_fh, '<', $mapfile;

    open my $out_fh, '>', $outfile;

    while ( <$map_fh> ) {
        chomp;
        my @row = split /\t/;

        my ( $species1, $species2 ) = @row[1,3];
        my $reference1 = $map1{$species1};
        my $reference2 = $map2{$species2};

        my @nomatch    = ( "NA", "", "NA", "", "", "", "", "NA", "NA" );

        my @fields = (
            ( defined $reference1 ? $file1[$reference1] : @nomatch),
            ( defined $reference2 ? $file2[$reference2] : @nomatch),
        );

        print $out_fh join( "\t", @fields ), "\n";
    }
}

output

Scf_3L  12798910    T   0   41  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798911    C   0   0   43  0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798912    A   42  0   0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798913    G   0   0   0   44  NA  NA  NA      NA                  NA  NA
Scf_3L  12798914    T   0   42  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798915    G   0   0   0   44  NA  NA  NA      NA                  NA  NA
Scf_3L  12798916    T   0   42  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798917    A   41  0   0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798918    G   0   0   0   43  NA  NA  NA      NA                  NA  NA
Scf_3L  12798919    T   0   43  0   0   NA  NA  NA      NA                  NA  NA
Scf_3L  12798920    T   0   41  0   0   NA  NA  NA      NA                  NA  NA
