Curiously behaving IF block in Perl run on Windows

Views: 123  Published: 2018/7/17 9:45:05  Tags: perl debugging if-statement nested-loops bioinformatics

Question

Background: I have a Perl script that I wrote to go through two files. The basic point of the script is to identify overlaps between one list of coordinates, defining the beginnings and ends of randomly selected chromosomal segments, and a second list of coordinates, defining the beginnings and endings of actual gene transcripts.

The first input file contains three columns. The first is the chromosome number, and the second and third are the proximal and distal coordinates, in base pairs, of the randomly selected regions. For example,

chr1    1100349    2035647
chr1    47837656   736474584
.       .          .
.       .          .
.       .          .

The second input file contains four columns: chromosome number, proximal coordinate, distal coordinate, and the name of the gene. For example,

chr1    1588354    2283765    geneA
chr1    55943837   787653743    geneB

Here is a set of test files I used to start off with. First set.

chr1    1   10
chr1    5   10
chr1    5   15
chr1    14  15
chr1    100 101
chr1    11  17

Second set.

chr1    1   5   geneA
chr1    7   10  geneB
chr1    12  16  geneC
chr1    18  21  geneD
chr10   126602211   126609396   B4galnt1

The script reads off the first line from the first list, then reads through all the lines of the second list, and prints for me whether and how the first coordinate pair overlaps with the second coordinate pair (Is the first coordinate pair outside the second pair? Is the first pair inside or overlapping with the second?) Then, the script goes back and reads off the second line from the first list, and repeats the process. The first file has 200,000 lines. The second several thousand. It is running now overnight.

The problem: When the script determines the relationship between the first and second coordinate pairs, it prints out a line to an output file. Not all these print statements need to be sent to output, so I tried to comment them out. However, when I did this, none of the print statements sending information to the output file got printed. Statements are printed to the screen, though, just not to the output file. The script is running, but all the print to output statements are being used, so the output file is getting huge. If the script would just print to output for only those coordinates that overlap, the output file would be very, very much smaller. At present, the output file is now 2,131,294 KB! And that's only up to chromosome 11. There are eight more to go through, albeit smaller ones, but the file size is still going to expand greatly.

Updated information: This is edited in after my original posting. To be more precise, it is only when I comment out the first print $output "..."; statement that is inside the loop (the very first statement is to print a header, and this is before the loop) that the script fails to print anything, even when all the others are left alone (not commented).

In case it matters: I wrote the script on my Mac, using Fraise, but I am running it on a PC, the script contained in a Notepad text file.

Here's the script: Note: there are many print statements in the file, many commented out. The print statements of interest are those printing to the output file. Those are the ones that, when one or more are commented out, wind up never sending information to the output file. Those statements look like:

print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";

The actual script:

#!/bin/usr/perl
use strict; use warnings;

#############
## findGenes_after_ASboot_v5.pl
#############

#############
#  After making a big list of randomly placed intervals,
#  this script uses RefGene.txt file and identifies the 
#  the gene symbols encompassed or overlapped by each random interval 
#############

unless(scalar @ARGV == 2) {
    # $0 name of the program being executed;
    print "\n usage: $0 filename containig your list of positions and a RefGene-type file \n\n"; 
    exit;
}

#for ( my $i = 0; $i < 25; $i++ ){
#     print "#########################################\n";
#}

open( my $positions, "<", $ARGV[0] ) or die;
open( my $RefGene,   "<", $ARGV[1] ) or die;

open( my $output, ">>", "output.txt") or die;

# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";

my $pos_count = 1;
my $ref_count = 1;

for my $position_line (<$positions>) {
    #print "$position_line";
    my @posline = split('\t', $position_line);
    #print "$posline[0]\t$posline[1]\t$posline[2]";
    open( my $RefGene,   "<", $ARGV[1] ) or die;

    for my $ref (<$RefGene>){
        #print "\t$ref";    
        my @refline = split('\t', $ref);
        # print "\t$refline[0]\t$refline[1]\t$refline[2]\t$refline[3]";
        chomp $posline[2];
        chomp $refline[3];     
        if ( $posline[0] eq $refline[0] ){
            #print "\tchr match\n";

            # am i entirely prox to a gene?
            if ( $posline[2] < $refline[1] ){
                #print "too proximal\n";
                print "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";

                #the following print statement is one I'd like to be able to comment out
                print $output "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo proximal\n";
                $ref_count++; 
                next; 
            }

            # am i entirely distal to a gene?
            elsif ( $posline[1] > $refline[2] ){
                #print "too distal\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
                #the following print statement is one I'd like to be able to comment out
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\ttoo distal\n";
                $ref_count++; 
                next; 
            }

            # am i completely inside a gene?
            elsif ( $posline[1] >= $refline[1] &&
                $posline[2] <= $refline[2]    ){
                #print "inside\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tinside\n";
                $ref_count++; 
                next; 
            }

            # am i proximally overlapping?
            elsif ( $posline[1] < $refline[1] &&
                $posline[2] <= $refline[2]    ){
                #print "proximal overlap\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tproximal overlap\n";
                $ref_count++; 
                next; 
            }
            # am i distally overlapping?
            elsif ( $posline[1] >= $refline[1] &&
                $posline[2] > $refline[2]    ){
                #print "distal overlap\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tdistal overlap\n";
                $ref_count++; 
                next; 
            }

            else {
                #print "encompassing\n";
                print  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
                print $output  "$posline[0]\t$pos_count\t$posline[1]\t$posline[2]\t$refline[0]\t$ref_count\t$refline[1]\t$refline[2]\t$refline[3]\tencompassing\n";
                $ref_count++; 
                next;
            }       

        } # if a match with chr

        else {
            next;
        }

    } # for each reference
    $pos_count++;    
} # for each position

Data files:

http://www.filedropper.com/proxdistalpositionsofrandompositions
http://www.filedropper.com/modifiedrefgene
Some output: http://www.filedropper.com/output_17

Recommended answer

I see two potential flaws in your code:

Always use while when processing a file, instead of for.

Whenever you use the latter, you're actually loading the entire file into memory versus just doing line by line processing. If you're actually able to support doing that though, you should go ahead and load your smaller file entirely and just iterate on the lines.
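The difference between the two loop forms can be seen in a small sketch. It reads a hypothetical three-line input through an in-memory filehandle (the sample rows and variable names are assumptions for illustration, not taken from the data files), and checks eof during the first iteration to show how much of the file has already been consumed.

```perl
use strict;
use warnings;

# Hypothetical three-line input, read through an in-memory filehandle
# so that the sketch needs no files on disk.
my $data = "chr1\t1\t10\nchr1\t5\t10\nchr1\t5\t15\n";

# for: <$fh> is evaluated in list context, so every line is read
# into a temporary list before the first iteration starts.
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $for_seen = 0;
my $left_in_for;
for my $line (<$fh>) {
    $for_seen++;
    # Record, on the first iteration only, whether anything is left to read.
    $left_in_for //= eof($fh) ? 'nothing' : 'more lines';
}
close $fh;

# while: <$fh> is evaluated in scalar context, one line per iteration.
open $fh, '<', \$data or die "cannot open in-memory file: $!";
my $while_seen = 0;
my $left_in_while;
while (my $line = <$fh>) {
    $while_seen++;
    $left_in_while //= eof($fh) ? 'nothing' : 'more lines';
}
close $fh;

print "for:   $for_seen lines, left to read during first iteration: $left_in_for\n";
print "while: $while_seen lines, left to read during first iteration: $left_in_while\n";
```

Both loops visit the same three lines; only the for version has already pulled the whole file into memory by the time its body first runs, which is what matters with a 200,000-line input.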

Split on "\t", not '\t'.

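The latter reads like a bug, as if the data used a 2 character delimiter. Strictly speaking, split compiles its first argument as a pattern, so '\t' still matches a tab, but "\t" or /\t/ states the intent unambiguously.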
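A quick way to check how split treats the different quoting styles is to run them side by side on one row. This is a minimal sketch using a sample row in the format of the second input file. Because split interprets a string first argument as a regular expression, the single-quoted form is expected to behave like the others here, and a real backslash-plus-t delimiter has to be escaped explicitly.

```perl
use strict;
use warnings;

# Tab-separated sample row in the format of the second input file.
my $row = "chr1\t1588354\t2283765\tgeneA";

my @double = split "\t", $row;   # the pattern is a literal tab character
my @single = split '\t', $row;   # the pattern is backslash + t, compiled as the regex \t
my @regex  = split /\t/, $row;   # explicit regex form, clearest about intent

print "double-quoted: ", scalar(@double), " fields\n";
print "single-quoted: ", scalar(@single), " fields\n";
print "regex:         ", scalar(@regex),  " fields\n";

# Splitting on a real two-character delimiter (backslash followed by t)
# needs the backslash itself escaped in the pattern.
my @literal = split /\\t/, 'a\tb';   # single quotes keep the backslash in the string
print "literal backslash-t: ", scalar(@literal), " fields\n";
```

Whichever form is used, writing it as /\t/ avoids any doubt about what the delimiter is.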

Anyway, I've cleaned up your code considerably, removing duplicated lines and so on. The code is untested, so some of these changes may not work or may not be what you want. However, if you go through the code, perhaps it will at least give you some ideas:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

#############
## findGenes_after_ASboot_v5.pl
#############

#############
#  After making a big list of randomly placed intervals,
#  this script uses RefGene.txt file and identifies the
#  gene symbols encompassed or overlapped by each random interval
#############

die "\n usage: $0 filename containing your list of positions and a RefGene-type file \n\n"
    if @ARGV != 2;

open my $positions, "<", $ARGV[0];

# Cache file by key
my %refgenes;
open my $RefGene,   "<", $ARGV[1];
while (<$RefGene>) {
    chomp;
    my @cols = split "\t";
    push @{$refgenes{$cols[0]}}, \@cols;
}

open my $output, ">>", "output.txt";

# print header
print $output "chr\tpos count\tpos1\tpos2\tchr\tref count\tref1\tref2\tname2\trelationship\n";

my $pos_count = 1;
my $ref_count = 1;

while (my $position_line = <$positions>) {
    chomp $position_line;
    my @posline = split "\t", $position_line;

    # Only iterate on matching refs
    for my $ref (@{ $refgenes{$posline[0]} }) {
        my @refline = @$ref;

        my $desc = join "\t", ($posline[0], $pos_count, @posline[1,2], $refline[0], $ref_count, @refline[1,2,3]);
        my $message = '';

        # am i entirely prox to a gene?
        if ( $posline[2] < $refline[1] ){
            $message = 'too proximal';

        # am i entirely distal to a gene?
        } elsif ( $posline[1] > $refline[2] ) {
            $message = 'too distal';

        # am i completely inside a gene?
        } elsif ( $posline[1] >= $refline[1] && $posline[2] <= $refline[2] ) {
            $message = 'inside';

        # am i proximally overlapping?
        } elsif ( $posline[1] < $refline[1] && $posline[2] <= $refline[2] ) {
            $message = 'proximal overlap';

        # am i distally overlapping?
        } elsif ( $posline[1] >= $refline[1] && $posline[2] > $refline[2] ) {
            $message = 'distal overlap';

        } else {
            $message = 'encompassing';
        }

        print "$desc\t$message\n";
        print $output "$desc\t$message\n";

        $ref_count++; 
    } # for each reference
    $pos_count++;
} # for each position
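Since the goal in the question is an output file that holds only the overlapping coordinates, one option is to decide per relationship whether a row is written at all. The following is a minimal sketch under stated assumptions, not a tested replacement for the script: the classify sub and the %skip hash are hypothetical names, the choice to drop 'too proximal' and 'too distal' is an assumption about which rows are unwanted, and the data is the chr1 part of the small test set from the question, written to an in-memory filehandle instead of output.txt.

```perl
use strict;
use warnings;

# Classify interval [p1, p2] against gene [r1, r2], using the same
# branch order as the if/elsif chain in the script.
sub classify {
    my ($p1, $p2, $r1, $r2) = @_;
    return 'too proximal'     if $p2 <  $r1;
    return 'too distal'       if $p1 >  $r2;
    return 'inside'           if $p1 >= $r1 && $p2 <= $r2;
    return 'proximal overlap' if $p1 <  $r1 && $p2 <= $r2;
    return 'distal overlap'   if $p1 >= $r1 && $p2 >  $r2;
    return 'encompassing';
}

# Relationships that should not reach the output file (assumed choice).
my %skip = map { $_ => 1 } ('too proximal', 'too distal');

# chr1 rows of the test files from the question.
my @intervals = ([1, 10], [5, 10], [5, 15], [14, 15], [100, 101], [11, 17]);
my @genes     = ([1, 5, 'geneA'], [7, 10, 'geneB'], [12, 16, 'geneC'], [18, 21, 'geneD']);

# Write to an in-memory filehandle so the sketch leaves no file behind.
my $report = '';
open my $output, '>', \$report or die "cannot open in-memory output: $!";
for my $iv (@intervals) {
    for my $gene (@genes) {
        my $message = classify(@$iv, @{$gene}[0, 1]);
        next if $skip{$message};   # non-overlapping pairs are never written
        print {$output} join("\t", 'chr1', @$iv, @$gene, $message), "\n";
    }
}
close $output;

print $report;
```

Keeping the filter in one hash means the set of reported relationships can be changed without commenting print statements in and out.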
