Perl: How to join two columns of a text file, in which values of the first column should match in order with the values of the second column

Question

I am a beginner with Perl programming. The problem I am working on right now is how to get the gene length from a text file. Text file contains the gene name (column 10), start site (column 6), end site (column 7). The length can be derived from the difference of column 6 and 7. But my problem is how to match the gene name (from column 10) with the corresponding difference derived from the difference of column 6 and column 7. Thank you very much!

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>) {
    @data = split (/\t/, $a);
    $list {$data[10]}++;
    $genelength {$data[7] - $data[6]};
}

foreach $sub (keys %list){
    $gene = join ($sub, $genelength);

    print "$gene\n";
}
close (IN);
close (OUT);

Recommended answer

I'm not sure about this as I haven't seen your data. But I think you're making this far harder than necessary. I think that everything you need for each gene is in a single line of the input file, so you can process the file a line at a time and not use any extra variables. Something like this:

open (IN, "Alu.txt");
open (OUT, ">Alu_subfamlength3.csv");

while ($a = <IN>) {
    @data = split (/\t/, $a);
    print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n"; # write to the output file
}

But there are some improvements we can make. First, we'll stop using $a (which is a special variable and shouldn't be used in random code) and switch to $_ instead. At the same time we'll add use strict and use warnings and ensure that all of our variables are declared.

use strict;
use warnings;

open (IN, '<', "Alu.txt") or die "Cannot open Alu.txt: $!";
open (OUT, '>', "Alu_subfamlength3.csv") or die "Cannot open Alu_subfamlength3.csv: $!";

while (<IN>) { # This puts the line into $_
    my @data = split (/\t/); # split uses $_ by default
    print OUT "Gene: $data[10] / Length: ", $data[7] - $data[6], "\n";
}
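
As a brief aside on why $a is best avoided: Perl's built-in sort uses the package variables $a and $b to hold the two values being compared, so reusing $a for a line of input can cause confusion. A minimal sketch with made-up lengths (the values are illustrative only):

```perl
use strict;
use warnings;

# Hypothetical gene lengths, for illustration only.
my @lengths = (312, 45, 300);

# sort sets the special package variables $a and $b for each comparison,
# which is why they are exempt from "use strict" and best left to sort.
my @sorted = sort { $a <=> $b } @lengths;

print "@sorted\n";    # prints "45 300 312"
```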

Next we'll remove the unnecessary parentheses on the split() call and use a list slice to just get the values you want and store them in individual variables.

use strict;
use warnings;

open (IN, '<', "Alu.txt") or die "Cannot open Alu.txt: $!";
open (OUT, '>', "Alu_subfamlength3.csv") or die "Cannot open Alu_subfamlength3.csv: $!";

while (<IN>) { # This puts the line into $_
    my ($start, $end, $gene) = (split /\t/)[6, 7, 10]; # split uses $_ by default
    print OUT "Gene: $gene / Length: ", $end - $start, "\n";
}
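
To see the list slice in isolation, here is a small sketch run on one made-up line (the field values and the gene name AluY are assumptions, since the real data isn't shown). Note that the indices are zero-based, so [6, 7, 10] picks the 7th, 8th and 11th fields; if your columns 6, 7 and 10 are counted from 1, you would probably want [5, 6, 9] instead. The chomp is there in case the gene name is the last field on the line, where it would otherwise keep the trailing newline.

```perl
use strict;
use warnings;

# Hypothetical sample line with 11 tab-separated fields (indices 0..10).
my $line = join("\t", 'f0', 'f1', 'f2', 'f3', 'f4', 'f5',
                      100, 412, 'f8', 'f9', 'AluY') . "\n";

chomp $line;    # remove the trailing newline so the last field is clean

# Slice the list returned by split: keep only fields 6, 7 and 10.
my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
my $length = $end - $start;

print "Gene: $gene / Length: $length\n";    # prints "Gene: AluY / Length: 312"
```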

Next, we'll remove the explicit filenames. Instead, we'll read data from STDIN and write it to STDOUT. This is a common Unix/Linux approach called an I/O filter. It will make your program more flexible (and, as a bonus, easier to write).

use strict;
use warnings;

while (<>) { # Empty <> reads from files named on the command line, or STDIN if there are none
    my ($start, $end, $gene) = (split /\t/)[6, 7, 10];
    # print to STDOUT
    print "Gene: $gene / Length: ", $end - $start, "\n";
}

To use this program, we use an operating system feature called I/O redirection. If the program is called filter_genes, we would call it like this:

$ ./filter_genes < Alu.txt > Alu_subfamlength3.csv

And if the names of your files change in the future, you don't need to change your program, just the command line that calls it.
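
If you would rather keep the logic in one place and decide separately where the data comes from, the loop could be wrapped in a subroutine that takes any input and output filehandle. This is only a sketch: the name write_gene_lengths and the sample data are made up for illustration, and the demo uses in-memory filehandles so that it runs without Alu.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: reads tab-separated lines from $in and writes
# one "Gene / Length" line per input line to $out.
sub write_gene_lengths {
    my ($in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my ($start, $end, $gene) = (split /\t/, $line)[6, 7, 10];
        next unless defined $gene;    # skip blank or short lines
        print {$out} "Gene: $gene / Length: ", $end - $start, "\n";
    }
    return;
}

# Made-up input: two lines, each with 11 tab-separated fields.
my $input = join("\t", ('x') x 6, 100, 412, 'x', 'x', 'AluY')  . "\n"
          . join("\t", ('x') x 6,  50, 350, 'x', 'x', 'AluSx') . "\n";
my $output = '';

# Lexical filehandles opened on scalars behave like ordinary files.
open my $in,  '<', \$input  or die "Cannot read in-memory input: $!";
open my $out, '>', \$output or die "Cannot write in-memory output: $!";
write_gene_lengths($in, $out);
close $in;
close $out;

print $output;
```

With real files you would open 'Alu.txt' and 'Alu_subfamlength3.csv' in the same three-argument style, or simply pass \*STDIN and \*STDOUT to keep the filter behaviour described above.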
