How to compare the first column of two files but get the second columns (using Perl)


Question

I have two files (two columns each, separated by tabs) and I want to compare them based on the first column. If a value in the first column appears in both files, I want to create a new file from the corresponding second-column values. Also, take into account that IDs in the first column of FILE1 can be duplicated. Basically I have:

FILE1:

TRINITY_DN10001_c0_g1_i1     TRINITY_DN10001_c0_g1_TRINITY_DN10001_c0_g1_i1_g.84091_m.84091
TRINITY_DN100032_c0_g2_i1    TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.20078_m.20078
TRINITY_DN100032_c0_g2_i1    TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.42263_m.42263
.....
TRINITY_DN99985_c0_g1_i1     TRINITY_DN99985_c0_g1_TRINITY_DN99985_c0_g1_i1_g.21199_m.21199

FILE2:

TRINITY_DN100007_c0_g1_i1   GO:0001071,GO:0003674
TRINITY_DN100032_c0_g2_i1   GO:0000149,GO:0001775
.....
TRINITY_DN99997_c0_g1_i1    GO:0000166,GO:0001882

I need this:

TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.20078_m.20078    GO:0000149,GO:0001775
TRINITY_DN100032_c0_g2_TRINITY_DN100032_c0_g2_i1_g.42263_m.42263    GO:0000149,GO:0001775
.....

I think this can be done by combining two hash tables in Perl, somewhat similar to this answer.

But I'm quite new to Perl, so I don't know exactly how to do this. I would really appreciate it if someone could help modify the previous script (or solve this problem in a different way).

Thanks in advance! ☺

Recommended answer

How big are the files? Are they small enough to fit in memory? Are they sorted?

Assuming that one of the files is small enough to fit in memory, you can read that file and hash it: the key is the first column, the value is the second column. Then read through the other file, checking the hash as you go to see whether each key exists, and, if so, print out both second columns (one of which is the value from the hash).

Assuming we have $file1 and $file2, and that $file1 is small enough, we get something like this:

open my $fh, '<', $file1 or die "Can't read $file1: $!";
# This slurps in the file, be sure you can fit it all in memory multiple times over!
# chomp strips the newline so it does not end up inside the stored value.
my %file1 = map { chomp; split /\t/, $_, 2 } <$fh>;
close $fh;
open $fh, '<', $file2 or die "Can't read $file2: $!";
while (<$fh>) {
    chomp;
    my ($k, $v) = split /\t/, $_, 2;
    # exists() also matches keys whose stored value is empty or "0"
    if (exists $file1{$k}) {
        print join("\t", $file1{$k}, $v), "\n";
    }
}
close $fh;
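
To see why repeated IDs need different handling, here is a minimal sketch with made-up rows (the IDs and values are purely illustrative). Assigning a flat list of key/value pairs to a hash keeps only the last value for a repeated key, whereas pushing onto an array per key keeps them all:

```perl
use strict;
use warnings;

# Hypothetical sample rows in "key<TAB>value" form; two rows share the key id2.
my @rows = (
    "id1\tA",
    "id2\tB",
    "id2\tC",
);

# Flat list assigned to a hash: a later pair overwrites an earlier one with the same key.
my %last_only = map { split /\t/, $_, 2 } @rows;

# One array per key: every value is kept, in input order.
my %all;
for my $row (@rows) {
    my ($k, $v) = split /\t/, $row, 2;
    push @{ $all{$k} }, $v;
}

print "last_only id2: $last_only{id2}\n";    # prints "last_only id2: C"
print "all id2: @{ $all{id2} }\n";           # prints "all id2: B C"
```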



Assuming the same, but allowing file1 to have duplicates:

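open my $fh, '<', $file1 or die "Can't read $file1: $!";
my %file1;
while (<$fh>) {
    chomp;    # strip the newline so it does not end up inside the stored value
    my ($k, $v) = split /\t/, $_, 2;
    # keep every value seen for this key, in file order
    push @{$file1{$k}}, $v;
}
close $fh;
open $fh, '<', $file2 or die "Can't read $file2: $!";
while (<$fh>) {
    chomp;
    my ($k, $v) = split /\t/, $_, 2;
    if (exists $file1{$k}) {
        # one output line per file1 value that shares this key
        print join("\t", $_, $v), "\n" for @{$file1{$k}};
    }
}
close $fh;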

Note that for each key, the values from file1 are printed in the same order in which they appear in file1; the keys themselves come out in file2 order.
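
The duplicate-aware approach can also be wrapped in a reusable sub. This is only a sketch: `join_on_first_column` is a hypothetical name, not part of the original answer, and it assumes tab-separated input with no carriage returns. It takes filehandles rather than file names, so it can be tried on in-memory strings before pointing it at real files:

```perl
use strict;
use warnings;

# join_on_first_column (hypothetical helper): reads two tab-separated filehandles
# and returns "file1_col2<TAB>file2_col2" strings for every key found in both.
sub join_on_first_column {
    my ($fh1, $fh2) = @_;

    # First column => array of all second-column values seen in file1.
    my %file1;
    while (my $line = <$fh1>) {
        chomp $line;
        my ($k, $v) = split /\t/, $line, 2;
        next unless defined $v;    # skip blank lines and lines without a tab
        push @{ $file1{$k} }, $v;
    }

    my @out;
    while (my $line = <$fh2>) {
        chomp $line;
        my ($k, $v) = split /\t/, $line, 2;
        next unless defined $v && exists $file1{$k};
        push @out, join("\t", $_, $v) for @{ $file1{$k} };
    }
    return @out;
}

# Demo on made-up data, using in-memory filehandles.
my $data1 = "id1\tA\nid2\tB\nid2\tC\n";
my $data2 = "id0\tGO:1\nid2\tGO:2,GO:3\n";
open my $fh1, '<', \$data1 or die "Can't open in-memory file1: $!";
open my $fh2, '<', \$data2 or die "Can't open in-memory file2: $!";
my @joined = join_on_first_column($fh1, $fh2);
print "$_\n" for @joined;
```

For real files, open `$file1` and `$file2` for reading as in the snippets above and pass those handles instead.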
