删除一个文件中存在于另一文件中的行 [英] Removing lines in one file that are present in another file

查看:97
本文介绍了删除一个文件中存在于另一文件中的行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有2个具有基因组数据的.vcf文件,我想删除第一个文件中也存在于第二个文件中的行.到目前为止,我所拥有的代码似乎只迭代一次,删除了第一个匹配项,然后停止了搜索.任何帮助将不胜感激,因为我不知道问题出在哪里.抱歉,输入任何错误代码...

我选择使用数组而不是哈希数组,因为必须保持文件的初始顺序,我认为使用数组可以更好地实现这一点...

代码:

#!/usr/bin/perl

use strict;
use warnings;

## bring files to program

MAIN: {

my ($chrC, $posC, $junkC, $baserefC, $allrestC);

my (@ref_arrayH, @ref_arrayC);

my ($chrH, $posH, $baserefH);
my $entriesH;

# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
    if (defined) {
        chomp;
        if (/^#/) { next; }
        if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            my @line_arrayH = split(/\t/, $_);              
            push (@ref_arrayH, \@line_arrayH);      # the "reference" of each line_array is now in each element of @ref_array
            $entriesH = scalar @ref_arrayH;             # count the number of entries in the het file
        }
    }
}
close HET;

#   print $entriesH,"\n";

open (CNS, $cns) or die "Could not open file $cns - $!";

foreach my $refH ( @ref_arrayH ) {
    $chrH = $refH -> [0];
    $posH = $refH -> [1];
    $baserefH = $refH -> [3];

    foreach my $line (<CNS>) {
        chomp $line;
        if ($line =~ /^#/) { next; }
        if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            ($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
                if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
                else { print "$line\n"; }
        }
    }
}
 #  close CNS;

 }

CNS文件:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

HET文件:

chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4

我希望输出像这样

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

但是要给我这个:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

为什么此嵌套循环无法正常工作?如果我想保留这种数组数组的结构,为什么只在第一次进行迭代?

最好更改foreach循环

foreach my $refH ( @ref_arrayH ) {

带有for循环

for (my $i = 0; $i <= $entriesH; $i++) {

?

解决方案

#!/usr/bin/env perl

use strict;
use warnings;

my %seen;

open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;

open my $cns, '<', 't.cns' or die $!;

while (my $line = <$cns>) {
    next if exists $seen{ $line };
    print $line;
}

close $cns or die $!;

如果要匹配整行以外的其他内容,请提取字段并使用它(或它们的组合)键入%seen哈希.

这将与HET文件中的行数成比例使用内存.

要减少内存使用量,可以将%seen绑定到磁盘上的 DBM_File . /p>

I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far it seems to iterate only one time, removing the first hit and then stops the search. Any help would be very appreciated since I can not figure out where the problem is. Sorry for any mis-code...

I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...

Code:

#!/usr/bin/perl

use strict;
use warnings;

## bring files to program

MAIN: {

my ($chrC, $posC, $junkC, $baserefC, $allrestC);

my (@ref_arrayH, @ref_arrayC);

my ($chrH, $posH, $baserefH);
my $entriesH;

# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
    if (defined) {
        chomp;
        if (/^#/) { next; }
        if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            my @line_arrayH = split(/\t/, $_);              
            push (@ref_arrayH, \@line_arrayH);      # the "reference" of each line_array is now in each element of @ref_array
            $entriesH = scalar @ref_arrayH;             # count the number of entries in the het file
        }
    }
}
close HET;

#   print $entriesH,"\n";

open (CNS, $cns) or die "Could not open file $cns - $!";

foreach my $refH ( @ref_arrayH ) {
    $chrH = $refH -> [0];
    $posH = $refH -> [1];
    $baserefH = $refH -> [3];

    foreach my $line (<CNS>) {
        chomp $line;
        if ($line =~ /^#/) { next; }
        if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) {   # is a VCF file
            ($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
                if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
                else { print "$line\n"; }
        }
    }
}
 #  close CNS;

 }

CNS file:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

HET file:

chrI    1087    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4

I would like the output to be like this

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

but is giving me this instead:

chrI    1084    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1085    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1086    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1088    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1089    .   A   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1090    .   C   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1091    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1099    .   A   .   32.8    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL  0
chrI    1100    .   G   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0
chrI    1101    .   G   .   12.3    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL  18
chrI    1102    .   G   .   32.9    .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL  0
chrI    1103    .   A   .   5.45    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   PL  26
chrI    1104    .   C   T   3.55    .   DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30   GT:PL:GQ    0/1:31,3,0:4
chrI    1105    .   T   .   33  .   DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30   PL  0

Why is this nested loop not working properly? If I want to keep this structure of array-of-arrays, why is only doing the iteration the first time?

Would it be better to change the foreach loop

foreach my $refH ( @ref_arrayH ) {

with a for loop

for (my $i = 0; $i <= $entriesH; $i++) {

?

解决方案

#!/usr/bin/env perl

use strict;
use warnings;

my %seen;

open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;

open my $cns, '<', 't.cns' or die $!;

while (my $line = <$cns>) {
    next if exists $seen{ $line };
    print $line;
}

close $cns or die $!;

If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen hash.

This will use memory in proportion to the number of lines in the HET file.

To reduce memory usage, you can tie %seen to a DBM_File on disk.

这篇关于删除一个文件中存在于另一文件中的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆