删除一个文件中存在于另一文件中的行 [英] Removing lines in one file that are present in another file
问题描述
我有2个具有基因组数据的.vcf文件,我想删除第一个文件中也存在于第二个文件中的行.到目前为止,我所拥有的代码似乎只迭代一次,删除了第一个匹配项,然后停止了搜索.任何帮助将不胜感激,因为我不知道问题出在哪里.抱歉,输入任何错误代码...
我选择使用数组而不是哈希数组,因为必须保持文件的初始顺序,我认为使用数组可以更好地实现这一点...
代码:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_array
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS文件:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET文件:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
我希望输出像这样
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
但是要给我这个:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
为什么此嵌套循环无法正常工作?如果我想保留这种数组数组的结构,为什么只在第一次进行迭代?
最好更改foreach循环
foreach my $refH ( @ref_arrayH ) {
带有for循环
for (my $i = 0; $i <= $entriesH; $i++) {
?
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;
open my $cns, '<', 't.cns' or die $!;
while (my $line = <$cns>) {
next if exists $seen{ $line };
print $line;
}
close $cns or die $!;
如果要匹配整行以外的其他内容,请提取字段并使用它(或它们的组合)键入%seen
哈希.
这将与HET
文件中的行数成比例使用内存.
要减少内存使用量,可以将%seen
绑定到磁盘上的 DBM_File . /p>
I have 2 .vcf files with genomic data and I want to remove lines in the 1st file that are also present in the second file. The code I have so far it seems to iterate only one time, removing the first hit and then stops the search. Any help would be very appreciated since I can not figure out where the problem is. Sorry for any mis-code...
I opted for arrays of arrays instead of hashes because the initial order of the file must be maintained, and I think that this is better achieved with arrays...
Code:
#!/usr/bin/perl
use strict;
use warnings;
## bring files to program
MAIN: {
my ($chrC, $posC, $junkC, $baserefC, $allrestC);
my (@ref_arrayH, @ref_arrayC);
my ($chrH, $posH, $baserefH);
my $entriesH;
# open the het file first
open (HET, $het) or die "Could not open file $het - $!";
while (<HET>) {
if (defined) {
chomp;
if (/^#/) { next; }
if ( /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
my @line_arrayH = split(/\t/, $_);
push (@ref_arrayH, \@line_arrayH); # the "reference" of each line_array is now in each element of @ref_array
$entriesH = scalar @ref_arrayH; # count the number of entries in the het file
}
}
}
close HET;
# print $entriesH,"\n";
open (CNS, $cns) or die "Could not open file $cns - $!";
foreach my $refH ( @ref_arrayH ) {
$chrH = $refH -> [0];
$posH = $refH -> [1];
$baserefH = $refH -> [3];
foreach my $line (<CNS>) {
chomp $line;
if ($line =~ /^#/) { next; }
if ($line =~ /(^.+?)\s(\d+?)\s(.+?)\s([A-Za-z\.]+?)\s([A-Za-z\.]+?)\s(.+?)\s(.+?)\s(.+)/m ) { # is a VCF file
($chrC, $posC, $junkC, $baserefC, $allrestC) = split(/\t/,$line);
if ( $chrC eq $chrH and $posC == $posH and $baserefC eq $baserefH ) { next }
else { print "$line\n"; }
}
}
}
# close CNS;
}
CNS file:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
HET file:
chrI 1087 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
I would like the output to be like this
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
but is giving me this instead:
chrI 1084 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1085 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1086 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1088 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1089 . A . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1090 . C . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1091 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1099 . A . 32.8 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.2 PL 0
chrI 1100 . G . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
chrI 1101 . G . 12.3 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30.1 PL 18
chrI 1102 . G . 32.9 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30.1 PL 0
chrI 1103 . A . 5.45 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 PL 26
chrI 1104 . C T 3.55 . DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=52;FQ=-30 GT:PL:GQ 0/1:31,3,0:4
chrI 1105 . T . 33 . DP=1;AF1=0;AC1=0;DP4=1,0,0,0;MQ=52;FQ=-30 PL 0
Why is this nested loop not working properly? If I want to keep this structure of array-of-arrays, why is only doing the iteration the first time?
Would it be better to change the foreach loop
foreach my $refH ( @ref_arrayH ) {
with a for loop
for (my $i = 0; $i <= $entriesH; $i++) {
?
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
open my $het, '<', 't.het' or die $!;
$seen{ $_ } = undef while <$het>;
close $het or die $!;
open my $cns, '<', 't.cns' or die $!;
while (my $line = <$cns>) {
next if exists $seen{ $line };
print $line;
}
close $cns or die $!;
If you want to match something other than entire lines, extract the field(s) and use it (or their combination) to key into the %seen
hash.
This will use memory in proportion to the number of lines in the HET
file.
To reduce memory usage, you can tie %seen
to a DBM_File on disk.
这篇关于删除一个文件中存在于另一文件中的行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!