Remove duplicate entries from multiple text files in Perl?
Problem description
I am new to this site and need help removing duplicate entries from multiple text files (in a loop). I tried the code below; it works for a single file, but it does not remove duplicates when there are multiple files.
Code:
my $file = "$Log_dir/File_listing.txt";
my $outfile = "$Log_dir/Remove_duplicate.txt";;
open (IN, "<$file") or die "Couldn't open input file: $!";
open (OUT, ">$outfile") or die "Couldn't open output file: $!";
my %seen = ();
{
my @ARGV = ($file);
# local $^I = '.bac';
while(<IN>){
print OUT $seen{$_}++;
next if $seen{$_} > 1;
print OUT ;
}
}
Thanks, Art
Recommended answer
Errors in the script:
- You overwrite (a new copy of) @ARGV with $file, so it can never hold any more file arguments.
- ...which doesn't matter, because you open the file handle before you assign to @ARGV. On top of that, you do not loop over the arguments; you just have a block { ... } around the code, which serves no purpose.
- %seen will contain dedupe data for all the files you open unless you reset it.
- You print the count $seen{$_} to the output file, which I am sure you don't need.
You could use the implicit open of the @ARGV arguments via the diamond operator (<>), but since you (probably) need to assign a proper output file name for each new file, that adds an unwanted complication to such a solution.
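For comparison, here is a minimal sketch of what the diamond-operator route could look like. It is not part of the original answer. It sidesteps the output-file-name problem by editing each file in place through $^I (the variable commented out in the question), which is an assumption about what is acceptable here. The helper name dedupe_in_place and the .bak backup suffix are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: dedupe each named file in place, leaving FILE.bak behind.
sub dedupe_in_place {
    local @ARGV = @_;          # the files that <> will open implicitly
    return unless @ARGV;       # with no files, <> would read STDIN instead
    local $^I   = '.bak';      # turn on in-place editing with a backup copy
    my %seen;
    while (<>) {
        print unless $seen{$_}++;   # print writes to the file being rewritten
        %seen = () if eof;          # end of the current file: start fresh
    }
}

dedupe_in_place(@ARGV);
```

The eof test without parentheses is true at the end of each individual file, which is what allows %seen to be reset per file. Dropping that line would dedupe across all files instead. The answer's own script, which follows, opens each file explicitly and writes to a separate output file.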
use strict;
use warnings;                          # always use these

for my $file (@ARGV) {                 # loop over all file names
    my $out = "$file.deduped";         # create output file name
    open my $infh,  "<", $file or die "$file: $!";
    open my $outfh, ">", $out  or die "$out: $!";
    my %seen;
    while (<$infh>) {
        print $outfh $_ if !$seen{$_}++;   # print only if the line has not been seen before
    }
}
Note that using a lexically scoped %seen variable, declared inside the for loop, makes the script check for duplicates within each individual file. If you move the declaration outside the for loop, you will check for duplicates across all files. I am not sure which you prefer.
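As a sketch of the second option, the variant below declares %seen once, before the loop, so a line that already appeared in an earlier file is dropped from every later file as well. The wrapper name dedupe_across_files is hypothetical, and the explicit close check is an addition; the rest mirrors the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: write FILE.deduped for each file, dropping any line
# that already appeared in this file or in an earlier one.
sub dedupe_across_files {
    my @files = @_;
    my %seen;                              # declared once, shared by all files
    for my $file (@files) {
        my $out = "$file.deduped";
        open my $infh,  "<", $file or die "$file: $!";
        open my $outfh, ">", $out  or die "$out: $!";
        while (<$infh>) {
            print $outfh $_ if !$seen{$_}++;
        }
        close $outfh or die "$out: $!";    # surface write errors
    }
}

dedupe_across_files(@ARGV);
```

With this version the order of the file names matters: the first file to contain a given line keeps it, and later files lose it.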