Remove duplicate entries from multiple text files in Perl?


Question

I am new to this site and need help removing duplicate entries from multiple text files (in a loop). I tried the code below, but it does not remove duplicates for multiple files; it only works for a single file.

Code:

my $file = "$Log_dir/File_listing.txt";
my $outfile  = "$Log_dir/Remove_duplicate.txt";; 

open (IN, "<$file") or die "Couldn't open input file: $!"; 
open (OUT, ">$outfile") or die "Couldn't open output file: $!"; 
my %seen = ();
{
  my @ARGV = ($file);
  # local $^I = '.bac';
  while(<IN>){
    print OUT $seen{$_}++;
    next if $seen{$_} > 1;
    print OUT ;
  }
}

Thanks, Art

Recommended answer

Errors in your script:

  • You overwrite (a new copy of) @ARGV with $file, so it can never hold any other file arguments.
  • ...which does not matter anyway, because you open the file handle before you assign to @ARGV. You also never loop over the arguments; the block { ... } around the code serves no purpose.
  • %seen will contain dedupe data for all the files you open unless you reset it.
  • You print the count $seen{$_} to the output file, which I am sure you do not want.

You could rely on the implicit open of the @ARGV arguments that the diamond operator performs, but since you (probably) need a proper output file name for each input file, that would be an unwanted complication. An explicit loop over the file names is simpler:
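For comparison, here is a minimal sketch of the diamond-operator approach, assuming it is acceptable to send every deduplicated line to a single output handle instead of one output file per input. The sub name `dedupe_stream` is hypothetical. `eof` without parentheses is true at the end of each file read through `<>`, so that is where `%seen` is reset.

```perl
use strict;
use warnings;

# Hypothetical helper: print each distinct line of every file to $out_fh,
# deduplicating per file. The files are opened implicitly by <>.
sub dedupe_stream {
    my ($out_fh, @files) = @_;
    local @ARGV = @files;          # <> opens these names in order
    my %seen;
    while (my $line = <>) {
        print {$out_fh} $line unless $seen{$line}++;
        %seen = () if eof;         # end of the current file: forget its lines
    }
    return;
}

# Only run when file names were given; with an empty @ARGV, <> would read STDIN.
dedupe_stream(\*STDOUT, @ARGV) if @ARGV;
```

All output ends up on one handle, so separating the results per file would still need extra bookkeeping. That is the complication mentioned above.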

use strict;
use warnings;                      # always use these

for my $file (@ARGV) {             # loop over all file names
    my $out = "$file.deduped";     # create output file name
    open my $infh,  "<", $file or die "$file: $!";
    open my $outfh, ">", $out  or die "$out: $!";
    my %seen;
    while (<$infh>) {
        print $outfh $_ if !$seen{$_}++;   # print if a line is never seen before
    }
}

Note that declaring %seen lexically inside the for loop makes the script check for duplicates within each individual file. If you move the declaration outside the for loop, it will check for duplicates across all files. I am not sure which you prefer.
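A minimal sketch of both behaviours in one place, assuming the same "$file.deduped" naming as above. The sub name `dedupe_files` and its `$across_files` flag are hypothetical and only serve to illustrate the scoping difference.

```perl
use strict;
use warnings;

# Hypothetical helper: write "$file.deduped" next to each input file.
# When $across_files is true, one %seen hash is shared by every file.
sub dedupe_files {
    my ($across_files, @files) = @_;
    my %seen;                                  # declared outside the loop
    for my $file (@files) {
        %seen = () unless $across_files;       # per-file mode: start fresh
        my $out = "$file.deduped";
        open my $infh,  '<', $file or die "$file: $!";
        open my $outfh, '>', $out  or die "$out: $!";
        while (my $line = <$infh>) {
            print {$outfh} $line unless $seen{$line}++;
        }
        close $outfh or die "$out: $!";
    }
    return;
}

dedupe_files(1, @ARGV);   # 1 = dedupe across all files, 0 = per file
```

In the across-files mode, a line survives only in the output of the first file that contains it, so the order of the file arguments matters.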
