Fastest way to check duplicate columns in a line in Perl


Problem Description



I have a file with 1 million lines like this

aaa,111
bbb,222
...
...
a3z,222 (line# 500,000)
...
...
bz1,444 (last line# 1 million)

What I need to check is whether the second value after the comma is unique or not. If not, then print out the line number. In the example above it should print out

Duplicate: line: 500000 value: a3z,222

For this I am using Perl and storing the value of the second column in an array. If I don't find a value in the array I add it to it. If the value already exists then I print it out as a duplicate.

The problem is that the logic I am using is super slow: it takes anywhere from 2-3 hours to complete. Is there a way I can speed this up? I don't want to create an array if I don't have to; I just want to check for duplicate values in column 2 of the file.

If there is a faster way to do it in a batch file, I am open to it.

Here's my working code.

# header
use warnings;
use DateTime;
use strict;
use POSIX qw(strftime);
use File::Find;
use File::Slurp;
use File::Spec;
use List::MoreUtils qw(uniq);

print "Perl Starting ... \n\n"; 

# Open the file for read access:
open my $filehandle, '<', 'my_huge_input_file.txt';

my $counter = 0;
my @uniqueArray;
# Loop through each line:
while (defined(my $recordLine = <$filehandle>))
{

  # Keep track of line numbers
  $counter++;


  # Strip the linebreak character at the end.
  chomp $recordLine;

  my @fields = split(/,/, $recordLine);
  my $val1=$fields[0]; 
  my $val2=$fields[1]; 

  if ( !($val2 ~~ @uniqueArray) && ($val2 ne "") )
  {
        push(@uniqueArray, $val2);
  }
  else
  {
    print ("DUP line: $counter - val1: $val1 - val2: $val2 \n");
  }

}

print "\nPerl End ... \n\n"; 

Solution

That's one of the things a hash is for

use feature qw(say);

...

my %second_field_value;

while (defined(my $recordLine = <$filehandle>))
{
    chomp $recordLine; 
    my @fields = split /,/, $recordLine;

    if (exists $second_field_value{$fields[1]}) {
        say "DIP line: $. -- @fields[0,1]";
    }
    ++$second_field_value{$fields[1]};
}

This will accumulate all possible values for this field, as it must. We can also store suitable info about dupes as they are found, depending on what needs to be reported about them.
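For instance, a minimal sketch of one way to also record where each duplicate occurs (the hash-of-arrays layout and the reporting loop are assumptions for illustration, not part of the original answer):

use feature qw(say);

my %second_field_value;
my %dup_lines;    # value => [ line numbers where that value re-appears ]

while (defined(my $recordLine = <$filehandle>))
{
    chomp $recordLine;
    my @fields = split /,/, $recordLine;

    if (exists $second_field_value{$fields[1]}) {
        push @{ $dup_lines{$fields[1]} }, $.;    # remember the line of this dupe
    }
    ++$second_field_value{$fields[1]};
}

# Report once the whole file has been read, grouped by duplicated value
for my $val (sort keys %dup_lines) {
    say "Value $val repeats on lines: @{ $dup_lines{$val} }";
}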

The line number (of the last-read filehandle) is available in the $. variable; there is no need for $counter.

Note that the check and the flag/counter update can be done in one expression, as in

if ($second_field_value{$fields[1]}++) { say ... }  # already seen before

which is an idiom when checking for duplicates. Thanks to ikegami for bringing it up. This works by putting the post-increment in the condition, so the check is done with the old value while the count is already up to date inside the block.
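Put together, the whole program can be as compact as the following sketch (the "or die" on open is an addition here, not part of the question's original code):

use strict;
use warnings;
use feature qw(say);

open my $filehandle, '<', 'my_huge_input_file.txt'
    or die "Cannot open input file: $!";

my %second_field_value;

while (defined(my $recordLine = <$filehandle>))
{
    chomp $recordLine;
    my @fields = split /,/, $recordLine;

    # Post-increment: the test sees the old count, the hash ends up current
    if ($second_field_value{$fields[1]}++) {
        say "DUP line: $. -- @fields[0,1]";
    }
}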

I have to comment on the smart-match operator (~~) as well. It is widely understood that it has great problems in its current form, and it is practically certain that it will suffer major changes, or worse. So, simply put, I'd say: don't use it. Code that relies on it has every chance of breaking at some unspecified point in the future, possibly without warning, and perhaps quietly.
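If a direct array-membership test is still needed somewhere, one ~~-free replacement for the question's condition is any from List::MoreUtils, which the question's code already loads (only a sketch of the substitution; it is still a linear scan, so it does not fix the performance problem):

use List::MoreUtils qw(any);

# Same membership test as ($val2 ~~ @uniqueArray), without the smart-match operator
if ( any { $_ eq $val2 } @uniqueArray ) {
    print "DUP line: $counter - val1: $val1 - val2: $val2 \n";
}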


A note on performance and "computational complexity," raised in the comments.

Searching through an array on every line has O(nm) complexity (n lines, m values), which is really O(n²) here since the array gets a new value on each line (so m = n-1); further, the whole array gets searched on practically every line, as there normally aren't dupes. With the hash the complexity is O(n), as we have a constant-time lookup on each line.
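To put rough numbers on that (an estimate, not a measurement): with n = 1,000,000 lines the array approach does on the order of n²/2 ≈ 5×10^11 comparisons when the data is mostly duplicate-free, while the hash approach does about 10^6 constant-time lookups and inserts. A gap of roughly five orders of magnitude in operation counts is exactly the kind of difference that turns hours into seconds.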

The crucial thing is that all of this is about the size of the input. For a file of a few hundred lines we couldn't tell the difference. With a million lines, the reported run times are "anywhere from 2-3 hours" with the array and "under 5 seconds" with the hash.

The fact that a "complexity" assessment deals with the input size has practical implications.

For one, code with a carelessly built algorithm that "runs great" may break miserably for unexpectedly large inputs -- or, rather, for realistic data once it comes to production runs.

On the other hand, it is often quite satisfactory to run with code that is cleaner and simpler even though it has worse "complexity" -- when we understand its use cases.

Generally, the complexity tells us how the runtime depends on the input size, not what exactly it is. So an O(n²) algorithm may well run faster than an O(n log n) one for a small enough input. This has great practical importance and is used widely in choosing algorithms.
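To get a feel for the difference at a given input size on your own machine, a small Benchmark sketch over synthetic data works (the data sizes and the grep-based array scan below are assumptions for illustration only):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic "second column" values, kept small so the array scan finishes quickly
my @values = map { int rand 5_000 } 1 .. 2_000;

cmpthese(-1, {
    array_scan => sub {
        my @seen;
        for my $v (@values) {
            # Linear scan of everything seen so far, as in the question's approach
            push @seen, $v unless grep { $_ == $v } @seen;
        }
    },
    hash_lookup => sub {
        my %seen;
        for my $v (@values) {
            $seen{$v}++;    # constant-time lookup and count
        }
    },
});

Even at these modest sizes the hash version typically comes out well ahead; the point about small inputs is that the absolute times are tiny either way, so the worse complexity may simply not matter there.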
