Very huge associative array in Perl


Problem description

I need to merge two files into a new file.

Both files have over 300 million pipe-separated records, with the first column as the primary key. The rows aren't sorted. The second file may have records that the first file does not.

Sample file 1:

1001234|X15X1211,J,S,12,15,100.05

Sample file 2:

1231112|AJ32,,,18,JP     
1001234|AJ15,,,16,PP

Output:

1001234,X15X1211,J,S,12,15,100.05,AJ15,,,16,PP

I am using the following code:

use POSIX qw(strftime);
use Tie::File::AsHash;

# Ties the file to a hash: every access goes back to the file on disk.
tie %hash_REP, 'Tie::File::AsHash', 'rep.in', split => '\|'
    or die "Cannot tie rep.in: $!";
my $counter = 0;
while (my ($key, $val) = each %hash_REP) {
    if ($counter == 0) {
        print strftime("%a %b %e %H:%M:%S %Y", localtime), "\n";
    }
    $counter++;
}

It takes almost 1 hour to prepare the associative array. Is that really good, or really bad? Is there a faster way to handle this many records in an associative array? Any suggestion, in any scripting language, would really help.

Thanks, Nitin T.
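Before tuning the code, it is worth checking whether a 300-million-entry Perl hash fits in RAM at all; once the hash outgrows memory, the machine swaps and run times of an hour or more are expected. Below is a minimal sketch for estimating the per-entry cost from a small sample, assuming the CPAN module Devel::Size is installed; the sample size and file name are illustrative assumptions.

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Size qw(total_size);

# Load a sample of the input and extrapolate the hash's memory footprint.
my $sample = 100_000;                         # illustrative sample size
my %hash;
open(my $fh, '<', 'rep.in') or die "Cannot open rep.in: $!";
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $val) = split /\|/, $line, 2;
    $hash{$key} = $val;
    last if keys(%hash) >= $sample;
}
close($fh);

my $bytes_per_entry = total_size(\%hash) / keys(%hash);
printf "~%.0f bytes per entry => ~%.1f GB for 300 million entries\n",
       $bytes_per_entry, $bytes_per_entry * 300_000_000 / 2**30;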

I also tried the following program, which also took more than 1 hour:

#!/usr/bin/perl
use POSIX qw(strftime);

my $now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";

# Load APP.in into a hash keyed on the first pipe-separated column.
my %hash;
open FILE, "APP.in" or die $!;
while (my $line = <FILE>) {
    chomp($line);
    my ($key, $val) = split /\|/, $line;
    $hash{$key} = $val;
}
close FILE;

# Stream rep.in and append the matching APP.in value to each record.
my $filename = 'report.txt';
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
open FILE, "rep.in" or die $!;
while (my $line = <FILE>) {
    chomp($line);
    my @words = split /\|/, $line;
    for (my $i = 1; $i <= $#words; $i++) {
        print $fh $words[$i] . "|^";
    }
    print $fh $hash{$words[0]} . "\n";
}
close FILE;
close $fh;
print "done\n";

$now_string = strftime "%a %b %e %H:%M:%S %Y", localtime;
print $now_string . "\n";
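For reference, the same two-pass hash-join can be written to emit exactly the comma-joined output shown above (key, then the first file's fields, then the second file's). The following is a minimal sketch, not the original poster's code; it assumes rep.in corresponds to the first sample file, APP.in to the second, and that APP.in fits in memory.

#!/usr/bin/perl
use strict;
use warnings;

# Build the lookup hash from APP.in, keyed on the first pipe-separated column.
my %lookup;
open(my $app, '<', 'APP.in') or die "Cannot open APP.in: $!";
while (my $line = <$app>) {
    chomp $line;
    my ($key, $val) = split /\|/, $line, 2;   # split only on the first pipe
    $lookup{$key} = $val;
}
close($app);

# Stream rep.in and join each record with its APP.in counterpart.
open(my $rep, '<', 'rep.in')     or die "Cannot open rep.in: $!";
open(my $out, '>', 'report.txt') or die "Cannot open report.txt: $!";
while (my $line = <$rep>) {
    chomp $line;
    my ($key, $val) = split /\|/, $line, 2;
    next unless exists $lookup{$key};         # skip keys missing from APP.in
    print $out "$key,$val,$lookup{$key}\n";
}
close($rep);
close($out);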

Answer

I'd use sort to sort the data very quickly (5 seconds for 10,000,000 rows), and then merge the sorted files.

perl -e'
   # Read one line from the given handle and return (key, value),
   # or an empty list at EOF.
   sub get {
      my $fh = shift;
      my $line = <$fh>;
      return () if !defined($line);

      chomp($line);
      return split(/\|/, $line);
   }

   sub main {
      @ARGV == 2
         or die("usage\n");

      # Sort each input numerically on its first pipe-separated field.
      open(my $fh1, "-|", "sort", "-n", "-t", "|", $ARGV[0])
         or die("Cannot sort $ARGV[0]: $!\n");
      open(my $fh2, "-|", "sort", "-n", "-t", "|", $ARGV[1])
         or die("Cannot sort $ARGV[1]: $!\n");

      my ($key1, $val1) = get($fh1)  or return;
      my ($key2, $val2) = get($fh2)  or return;

      # Classic sorted-merge join: advance whichever side has the
      # smaller key; on a match, emit the combined record.
      while (1) {
         if    ($key1 < $key2) { ($key1, $val1) = get($fh1)  or return; }
         elsif ($key1 > $key2) { ($key2, $val2) = get($fh2)  or return; }
         else {
            print("$key1,$val1,$val2\n");
            ($key1, $val1) = get($fh1)  or return;
            ($key2, $val2) = get($fh2)  or return;
         }
      }
   }

   main();
' file1 file2 >file

For 10,000,000 records in each file, this took 37 seconds on a slowish machine.

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "X15X1211,J,S,12,15,100.05" for 1..10_000_000' >file1

$ perl -e'printf "%d|%s\n", 10_000_000-$_, "AJ15,,,16,PP" for 1..10_000_000' >file2

$ time perl -e'...' file1 file2 >file
real    0m37.030s
user    0m38.261s
sys     0m1.750s
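If the external sort itself becomes the bottleneck at 300 million rows, GNU coreutils sort accepts tuning flags that can be passed through the same open() calls. The following is a hedged sketch of that variation; the buffer size, thread count, and file name are illustrative assumptions, and the flags require GNU sort.

#!/usr/bin/perl
use strict;
use warnings;

# LC_ALL=C avoids locale-aware collation in the child sort process;
# -S sets the in-memory sort buffer and --parallel the worker count.
$ENV{LC_ALL} = "C";
open(my $fh1, "-|", "sort", "-S", "2G", "--parallel=4", "-n", "-t", "|", "file1")
    or die "Cannot run sort on file1: $!";
while (my $line = <$fh1>) {
    # ... merge logic as in the script above ...
}
close($fh1);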


Alternatively, one could dump the data into a database and let it handle the details.

sqlite3 <<'EOI'
CREATE TABLE file1 ( id INTEGER, value TEXT );
CREATE TABLE file2 ( id INTEGER, value TEXT );
.mode list
.separator |
.import file1 file1
.import file2 file2
.output file
SELECT file1.id || ',' || file1.value || ',' || file2.value
  FROM file1
  JOIN file2
    ON file2.id = file1.id;
.exit
EOI

But you pay for the flexibility. This took twice as long.

real    1m14.065s
user    1m11.009s
sys     0m2.550s

Note: I originally had CREATE INDEX file2_id ON file2 ( id ); after the .import commands, but removing it greatly helped performance.
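The same join can also be driven from Perl via DBI. Below is a minimal sketch, assuming DBI and DBD::SQLite are installed and that the two tables were imported into an on-disk database (here called merge.db, an assumed name) rather than the transient one used by the heredoc above.

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=merge.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

open(my $out, '>', 'file') or die "Could not open output file: $!";

# Same join as the sqlite3 script above, streamed row by row.
my $sth = $dbh->prepare(q{
    SELECT file1.id, file1.value, file2.value
      FROM file1
      JOIN file2 ON file2.id = file1.id
});
$sth->execute();

while (my ($id, $v1, $v2) = $sth->fetchrow_array()) {
    print $out "$id,$v1,$v2\n";
}

close($out);
$dbh->disconnect();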
