fast loading of large hash table in Perl


Problem description

I have about 30 text files with the structure

wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...

The total size of the files is about 1 GB with about 32 million lines of word combinations.

I tried a few approaches to load them as fast as possible and store the combinations within a hash

$hash{$wordleft} = $wordright
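
A minimal sketch of that per-line load, assuming the input files match words*.txt (the actual file names aren't given here):

use strict;
use warnings;

my %hash;
for my $file (glob 'words*.txt') {    # assumed naming
    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        # each line looks like "wordleft|wordright"
        my ($wordleft, $wordright) = split /\|/, $line, 2;
        $hash{$wordleft} = $wordright;
    }
    close $fh;
}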

Opening file by file and reading line by line takes about 42 seconds. I then store the hash with the Storable module

store \%hash, $filename

Loading the data again

$hashref = retrieve $filename

reduces the time to about 28 seconds. I use a fast SSD drive and a fast CPU and have enough RAM to hold all the data (it takes about 7 GB).
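
Spelled out, that Storable round trip looks like this (the path is a placeholder, not from the question):

use strict;
use warnings;
use Storable qw(store retrieve);

my %hash = (wordleft1 => 'wordright1');    # stand-in for the loaded data
my $filename = 'hash.storable';            # placeholder path

# one-time serialization of the in-memory hash ...
store \%hash, $filename;

# ... and the faster reload on later runs (about 28 seconds here)
my $hashref = retrieve $filename;
die "retrieve $filename failed" unless defined $hashref;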

I'm searching for a faster way to load this data into RAM (I can't keep it there for a few reasons).

Solution

You could try Dan Bernstein's CDB file format, accessed through a tied hash, which requires minimal code change. You may need to install CDB_File. On my laptop, the cdb file opens very quickly and I can do about 200-250k lookups per second. Here is an example script to create, use, and benchmark a cdb:

test_cdb.pl

#!/usr/bin/env perl

use warnings;
use strict;

use Benchmark qw(:all);
use CDB_File 'create';
use Time::HiRes qw( gettimeofday tv_interval );

scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
my ($size)    = $ARGV[0] || 1000;
my ($seconds) = $ARGV[1] || 10;

my $t0;
tic();

# Create CDB
my ($file, %data);

%data = map { $_ => 'something' } (1..$size);
print "Created $size element hash in memory\n";
toc();

$file = 'data.cdb';
create %data, $file, "$file.$$";
my $bytes = -s $file;
print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
toc();

# Read from CDB
my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
print "Opened data.cdb as a tied hash.\n";
toc();

timethese( -1 * $seconds, {
          'Pick Random Key'    => sub { int rand $size },
          'Fetch Random Value' => sub { $h{ int rand $size }; },
});

tic();
print "Fetching Every Value\n";
for (1..$size) {    # keys run from 1 to $size
    no warnings; # Useless use of hash element
    $h{ $_ };
}
toc();

sub tic {
    $t0 = [gettimeofday];    
}

sub toc {
    my $t1 = [gettimeofday];
    my $elapsed = tv_interval ( $t0, $t1);
    $t0 = $t1;
    print "==> took $elapsed seconds\n";
}

Output ( 1 million keys, tested over 10 seconds )

./test_cdb.pl 1000000 10

Created 1000000 element hash in memory
==> took 2.882813 seconds
Created data.cdb [ 1000000 keys and values, 38890944 bytes]
==> took 2.333624 seconds
Opened data.cdb as a tied hash.
==> took 0.00015 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 10 wallclock secs (10.46 usr +  0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
Pick Random Key:  9 wallclock secs (10.11 usr +  0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
Fetching Every Value
==> took 3.514183 seconds

Output ( 10 million keys, tested over 10 seconds )

./test_cdb.pl 10000000 10

Created 10000000 element hash in memory
==> took 44.72331 seconds
Created data.cdb [ 10000000 keys and values, 398890945 bytes] 
==> took 25.729652 seconds
Opened data.cdb as a tied hash.
==> took 0.000222 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 14 wallclock secs ( 9.65 usr +  0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
Pick Random Key: 12 wallclock secs (10.40 usr +  0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
Fetching Every Value
==> took 38.274356 seconds
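
The script above benchmarks a synthetic hash; for the actual data in the question, a one-time conversion of the 30 pipe-delimited files into a cdb could use CDB_File's object interface. This is only a sketch, and the file names are assumptions:

#!/usr/bin/env perl
use strict;
use warnings;
use CDB_File;

my $out = 'words.cdb';
my $cdb = CDB_File->new($out, "$out.$$")
    or die "CDB_File->new failed: $!";

for my $file (glob 'words*.txt') {    # assumed input naming
    open my $fh, '<', $file or die "open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($left, $right) = split /\|/, $line, 2;
        $cdb->insert($left, $right);
    }
    close $fh;
}

$cdb->finish or die "finish failed: $!";

After the build, each run can simply tie %h, 'CDB_File', 'words.cdb' and look values up immediately; as the timings above show, opening the cdb costs well under a millisecond.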
