Fast loading of large hash table in Perl
Question
I have about 30 text files with the structure
wordleft1|wordright1
wordleft2|wordright2
wordleft3|wordright3
...
The total size of the files is about 1 GB with about 32 million lines of word combinations.
I tried a few approaches to load them as fast as possible and store the combinations within a hash
$hash{$wordleft} = $wordright
Opening file by file and reading line by line takes about 42 seconds. I then store the hash with the Storable module
store \%hash, $filename
Loading the data again
$hashref = retrieve $filename
reduces the time to about 28 seconds. I use a fast SSD drive and a fast CPU and have enough RAM to hold all the data (it takes about 7 GB).
I'm searching for a faster way to load this data into the RAM (I can't keep it there for a few reasons).
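The two approaches described above can be sketched as one runnable script. This is a minimal illustration, not the original poster's code: the file names (sample.txt, sample.storable) are invented, and a tiny generated sample stands in for the real 1 GB inputs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Build a small sample input in place of the real pipe-delimited files
open my $out, '>', 'sample.txt' or die "write sample.txt: $!";
print $out "wordleft$_|wordright$_\n" for 1 .. 3;
close $out;

# Approach 1: read line by line into a hash
my %hash;
open my $in, '<', 'sample.txt' or die "read sample.txt: $!";
while ( my $line = <$in> ) {
    chomp $line;
    my ( $left, $right ) = split /\|/, $line, 2;
    $hash{$left} = $right;
}
close $in;

# Approach 2: serialize once with Storable, then reload on later runs
store \%hash, 'sample.storable';
my $hashref = retrieve 'sample.storable';
print "$hashref->{wordleft2}\n";    # prints "wordright2"
```

On the real data, approach 1 pays the parsing cost every run, while approach 2 pays it once and then deserializes, which is the 42 s vs. 28 s difference the question measured.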
Solution
You could try Dan Bernstein's CDB file format via a tied hash, which requires minimal code change. You may need to install CDB_File. On my laptop, the cdb file opens very quickly and I can do about 200-250k lookups per second. Here is an example script that creates, uses, and benchmarks a cdb:
test_cdb.pl
#!/usr/bin/env perl
use warnings;
use strict;
use Benchmark qw(:all) ;
use CDB_File 'create';
use Time::HiRes qw( gettimeofday tv_interval );
scalar @ARGV or die "usage: $0 number_of_keys seconds_to_benchmark\n";
my ($size) = $ARGV[0] || 1000;
my ($seconds) = $ARGV[1] || 10;
my $t0;
tic();
# Create CDB
my ($file, %data);
%data = map { $_ => 'something' } (1..$size);
print "Created $size element hash in memory\n";
toc();
$file = 'data.cdb';
create %data, $file, "$file.$$";
my $bytes = -s $file;
print "Created data.cdb [ $size keys and values, $bytes bytes]\n";
toc();
# Read from CDB
my $c = tie my %h, 'CDB_File', 'data.cdb' or die "tie failed: $!\n";
print "Opened data.cdb as a tied hash.\n";
toc();
timethese( -1 * $seconds, {
    'Pick Random Key'    => sub { int rand $size },
    'Fetch Random Value' => sub { $h{ int rand $size }; },
});
tic();
print "Fetching Every Value\n";
for (0 .. $size) {
    no warnings;    # Useless use of hash element
    $h{ $_ };
}
toc();

sub tic {
    $t0 = [gettimeofday];
}

sub toc {
    my $t1 = [gettimeofday];
    my $elapsed = tv_interval( $t0, $t1 );
    $t0 = $t1;
    print "==> took $elapsed seconds\n";
}
Output (1 million keys, tested over 10 seconds)
./test_cdb.pl 1000000 10
Created 1000000 element hash in memory
==> took 2.882813 seconds
Created data.cdb [ 1000000 keys and values, 38890944 bytes]
==> took 2.333624 seconds
Opened data.cdb as a tied hash.
==> took 0.00015 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 10 wallclock secs (10.46 usr + 0.01 sys = 10.47 CPU) @ 236984.72/s (n=2481230)
Pick Random Key: 9 wallclock secs (10.11 usr + 0.02 sys = 10.13 CPU) @ 3117208.98/s (n=31577327)
Fetching Every Value
==> took 3.514183 seconds
Output (10 million keys, tested over 10 seconds)
./test_cdb.pl 10000000 10
Created 10000000 element hash in memory
==> took 44.72331 seconds
Created data.cdb [ 10000000 keys and values, 398890945 bytes]
==> took 25.729652 seconds
Opened data.cdb as a tied hash.
==> took 0.000222 seconds
Benchmark: running Fetch Random Value, Pick Random Key for at least 10 CPU seconds...
Fetch Random Value: 14 wallclock secs ( 9.65 usr + 0.35 sys = 10.00 CPU) @ 209811.20/s (n=2098112)
Pick Random Key: 12 wallclock secs (10.40 usr + 0.02 sys = 10.42 CPU) @ 2865335.22/s (n=29856793)
Fetching Every Value
==> took 38.274356 seconds
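To tie the benchmark back to the original problem, here is a minimal sketch (file names invented) of building one cdb from the question's pipe-delimited word files. CDB_File is a CPAN module, so the cdb step is guarded in case it is not installed; when `create` is imported, its prototype allows the `create %data, $file, $tmp` form used in the script above, while the fully qualified call here passes the hash reference explicitly.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stand-in for one of the ~30 real input files (hypothetical name)
open my $out, '>', 'words1.txt' or die "write words1.txt: $!";
print $out "wordleft1|wordright1\nwordleft2|wordright2\n";
close $out;

# Parse every pipe-delimited file into one hash
my %pairs;
for my $path ( glob 'words*.txt' ) {
    open my $fh, '<', $path or die "open $path: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        my ( $left, $right ) = split /\|/, $line, 2;
        $pairs{$left} = $right if defined $right;
    }
    close $fh;
}

# Write the cdb once; later runs only tie it, which is near-instant
if ( eval { require CDB_File; 1 } ) {
    CDB_File::create( \%pairs, 'words.cdb', "words.cdb.$$" );
    print "wrote words.cdb with ", scalar keys %pairs, " keys\n";
}
else {
    print "CDB_File not installed; skipping cdb creation\n";
}
```

The one-time conversion cost replaces the 28-second Storable load on every run with an almost free tie, at the price of per-lookup overhead shown in the benchmarks above.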