我可以在R中使用列表作为散列吗?如果是这样,为什么这么慢? [英] Can I use a list as a hash in R? If so, why is it so slow?
问题描述
#!/ usr / bin / perl -w
use严格;
my @letters =('a'..'z');
打印@letters。 \\\
;
my%testHash; (my $ i = 0; $ i <10000; $ i ++){
my $ r1 = int(rand(26));
;
my $ r2 = int(rand(26));
my $ r3 = int(rand(26));
my $ key = $信函[$ r1]。 $字母[$ r2]。 $信[$ R 3]。
my $ value = int(rand(1000));
$ testHash {$ key} = $ value;
}
my @keyArray = keys(%testHash);
my $ keyLen =标量@keyArray;
$ b $ for(my $ j = 0; $ j <10000; $ j ++){
my $ key = $ keyArray [int(rand($ keyLen))]];
my $ lookupValue = $ testHash {$ key};
打印钥匙。 $键。 查找$ lookupValue \\\
;
}
现在越来越多,我想要在类似哈希的数据结构R.以下是等效的R代码:
testHash< - list()
for (我在1:10000){
key.tmp = paste(字母[floor(26 * runif(3))],sep =)
key< - capture.output(cat key.tmp,sep =))
value< - floor(1000 * runif(1))
testHash [[key]] - value
keyArray< - attributes(testHash)$ names
keyLen = length(keyArray);
for(j in 1:10000){
key< - keyArray [floor(keyLen * runif(1))]
lookupValue = testHash [[key]]
print(paste(key,key,Lookup,lookupValue))
}
这段代码似乎在做同样的事情。然而,Perl的速度要快得多:
>时间./perlHashTest.pl
真实0m4.346s
用户** 0m0.110s **
sys 0m0.100s
与R:
time R CMD BATCH RHashTest.R
real 0m8.210s
user ** 0m7.630s **
sys 0m0.200s
什么解释了这种差异? R列表中的查找不好吗?
增加到100,000个列表长度和100,000个查找只会夸大这个差异?对于R中的散列数据结构,有没有比本地列表()更好的选择?
具有命名元素的列表不被散列。哈希查找是O(1),因为在插入期间,使用散列函数将密钥转换为整数,然后将值放入空间散列(键)%num_spots
数组 num_spots
long(这是一个大的简化,避免了处理冲突的复杂性)。查找密钥只需要对密钥进行散列查找值的位置(与O(n)数组查找相比,它是O(1))。 R列表使用O(n)的名称查找。
正如Dirk所说,使用散列包。与此相关的一个巨大局限是它使用环境(被哈希)和重写 [
方法来模拟哈希表。但是一个环境不能包含其他环境,所以你不能使用散列函数进行嵌套散列。
后来我开始在C / R中实现一个可以嵌套的纯哈希表数据结构,但是当我在其他工作环境中工作时,的东西。尽管如此: - )
Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.
For example, the following code will populate a hash with up to 10000 key/value pairs, where the keys are random letters and the values are random integers. Then, it does 10000 random lookups in that hash.
#!/usr/bin/perl -w
use strict;
my @letters = ('a'..'z');
print @letters . "\n";
my %testHash;
for(my $i = 0; $i < 10000; $i++) {
my $r1 = int(rand(26));
my $r2 = int(rand(26));
my $r3 = int(rand(26));
my $key = $letters[$r1] . $letters[$r2] . $letters[$r3];
my $value = int(rand(1000));
$testHash{$key} = $value;
}
my @keyArray = keys(%testHash);
my $keyLen = scalar @keyArray;
for(my $j = 0; $j < 10000; $j++) {
my $key = $keyArray[int(rand($keyLen))];
my $lookupValue = $testHash{$key};
print "key " . $key . " Lookup $lookupValue \n";
}
Now that increasingly, I am wanting to have a hash-like data structure in R. The following is the equivalent R code:
testHash <- list()
for(i in 1:10000) {
key.tmp = paste(letters[floor(26*runif(3))], sep="")
key <- capture.output(cat(key.tmp, sep=""))
value <- floor(1000*runif(1))
testHash[[key]] <- value
}
keyArray <- attributes(testHash)$names
keyLen = length(keyArray);
for(j in 1:10000) {
key <- keyArray[floor(keyLen*runif(1))]
lookupValue = testHash[[key]]
print(paste("key", key, "Lookup", lookupValue))
}
The code seem to be doing equivalent things. However, the Perl one is much faster:
>time ./perlHashTest.pl
real 0m4.346s
user **0m0.110s**
sys 0m0.100s
Comparing to R:
time R CMD BATCH RHashTest.R
real 0m8.210s
user **0m7.630s**
sys 0m0.200s
What explains the discrepancy? Are lookups in R lists just not good?
Increasing to 100,000 list length and 100,000 lookups only exaggerates the discrepancy? Is there a better alternative for the hash data structure in R than the native list()?
The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1), because during insert the key is converted to an integer using a hash function, and then the value put in the space hash(key) % num_spots
of an array num_spots
long (this is a big simplification and avoids the complexity of dealing with collisions). Lookups of the key just require hashing the key to find the value's position (which is O(1), versus a O(n) array lookup). R lists use name lookups which are O(n).
As Dirk says, use the hash package. A huge limitation with this is that it uses environments (which are hashed) and overriding of [
methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash function.
A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went on my project back burner while I worked on other things. It would be nice to have though :-)
这篇关于我可以在R中使用列表作为散列吗?如果是这样,为什么这么慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!