Can I use a list as a hash in R? If so, why is it so slow?

Question

Before using R, I used quite a bit of Perl. In Perl, I would often use hashes, and lookups of hashes are generally regarded as fast in Perl.

For example, the following code will populate a hash with up to 10000 key/value pairs, where the keys are random letters and the values are random integers. Then, it does 10000 random lookups in that hash.

#!/usr/bin/perl -w
use strict;

my @letters = ('a'..'z');

print @letters . "\n";
my %testHash;

for(my $i = 0; $i < 10000; $i++) {
    my $r1 = int(rand(26));
    my $r2 = int(rand(26));
    my $r3 = int(rand(26));
    my $key = $letters[$r1] . $letters[$r2] . $letters[$r3];
    my $value = int(rand(1000));
    $testHash{$key} = $value;
}

my @keyArray = keys(%testHash);
my $keyLen = scalar @keyArray;

for(my $j = 0; $j < 10000; $j++) {
    my $key = $keyArray[int(rand($keyLen))];
    my $lookupValue = $testHash{$key};
    print "key " . $key . " Lookup $lookupValue \n";
}

Increasingly, I find myself wanting a hash-like data structure in R. The following is the equivalent R code:

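testHash <- list()

for (i in 1:10000) {
  # +1 because R vectors are 1-indexed; index 0 would silently drop a letter
  key.tmp <- letters[floor(26 * runif(3)) + 1]
  # paste(key.tmp, collapse = "") would build the same string more directly
  key <- capture.output(cat(key.tmp, sep = ""))
  value <- floor(1000 * runif(1))
  testHash[[key]] <- value
}

keyArray <- names(testHash)
keyLen <- length(keyArray)

for (j in 1:10000) {
  # +1 so the index falls in 1..keyLen rather than 0..keyLen-1
  key <- keyArray[floor(keyLen * runif(1)) + 1]
  lookupValue <- testHash[[key]]
  print(paste("key", key, "Lookup", lookupValue))
}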

The two scripts appear to do equivalent things. However, the Perl version is much faster:

>time ./perlHashTest.pl
real    0m4.346s
user    0m0.110s
sys     0m0.100s

Compared to R:

time R CMD BATCH RHashTest.R

real    0m8.210s
user    0m7.630s
sys     0m0.200s

What explains the discrepancy? Are lookups in R lists just not good?

Increasing to a list length of 100,000 and 100,000 lookups only exaggerates the discrepancy. Is there a better alternative for a hash data structure in R than the native list()?

Recommended answer

The underlying reason is that R lists with named elements are not hashed. Hash lookups are O(1): on insert, a hash function converts the key to an integer, and the value is stored at position hash(key) % num_spots of an array num_spots long (this is a big simplification that ignores the complexity of handling collisions). Looking up a key only requires hashing it to find the value's position, which is O(1), versus an O(n) scan of an array. R lists look names up by scanning, which is O(n).
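
To make the hash(key) % num_spots idea concrete, here is a minimal toy sketch in R. Everything in it (toy_hash, spot, put, lookup, and the choice of 8 spots) is a hypothetical name or value picked for illustration; it is not how R, Perl, or any package implements hashing. Unlike the simplified description above, it keeps a small named list per spot so that colliding keys can coexist.

```r
# Toy hash table with separate chaining; illustrative only.
num_spots <- 8L

# Hypothetical hash function: sum of the key's character codes.
toy_hash <- function(key) sum(utf8ToInt(key))

# Map a key to a 1-based spot index (R vectors start at 1).
spot <- function(key) toy_hash(key) %% num_spots + 1L

# One (initially empty) named list per spot, so colliding keys can share a spot.
buckets <- replicate(num_spots, list(), simplify = FALSE)

# Insert: hash the key, then store the value in that spot's small list.
put <- function(buckets, key, value) {
  i <- spot(key)
  buckets[[i]][[key]] <- value
  buckets
}

# Lookup: hash the key and search only that one spot, not the whole table.
lookup <- function(buckets, key) {
  buckets[[spot(key)]][[key]]
}

buckets <- put(buckets, "abc", 42)
buckets <- put(buckets, "xyz", 7)
buckets <- put(buckets, "cba", 99)  # same character sum as "abc": a collision

lookup(buckets, "abc")
lookup(buckets, "cba")
```

A lookup only inspects the keys that landed in one spot, so with enough spots and a reasonable hash function the work per lookup stays roughly constant as the table grows. A plain named list, by contrast, may compare the key against every name.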

As Dirk says, use the hash package. A major limitation of this approach is that it relies on environments (which are hashed) and on overriding the [ methods to mimic hash tables. But an environment cannot contain another environment, so you cannot have nested hashes with the hash function.
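
Hashed environments are also available directly in base R, without any package. The following is a minimal sketch of using one as a key/value store; the variable name h and the sample keys are made up for illustration, and this is not a description of how the hash package works internally.

```r
# A base R environment used as a mutable key/value store (no packages needed).
h <- new.env(hash = TRUE)

# Two equivalent ways to insert a value under a string key.
assign("abc", 42, envir = h)
h[["xyz"]] <- 7

# Lookups by key.
get("abc", envir = h)
h[["xyz"]]

# Test for a key without searching enclosing environments.
exists("nope", envir = h, inherits = FALSE)

# List the stored keys (ls() returns them sorted) and fetch several at once.
ls(h)
mget(c("abc", "xyz"), envir = h)
```

Note that environments have reference semantics: modifying h inside a function changes it for the caller too, which differs from how lists behave.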

A while back I worked on implementing a pure hash table data structure in C/R that could be nested, but it went onto my project back burner while I worked on other things. It would be nice to have, though :-)
