Options for caching / memoization / hashing in R
I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended to do both Perl-style hashing and write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.
Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?
As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).
It looks like the options are:
Hashing
- digest - provides hashing for arbitrary R objects.
Memoization
- memoise - a very simple tool for memoization of functions.
- R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.
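To make the idea concrete, here is a minimal base-R sketch of what a memoizing wrapper does in principle. It is not how memoise or R.cache are implemented; `memoize_simple`, `slow_square` and the `deparse()`-based key are hypothetical names and choices made for illustration only (real packages typically hash the arguments, e.g. with digest).

```r
# A hand-rolled, in-memory memoizer (illustrative sketch only).
memoize_simple <- function(f) {
  # One private cache per memoized function, held in the closure.
  cache <- new.env(hash = TRUE, parent = emptyenv())
  function(...) {
    # Assumption: deparse() yields a usable key for small, simple arguments.
    key <- paste(deparse(list(...)), collapse = "\n")
    if (exists(key, envir = cache, inherits = FALSE)) {
      return(get(key, envir = cache))
    }
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }
}

# Count how often the underlying function really runs.
calls <- 0L
slow_square <- function(x) {
  calls <<- calls + 1L
  x^2
}

fast_square <- memoize_simple(slow_square)
r1 <- fast_square(4)  # computed
r2 <- fast_square(4)  # served from the cache
r3 <- fast_square(5)  # computed
```

The cache lives only as long as the R session, and the text key is only suitable for small arguments; both are limitations of this sketch rather than of the packages listed above.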
Caching
- hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.
Key/value storage
These are basic options for external storage of R objects.
- stashR
- filehash
Checkpointing
- cacher - this seems to be more akin to checkpointing.
- CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.
- DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.
Other
- Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
- The data.table package supports rapid lookups of elements in a data table.
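As a rough illustration of the base R route, an environment can act as a mutable key/value store, much like a Perl hash or a Python dictionary. The sketch below uses only base functions; the keys and values are made up for the example.

```r
# An environment used as a mutable key/value store.
h <- new.env(hash = TRUE, parent = emptyenv())

# Insert or update entries by string key.
assign("apple", 3, envir = h)
h[["banana"]] <- 5

# Look up a value and test for membership.
apple_count <- get("apple", envir = h)
has_cherry  <- exists("cherry", envir = h, inherits = FALSE)

# List the keys (ls() returns them sorted), then delete one.
keys <- ls(h)
rm("banana", envir = h)

# mget() fetches several values at once as a named list.
vals <- mget(c("apple"), envir = h)
```

One caveat: `ls()` hides names that start with a dot unless `all.names = TRUE` is passed, which matters if arbitrary strings are used as keys.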
Use case
Although I'm mostly interested in knowing the options, I have two basic use cases that arise:
- Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
- Memoization of monstrous calculations.
These really arise because I'm digging into the profiling of some slooooow code and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
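The first use case can be sketched in base R by reading the input in chunks and keeping the counts in an environment, so the full set of strings never has to be in memory at once. `count_strings` and the toy input are hypothetical; for non-string inputs, a hash of each value (for example from the digest package) could serve as the key instead.

```r
# Count strings chunk by chunk, using an environment as the hash.
count_strings <- function(con, chunk_size = 1000L) {
  counts <- new.env(hash = TRUE, parent = emptyenv())
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    for (s in chunk) {
      # Empty strings cannot be used as names in an environment; skip them here.
      if (!nzchar(s)) next
      old <- counts[[s]]
      counts[[s]] <- if (is.null(old)) 1L else old + 1L
    }
  }
  counts
}

# Toy input standing in for a large file or stream.
con <- textConnection(c("a", "b", "a", "c", "a", "b"))
counts <- count_strings(con, chunk_size = 2L)
close(con)

keys   <- ls(counts, all.names = TRUE)
result <- unlist(mget(keys, envir = counts))

# Share of inputs that repeat an earlier one: a rough upper bound on
# how often a memoized function could answer from its cache.
repeat_fraction <- 1 - length(keys) / sum(result)
```

If `repeat_fraction` is close to zero, almost every input is new and memoization is unlikely to pay off.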
Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.
Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)
- Dirk Eddelbuettel: digest - a lot of other packages depend on this.
- Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
- Christopher Brown: hash - seems to be a useful package, but the links to ODG are down, unfortunately.
- Henrik Bengtsson: R.cache & Hadley Wickham: memoise - it's not yet clear when to prefer one package over the other.
Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".
I did not have luck with memoise because it caused a "too deep recursion" problem with some functions of a package I tried it with. With R.cache I had better luck. The following is annotated code that I adapted from the R.cache documentation. The code shows different options for caching.
# Workaround to avoid the interactive question when loading the R.cache library
dir.create(path="~/.Rcache", showWarnings=FALSE)
library("R.cache")
setCacheRootPath(path="./.Rcache") # Create .Rcache in the current working dir
# In case we need the cache path (not used in this example)
cache.root <- getCacheRootPath()
simulate <- function(mean, sd) {
  # 1. Try to load cached data, if already generated
  key <- list(mean, sd)
  data <- loadCache(key)
  if (!is.null(data)) {
    cat("Loaded cached data\n")
    return(data)
  }
  # 2. If not available, generate it.
  cat("Generating data from scratch...")
  data <- rnorm(1000, mean=mean, sd=sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("ok\n")
  saveCache(data, key=key, comment="simulate()")
  data
}
data <- simulate(2.3, 3.0)
data <- simulate(2.3, 3.5)
a <- 2.3
b <- 3.0
data <- simulate(a, b) # Will load cached data; params are checked by value
# Clean up
file.remove(findCache(key=list(2.3, 3.0)))
file.remove(findCache(key=list(2.3, 3.5)))
simulate2 <- function(mean, sd) {
  data <- rnorm(1000, mean=mean, sd=sd)
  Sys.sleep(1) # Emulate slow algorithm
  cat("Done generating data from scratch\n")
  data
}
# Easy way to memoize a function.
# This would work with any function, including ones from external packages.
mzs <- addMemoization(simulate2)
data <- mzs(2.3, 3.0)
data <- mzs(2.3, 3.5)
data <- mzs(2.3, 3.0) # Will load cached data
# It is also possible to reassign the function name,
# but different memoizations of the same
# function will return the same cached result
# if the input params are the same.
simulate2 <- addMemoization(simulate2)
data <- simulate2(2.3, 3.0)
# If the expression being evaluated depends on
# "input" objects, then these must be specified
# explicitly as "key" objects.
for (ii in 1:2) {
  for (kk in 1:3) {
    cat(sprintf("Iteration #%d:\n", kk))
    res <- evalWithMemoization({
      cat("Evaluating expression...")
      a <- kk
      Sys.sleep(1)
      cat("done\n")
      a
    }, key=list(kk=kk))
    # The expression that produces 'res' is skipped on repeated runs
    print(res)
    # Sanity checks
    stopifnot(a == kk)
    # Clean up
    rm(a)
  } # for (kk ...)
} # for (ii ...)
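For readers who want to see the load-or-compute pattern without any package, here is a minimal base-R sketch of a file-backed cache. It is not R.cache's implementation: `cached`, `cache_file`, the directory name and the file-naming scheme are all hypothetical, and the naming scheme is only safe for short, simple keys (a real implementation would hash the key).

```r
# Illustrative file-backed cache built on saveRDS()/readRDS().
cache_dir <- file.path(tempdir(), "demo-cache")
dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

cache_file <- function(key) {
  # Assumption: a sanitized deparse() of the key is unique enough for a demo.
  name <- gsub("[^A-Za-z0-9_.-]", "_", paste(deparse(key), collapse = ""))
  file.path(cache_dir, paste0(name, ".rds"))
}

cached <- function(key, compute) {
  path <- cache_file(key)
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: load the stored result
  }
  value <- compute()       # cache miss: compute and store
  saveRDS(value, path)
  value
}

# Count how often the expensive step actually runs.
n_computed <- 0L
expensive <- function() {
  n_computed <<- n_computed + 1L
  sum(1:100)
}

v1 <- cached(list(2.3, 3.0), expensive)  # computed and written to disk
v2 <- cached(list(2.3, 3.0), expensive)  # read back from disk
v3 <- cached(list(2.3, 3.5), expensive)  # different key, computed again

# Clean up the demo cache.
unlink(cache_dir, recursive = TRUE)
```

Because the results live in files, a cache like this can outlive the R session if it is placed somewhere other than `tempdir()`, which is the main practical difference from an in-memory cache.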