Options for caching / memoization / hashing in R

Views: 118 | Posted: 2018/6/1 15:16:06 | Tags: r, caching, hash, memoization, memoise


Question

I am trying to find a simple way to use something like Perl's hash functions in R (essentially caching), as I intended both to do Perl-style hashing and to write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than to use the hash package, which doesn't seem to underpin the two memoization packages.

Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?

As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

It looks like the options are:

Hashing

digest - provides hashing for arbitrary R objects.

Memoization

memoise - a very simple tool for memoization of functions.

R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.

Caching

hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.

Key/value storage

These are basic options for external storage of R objects.

stashR

filehash

Checkpointing

cacher - this seems to be more akin to checkpointing.

CodeDepends - An OmegaHat project that underpins cacher and provides some useful functionality.

DMTCP (not an R package) - appears to support checkpointing in a bunch of languages, and a developer recently sought assistance testing DMTCP checkpointing in R.

Other

Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)

The data.table package supports rapid lookups of elements in a data table.
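
As a rough illustration of the "names of items in environments" option above, here is a minimal sketch of a base-R environment used as a Perl-style hash. The helper name `lookup` is made up for this example and is not part of base R or of any package listed here.

```r
# A base-R environment used as a Perl-style hash (Python-style dict).
# parent = emptyenv() keeps lookups from falling through to other scopes.
h <- new.env(hash = TRUE, parent = emptyenv())

# Insert or update entries; keys must be non-empty character strings.
h[["apple"]] <- 3
assign("banana", 5, envir = h)

# Hypothetical helper: look up a key, returning a default when it is missing.
lookup <- function(env, key, default = NULL) {
  if (exists(key, envir = env, inherits = FALSE)) {
    get(key, envir = env, inherits = FALSE)
  } else {
    default
  }
}

lookup(h, "apple")                # value stored under "apple"
lookup(h, "cherry", default = 0)  # key absent, so the default is returned

# Delete a key, then list the remaining keys.
rm("banana", envir = h)
ls(h, all.names = TRUE)
```

Unlike a list, an environment is modified in place, so it can be passed to functions and updated without copying.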

Use case

Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]

Memoization of monstrous calculations.
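
For the string-counting use case, something along these lines might work with base R alone. It is only a sketch: the function name `count_strings` and the parameter `chunk_size` are invented for illustration, and it assumes the connection is already open so that successive reads continue where the previous one stopped.

```r
# Count strings incrementally, reading a few lines at a time so the
# whole input never has to be in memory at once.
count_strings <- function(con, chunk_size = 1000L) {
  counts <- new.env(hash = TRUE, parent = emptyenv())
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break      # end of input
    for (s in lines) {
      if (!nzchar(s)) next              # "" cannot be used as a key
      old <- counts[[s]]                # NULL when the key is new
      counts[[s]] <- if (is.null(old)) 1L else old + 1L
    }
  }
  counts
}

# Usage: a textConnection is open as soon as it is created.
# For a file, open it first, e.g. con <- file(path, open = "r").
con <- textConnection(c("a", "b", "a", "c", "a", "b"))
counts <- count_strings(con, chunk_size = 2L)
close(con)

# Collect the counts into a named vector for inspection.
unlist(mget(ls(counts, all.names = TRUE), envir = counts))
```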

These really arise because I'm digging into the profiling of some slooooow code, and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
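
One way to check whether memoization would pay off is a hand-rolled wrapper that hashes the inputs and counts cache hits and misses. The sketch below is an assumption-laden illustration, not how memoise or R.cache work internally: `make_key`, `memoize_with_stats` and `slow_square` are hypothetical names, and the key falls back to a hex dump of the serialized arguments when the digest package is not installed, which is only reasonable for small arguments.

```r
# Hypothetical helper: turn a list of arguments into a string key.
make_key <- function(args) {
  if (requireNamespace("digest", quietly = TRUE)) {
    digest::digest(args)   # fixed-length hash of the R object
  } else {
    # Fallback for small arguments only: hex dump of the serialized object.
    paste(serialize(args, connection = NULL), collapse = "")
  }
}

# Hypothetical wrapper: memoize f in memory and record hits and misses.
memoize_with_stats <- function(f) {
  cache <- new.env(hash = TRUE, parent = emptyenv())
  stats <- new.env(parent = emptyenv())
  stats$hits <- 0L
  stats$misses <- 0L
  wrapper <- function(...) {
    key <- make_key(list(...))
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats$hits <- stats$hits + 1L
      return(get(key, envir = cache, inherits = FALSE))
    }
    stats$misses <- stats$misses + 1L
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }
  attr(wrapper, "stats") <- stats
  wrapper
}

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive calculation
  x^2
}
fast_square <- memoize_with_stats(slow_square)
fast_square(4)
fast_square(4)    # same input again: answered from the cache
fast_square(5)

st <- attr(fast_square, "stats")
c(hits = st$hits, misses = st$misses)
```

A high ratio of hits to misses on real inputs suggests that one of the memoization packages is worth adopting; a ratio near zero suggests the inputs rarely repeat.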

Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)

Dirk Eddelbuettel: digest - a lot of other packages depend on this.

Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.

Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.

Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".

Solution

I had no luck with memoise: it produced a "too deep recursion" problem with some functions of a package I tried it on. I had better luck with R.cache. Below is annotated code that I adapted from the R.cache documentation. It shows different ways to do caching.

    # Workaround to avoid the question asked when loading the R.cache library
    dir.create(path="~/.Rcache", showWarnings=FALSE)
    library("R.cache")
    setCacheRootPath(path="./.Rcache")  # Create .Rcache in the current working dir

    # In case we need the cache path, but not used in this example.
    cache.root <- getCacheRootPath()

    simulate <- function(mean, sd) {
      # 1. Try to load cached data, if already generated
      key <- list(mean, sd)
      data <- loadCache(key)
      if (!is.null(data)) {
        cat("Loaded cached data\n")
        return(data)
      }

      # 2. If not available, generate it.
      cat("Generating data from scratch...")
      data <- rnorm(1000, mean=mean, sd=sd)
      Sys.sleep(1)  # Emulate slow algorithm
      cat("ok\n")
      saveCache(data, key=key, comment="simulate()")
      data
    }

    data <- simulate(2.3, 3.0)
    data <- simulate(2.3, 3.5)
    a <- 2.3
    b <- 3.0
    data <- simulate(a, b)  # Will load cached data, params are checked by value

    # Clean up
    file.remove(findCache(key=list(2.3, 3.0)))
    file.remove(findCache(key=list(2.3, 3.5)))

    simulate2 <- function(mean, sd) {
      data <- rnorm(1000, mean=mean, sd=sd)
      Sys.sleep(1)  # Emulate slow algorithm
      cat("Done generating data from scratch\n")
      data
    }

    # Easy step to memoize a function.
    # This would work with any function from an external package.
    mzs <- addMemoization(simulate2)

    data <- mzs(2.3, 3.0)
    data <- mzs(2.3, 3.5)
    data <- mzs(2.3, 3.0)  # Will load cached data

    # Also possible to reassign the function name,
    # but different memoizations of the same
    # function will return the same cache result
    # if input params are the same
    simulate2 <- addMemoization(simulate2)
    data <- simulate2(2.3, 3.0)

    # If the expression being evaluated depends on
    # "input" objects, then these must be specified
    # explicitly as "key" objects.
    for (ii in 1:2) {
      for (kk in 1:3) {
        cat(sprintf("Iteration #%d:\n", kk))
        res <- evalWithMemoization({
          cat("Evaluating expression...")
          a <- kk
          Sys.sleep(1)
          cat("done\n")
          a
        }, key=list(kk=kk))
        # The expression that produces 'res' is skipped on the repeated run
        print(res)

        # Sanity checks
        stopifnot(a == kk)

        # Clean up
        rm(a)
      } # for (kk ...)
    } # for (ii ...)
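
For comparison with the R.cache code, this is roughly what the basic memoise workflow looks like. It is a sketch only: `slow_add` and `calls` are made-up names, the block is skipped when memoise is not installed, and it does not try to reproduce the recursion problem mentioned in the answer.

```r
# Minimal memoise sketch; runs only if the memoise package is available.
if (requireNamespace("memoise", quietly = TRUE)) {
  calls <- 0L
  slow_add <- function(x, y) {
    calls <<- calls + 1L   # count how often the real function runs
    Sys.sleep(0.1)         # stand-in for an expensive calculation
    x + y
  }

  fast_add <- memoise::memoise(slow_add)

  fast_add(1, 2)             # computed by slow_add
  fast_add(1, 2)             # same arguments: served from the cache
  print(calls)

  memoise::forget(fast_add)  # drop all cached results
  fast_add(1, 2)             # computed again
  print(calls)
}
```

By default memoise keeps results in memory for the current session, whereas the R.cache calls in the answer write them to files under the cache root path, so the two packages suit different lifetimes of cached data.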
