Options for caching / memoization / hashing in R


Problem description


    I am trying to find a simple way to use something like Perl's hashes in R (essentially caching), as I intended both to do Perl-style hashing and to write my own memoisation of calculations. However, others have beaten me to the punch and have packages for memoisation. The more I dig, the more I find, e.g. memoise and R.cache, but the differences aren't readily clear. In addition, it's not clear how else one can get Perl-style hashes (or Python-style dictionaries) and write one's own memoization, other than by using the hash package, which doesn't seem to underpin the two memoization packages.

    Since I can find no information on CRAN or elsewhere to distinguish between the options, perhaps this should be a community wiki question on SO: What are the options for memoization and caching in R, and what are their differences?


    As a basis for comparison, here is a list of the options I've found. Also, it seems to me that all depend on hashing, so I'll note the hashing options as well. Key/value storage is somewhat related, but opens a huge can of worms regarding DB systems (e.g. BerkeleyDB, Redis, MemcacheDB and scores of others).

    It looks like the options are:

    Hashing

    • digest - provides hashing for arbitrary R objects.

    Memoization

    • memoise - a very simple tool for memoization of functions.
    • R.cache - offers more functionality for memoization, though it seems some of the functions lack examples.

    Caching

    • hash - Provides caching functionality akin to Perl's hashes and Python dictionaries.

    Key/value storage

    These are basic options for external storage of R objects.

    • stashR
    • filehash

    Checkpointing

    Other

    • Base R supports: named vectors and lists, row and column names of data frames, and names of items in environments. It seems to me that using a list is a bit of a kludge. (There's also pairlist, but it is deprecated.)
    • The data.table package supports rapid lookups of elements in a data table.
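The environment option mentioned above can be sketched with base R alone. This is a minimal illustration, not a benchmark: `count_strings` and the sample input are made up for the example. An environment created with `new.env(hash = TRUE)` is keyed by strings much like a Perl hash, and the input is processed one element at a time.

```r
# Count strings with an environment used as a hash table (base R only).
# `count_strings` is a hypothetical helper, not part of any package.
# Keys must be non-empty strings, because they become variable names.
count_strings <- function(strings) {
  counts <- new.env(hash = TRUE)
  for (s in strings) {
    if (exists(s, envir = counts, inherits = FALSE)) {
      # Key already seen: increment its counter
      assign(s, get(s, envir = counts, inherits = FALSE) + 1L, envir = counts)
    } else {
      # First occurrence of this key
      assign(s, 1L, envir = counts)
    }
  }
  counts
}

counts <- count_strings(c("apple", "pear", "apple", "fig", "apple"))
counts[["apple"]]                          # look up a single key
sort(ls(counts))                           # list all keys
unlist(mget(ls(counts), envir = counts))   # all counts as a named vector
```

The same loop could read from a file connection line by line instead of from a vector, which is the situation where a full `table()` call is inconvenient.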

    Use case

    Although I'm mostly interested in knowing the options, I have two basic use cases that arise:

    1. Caching: Simple counting of strings. [Note: This isn't for NLP, but general use, so NLP libraries are overkill; tables are inadequate because I prefer not to wait until the entire set of strings is loaded into memory. Perl-style hashes are at the right level of utility.]
    2. Memoization of monstrous calculations.

    These really arise because I'm digging into the profiling of some slooooow code, and I'd really like to just count simple strings and see if I can speed up some calculations via memoization. Being able to hash the input values, even if I don't memoize, would let me see if memoization can help.
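One way to check whether memoization would pay off is a hand-rolled wrapper that counts cache hits and misses. The sketch below uses only base R; `memoize_counting` and `slow_square` are hypothetical names, and the key is built with `deparse()` as a simple stand-in for a real hash (hashing the arguments with the digest package would likely be more robust for large or complex inputs).

```r
# Hand-rolled memoization with hit/miss counting (base R only).
# `memoize_counting` is a hypothetical helper written for this sketch.
memoize_counting <- function(f) {
  cache <- new.env(hash = TRUE)
  stats <- c(hits = 0L, misses = 0L)
  wrapper <- function(...) {
    # Assumption: the arguments are small enough that their deparsed
    # text is a usable key (R limits variable names to 10,000 bytes).
    key <- paste(deparse(list(...)), collapse = "")
    if (exists(key, envir = cache, inherits = FALSE)) {
      stats["hits"] <<- stats["hits"] + 1L
      return(get(key, envir = cache, inherits = FALSE))
    }
    stats["misses"] <<- stats["misses"] + 1L
    value <- f(...)
    assign(key, value, envir = cache)
    value
  }
  list(fn = wrapper, stats = function() stats)
}

# Stand-in for an expensive calculation
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

m <- memoize_counting(slow_square)
inputs <- c(2, 3, 2, 2, 3)
results <- vapply(inputs, m$fn, numeric(1))
m$stats()   # how often the cache was hit versus missed
```

A high ratio of hits to misses on real inputs suggests that a memoization package is worth trying; a ratio near zero suggests the inputs rarely repeat.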


    Note 1: The CRAN Task View on Reproducible Research lists a couple of the packages (cacher and R.cache), but there is no elaboration on usage options.

    Note 2: To aid others looking for related code, here are a few notes on some of the authors or packages. Some of the authors use SO. :)

    • Dirk Eddelbuettel: digest - a lot of other packages depend on this.
    • Roger Peng: cacher, filehash, stashR - these address different problems in different ways; see Roger's site for more packages.
    • Christopher Brown: hash - Seems to be a useful package, but the links to ODG are down, unfortunately.
    • Henrik Bengtsson: R.cache & Hadley Wickham: memoise -- it's not yet clear when to prefer one package over the other.

    Note 3: Some people use memoise/memoisation; others use memoize/memoization. Just a note if you're searching around. Henrik uses "z" and Hadley uses "s".

    Solution

    I did not have luck with memoise, because it caused a "too deep recursion" problem with some functions of a package I tried it on. I had better luck with R.cache. Below is annotated code that I adapted from the R.cache documentation. It shows different ways to do caching.

    # Workaround to avoid the interactive question asked when loading the R.cache library
    dir.create(path="~/.Rcache", showWarnings=FALSE)
    library("R.cache")
    setCacheRootPath(path="./.Rcache") # Create .Rcache at current working dir
    # In case we need the cache path, but not used in this example.
    cache.root = getCacheRootPath() 
    simulate <- function(mean, sd) {
        # 1. Try to load cached data, if already generated
        key <- list(mean, sd)
        data <- loadCache(key)
        if (!is.null(data)) {
            cat("Loaded cached data\n")
            return(data);
        }
        # 2. If not available, generate it.
        cat("Generating data from scratch...")
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("ok\n")
        saveCache(data, key=key, comment="simulate()")
        data;
    }
    data <- simulate(2.3, 3.0)
    data <- simulate(2.3, 3.5)
    a = 2.3
    b = 3.0
    data <- simulate(a, b) # Will load cached data, params are checked by value
    # Clean up
    file.remove(findCache(key=list(2.3,3.0)))
    file.remove(findCache(key=list(2.3,3.5)))
    
    simulate2 <- function(mean, sd) {
        data <- rnorm(1000, mean=mean, sd=sd)
        Sys.sleep(1) # Emulate slow algorithm
        cat("Done generating data from scratch\n")
        data;
    }
    # Easy step to memoize a function; here the memoized version gets a new name.
    # This would work with any function from an external package.
    mzs <- addMemoization(simulate2)
    
    data <- mzs(2.3, 3.0)
    data <- mzs(2.3, 3.5)
    data <- mzs(2.3, 3.0) # Will load cached data
    # It is also possible to reassign the function name,
    # but different memoizations of the same
    # function will return the same cached result
    # if the input params are the same.
    simulate2 <- addMemoization(simulate2)
    data <- simulate2(2.3, 3.0)
    
    # If the expression being evaluated depends on
    # "input" objects, then these must be specified
    # explicitly as "key" objects.
    for (ii in 1:2) {
        for (kk in 1:3) {
            cat(sprintf("Iteration #%d:\n", kk))
            res <- evalWithMemoization({
                cat("Evaluating expression...")
                a <- kk
                Sys.sleep(1)
                cat("done\n")
                a
            }, key=list(kk=kk))
            # The expression assigned to 'res' is not re-evaluated on the repeated run
            print(res)
            # Sanity checks
            stopifnot(a == kk)
            # Clean up
            rm(a)
        } # for (kk ...)
    } # for (ii ...)
    
