Avoid loading data every time in knitr


Problem Description


I am creating a document using knitr and I am finding it tedious to reload the data from disk every time I parse the document while I'm in development. I've subsetted that datafile for development to shorten the load time. I also have knitr cache set to on.

I tried assigning the data to the global environment using <<-, and using exists with where=globalenv(), but that did not work.

Anyone know how to use preloaded data from the environment in knitr or have other ideas to speed up development?

Solution

When a document is knitted, a new environment is created within R, so any objects in the global environment are not passed to the document. This is done intentionally: accidentally referencing an object in the global environment is an easy way to break a reproducible analysis, so starting from a clean session each time means the RMarkdown file runs on its own, regardless of what happens to be in the global environment.

If you do have a use case which justifies preloading the data, there are a few things you can do.

Example Data

Firstly I have created a minimal Rmd file as below called "RenderTest.Rmd":

title: "Render"
author: "Michael Harper"
date: "7 November 2017"
output: pdf_document
---

```{r cars}
summary(cars2)
```

In this example, cars2 is a set of data I am referencing from my global session. Run on its own using the "Knit" command in RStudio, this returns the following error:

Error in summary(cars2): object 'cars2' not found: ... withCallingHandlers -> withVisible -> eval -> eval -> summary
Execution halted

Option 1: Manually Call the render function

The render function from rmarkdown can be called from another R script. By default this does not create a fresh environment for the script to run in, so you can use any objects already loaded in the session. As an example:

# Build file
library(rmarkdown)

cars2<- cars
render("RenderTest.Rmd")

I would, however, be careful doing this. Firstly, the benefit of using RMarkdown is that it makes reproducibility of the script incredibly easy. As soon as you start using external scripts, things become more complicated to replicate, because not all the settings are contained within the file.

Option 2: Save data to an R object

If you have some analysis which takes time to run, you can save the result of the analysis as an R object, and then you can reload the final version of the data into the session. Using my above example:

```{r dataProcess, cache = TRUE}
cars2 <- cars
save(cars2, "carsData.RData") # saves the 'cars2' dataset
```
and then we can just reload the data into the session:

```{r}
load("carsData.RData") # reloads the 'cars2' dataset
```
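
The same save-and-reload pattern also works with saveRDS()/readRDS(), which store a single object and let you choose the variable name on reload; a minimal sketch (the file name "carsData.rds" is just an illustrative choice):

```{r dataProcessRds, cache = TRUE}
cars2 <- cars
saveRDS(cars2, "carsData.rds")   # write the processed dataset to a single-object file
```

```{r}
cars2 <- readRDS("carsData.rds") # read it back, assigning the name explicitly
```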

I prefer this technique. The chunk dataProcess is cached, so it is only rerun if the code changes. The result is saved to a file, which is then loaded by the next chunk. The data still has to be loaded into the session, but you can save the finalised dataset if you need to do any data cleaning.
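
If you rely on caching like this, you can also turn it on for the whole document from a setup chunk at the top of the file, rather than per chunk; a small sketch (the setup label and include = FALSE are just conventions):

```{r setup, include = FALSE}
# Cache every chunk in the document unless an individual chunk overrides it
knitr::opts_chunk$set(cache = TRUE)
```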

Option 3: Build the file less frequently

With the updates made to RStudio over the past few years, there is less of a need to continuously rebuild the file. Chunks can be run directly within the file, and their output viewed inline. This can save you a lot of time that would otherwise be spent optimising the script just to shave a couple of minutes off compilation (which normally makes a good time to get a hot drink anyway!).
