How to make a great R reproducible example


Question

When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?

How does one make a great R reproducible example?

Recommended answer

A minimal reproducible example consists of the following items:

  • a minimal dataset, necessary to demonstrate the problem
  • the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
  • the necessary information on the packages used, the R version, and the system it is run on
  • in the case of random processes, a seed (set by set.seed()) for reproducibility [1]

For examples of good minimal reproducible examples, see the help files of the function you are using. In general, all the code given there fulfills the requirements of a minimal reproducible example: data is provided, minimal code is provided, and everything is runnable. Also look at questions on Stack Overflow with lots of upvotes.

In most cases, a minimal dataset can easily be provided as a vector or data frame with some values. Or you can use one of the built-in datasets, which are provided with most packages.
A comprehensive list of built-in datasets can be seen with library(help = "datasets"). There is a short description of every dataset, and more information can be obtained, for example, with ?mtcars, where 'mtcars' is one of the datasets in the list. Other packages might contain additional datasets.
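
As a rough sketch of the built-in dataset route, the snippet below lists the datasets shipped with the base 'datasets' package and takes a small slice of mtcars. The variable names (available, example_data) are made up for illustration and are not part of the original answer.

```r
# Names of the datasets shipped with the base 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# A small slice of a built-in dataset is usually enough for an example
example_data <- head(mtcars, 5)
str(example_data)
```

Because mtcars ships with every R installation, readers can run code against it without receiving any data from you.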

Making a vector is easy. Sometimes it is necessary to add some randomness to it, and there are a number of functions for that. sample() can randomize a vector, or give a random vector with only a few values. letters is a useful vector containing the alphabet, which can be used for making factors.

A few examples:

  • random values: x <- rnorm(10) for a normal distribution, x <- runif(10) for a uniform distribution, ...
  • a permutation of some values: x <- sample(1:10) for the vector 1:10 in random order.
  • a random factor: x <- sample(letters[1:4], 20, replace = TRUE)
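
The three tricks above can be combined into one short, seeded snippet. This is only a sketch: the seed value 42 and the variable names are arbitrary choices for illustration. Note that sample() on letters returns a character vector, so factor() is used here to turn it into an actual factor.

```r
set.seed(42)  # arbitrary seed, chosen only for this illustration

values   <- rnorm(10)                                        # random normal values
shuffled <- sample(1:10)                                     # 1:10 in random order
groups   <- factor(sample(letters[1:4], 20, replace = TRUE)) # random factor

table(groups)
```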

For matrices, one can use matrix(), e.g.:

matrix(1:10, ncol = 2)

Data frames can be made using data.frame(). Take care to name the entries in the data frame, and do not make it overly complicated.

An example:

set.seed(1)
Data <- data.frame(
    X = sample(1:10),
    Y = sample(c("yes", "no"), 10, replace = TRUE)
)

For some questions, specific formats may be needed. For these, one can use any of the provided as.someType functions: as.factor, as.Date, as.xts, ... These can be combined with the vector and/or data frame tricks.
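
A minimal sketch of combining the as.someType functions with the data frame trick, using only base R. The column names and the start date are hypothetical, picked purely for illustration.

```r
set.seed(1)
Data <- data.frame(
    when  = as.Date("2024-01-01") + 0:4,  # hypothetical dates, one per row
    group = as.factor(sample(c("a", "b"), 5, replace = TRUE)),
    value = runif(5)
)
str(Data)
```

str() is a quick way to confirm that each column really has the type the question is about before posting.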

If you have some data that would be too difficult to construct using these tips, then you can always make a subset of your original data, using head(), subset() or the indices. Then use dput() to give us something that can be put in R immediately:

> dput(iris[1:4, ]) # first four rows of the iris data set
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", 
"Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
4L), class = "data.frame")

If your data frame has a factor with many levels, the dput output can be unwieldy because it will still list all the possible factor levels, even if they aren't present in the subset of your data. To solve this issue, you can use the droplevels() function. Notice below how Species is a factor with only one level:

> dput(droplevels(iris[1:4, ]))
structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = "setosa",
class = "factor")), .Names = c("Sepal.Length", "Sepal.Width", 
"Petal.Length", "Petal.Width", "Species"), row.names = c(NA, 
4L), class = "data.frame")

When using dput, you may also want to include only relevant columns:

> dput(mtcars[1:3, c(2, 5, 6)]) # first three rows of columns 2, 5, and 6
structure(list(cyl = c(6, 6, 4), drat = c(3.9, 3.9, 3.85), wt = c(2.62, 
2.875, 2.32)), row.names = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"
), class = "data.frame")
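
To illustrate why dput() output is convenient for the reader, the sketch below captures the printed text and evaluates it again, which is what someone pasting your example into their console effectively does. The names small, txt and rebuilt are made up for this illustration.

```r
small <- droplevels(iris[1:4, ])

# Capture exactly what dput() would print to the console
txt <- capture.output(dput(small))

# A reader recreates the object by evaluating that text
rebuilt <- eval(parse(text = txt))

all.equal(small, rebuilt)
```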

One other caveat for dput is that it will not work for keyed data.table objects or for a grouped tbl_df (class grouped_df) from dplyr. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).

In the worst case, you can give a text representation that can be read in using the text parameter of read.table:

zz <- "Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa"

Data <- read.table(text=zz, header = TRUE)

Producing minimal code

This should be the easy part, but often isn't. What you should not do is:

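  • add all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course)
  • copy-paste a whole function or chunk of code that gives an error. First, try to locate exactly which lines result in the error. More often than not, you will find out what the problem is yourself.

What you should do is: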

  • state which packages are needed, if you use any (using library())
  • if you open connections or create files, add some code to close them or delete the files (using unlink())
  • if you change options, make sure the code contains a statement to revert them back to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op))
  • test run your code in a new, empty R session to make sure the code is runnable. People should be able to just copy-paste your data and your code into the console and get exactly the same result as you.
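
The clean-up advice in the list above might look like the following sketch. It assumes base R only; the temporary PDF device is an addition so the snippet also runs without a display, and all variable names are illustrative.

```r
# Graphical parameters: save the old values, then restore them
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)               # temporary device, so no display is needed
op <- par(mfrow = c(1, 2))
plot(1:10)
hist(rnorm(10))
par(op)                      # revert to the original parameters
dev.off()
unlink(plot_file)            # delete the file we created

# Global options: same save-and-restore pattern
old_opts <- options(digits = 4)
print(pi)
options(old_opts)

# Files and connections: close what you open, delete what you create
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp)
con <- file(tmp, open = "r")
first_line <- readLines(con, n = 1)
close(con)
unlink(tmp)
```

Using tempfile() rather than a fixed file name avoids overwriting anything in the working directory of whoever runs the example.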

In most cases, just the R version and the operating system will suffice. When conflicts arise with packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), one should also provide version numbers for those and, if possible, the necessary information on the setup.

If you are running R in RStudio, rstudioapi::versionInfo() can be helpful for reporting your RStudio version.

If you have a problem with a specific package, you may want to provide the version of the package by giving the output of packageVersion("name of the package").
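
A short sketch of collecting this information with base R. The 'stats' package stands in here for whichever package your question is about, and the variable names are illustrative.

```r
r_version   <- R.version.string        # R version as a single string
os_name     <- Sys.info()[["sysname"]] # operating system name
pkg_version <- packageVersion("stats") # replace "stats" with your package
info        <- sessionInfo()           # full session details, incl. loaded packages

r_version
os_name
pkg_version
print(info)
```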

[1] Note: Results of random sampling after set.seed() differ between R 3.6.0 and later and previous versions, because the default sample() algorithm changed in R 3.6.0. Do specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can use the RNGversion() function before set.seed() (e.g. RNGversion("3.5.2")).
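
As an illustration of the note above, the sketch below draws the same seeded sample under the older and the current sampler settings. It assumes R 3.6.0 or later; on such versions RNGversion("3.5.2") emits a warning about the old sampler, which is suppressed here. Variable names are illustrative.

```r
current <- as.character(getRversion())

# Emulate the pre-3.6.0 behaviour (warns on newer R, hence suppressWarnings)
suppressWarnings(RNGversion("3.5.2"))
set.seed(1)
old_style <- sample(1:10)

# Switch back to the defaults of the R version actually in use
RNGversion(current)
set.seed(1)
new_style <- sample(1:10)

old_style
new_style
```

Resetting with RNGversion(current) at the end keeps the rest of the session on the default generator settings.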
