How to make a great R reproducible example


Question

When discussing performance with colleagues, teaching, sending a bug report, or searching for guidance on mailing lists and here on Stack Overflow, a reproducible example is often asked for and always helpful.

What are your tips for creating an excellent example? How do you paste data structures from R in a text format? What other information should you include?

Are there other tricks in addition to using dput(), dump() or structure()? When should you include library() or require() statements? Which reserved words should one avoid, in addition to c, df, data, etc.?

How does one make a great R reproducible example?

Recommended answer

Basically, a minimal reproducible example (MRE) should enable others to reproduce your issue exactly on their machines.

An MRE consists of the following items:

  • a minimal dataset, necessary to demonstrate the problem
  • the minimal runnable code necessary to reproduce the error, which can be run on the given dataset
  • all necessary information on the packages used, the R version, and the OS it is run on
  • in the case of random processes, a seed (set by set.seed()) for reproducibility

For examples of good MREs, see the "Examples" section at the bottom of the help file for the function you are using. Simply type e.g. help(mean), or ?mean for short, into your R console.

Usually, sharing huge datasets is not necessary and may even discourage others from reading your question. Therefore, it is better to use built-in datasets or to create a small "toy" example that resembles your original data, which is what is actually meant by minimal. If for some reason you really need to share your original data, you should use a method, such as dput(), that allows others to get an exact copy of your data.

You can use one of the built-in datasets. A comprehensive list of built-in datasets can be seen with data(). There is a short description of every dataset, and more information can be obtained, e.g. with ?iris for the 'iris' dataset that comes with R. Installed packages might contain additional datasets.
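As a small illustration of the point about built-in datasets, the sketch below takes a few rows and columns from mtcars, which ships with R in the datasets package. The names small and available are just illustrative choices.

```r
# mtcars is available in every R installation (package 'datasets').
small <- head(mtcars[, c("mpg", "cyl", "wt")], 3)

# str() shows dimensions and column classes at a glance.
str(small)

# List the names of the datasets shipped with the 'datasets' package.
available <- data(package = "datasets")$results[, "Item"]
head(available)
```

A three-row slice like this is usually enough for a question, and readers already have the data on their machine.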

Preliminary note: Sometimes you may need special formats (i.e. classes), such as factors, dates, or time series. For these, use functions such as as.factor, as.Date, as.xts (the latter from the xts package), ... Example:

d <- as.Date("2020-12-30")

where

class(d)
# [1] "Date"

Vector

x <- rnorm(10)  ## random vector normal distributed
x <- runif(10)  ## random vector uniformly distributed    
x <- sample(1:100, 10)  ## 10 random draws out of 1, 2, ..., 100    
x <- sample(LETTERS, 10)  ## 10 random draws out of built-in latin alphabet

Matrix

m <- matrix(1:12, 3, 4, dimnames=list(LETTERS[1:3], LETTERS[1:4]))
m
#   A B C  D
# A 1 4 7 10
# B 2 5 8 11
# C 3 6 9 12

Data frame

set.seed(42)  ## for sake of reproducibility
n <- 6
dat <- data.frame(id=1:n, 
                  date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"),
                  group=rep(LETTERS[1:2], n/2),
                  age=sample(18:30, n, replace=TRUE),
                  type=factor(paste("type", 1:n)),
                  x=rnorm(n))
dat
#   id       date group age   type         x
# 1  1 2020-12-26     A  27 type 1 0.0356312
# 2  2 2020-12-27     B  19 type 2 1.3149588
# 3  3 2020-12-28     A  20 type 3 0.9781675
# 4  4 2020-12-29     B  26 type 4 0.8817912
# 5  5 2020-12-30     A  26 type 5 0.4822047
# 6  6 2020-12-31     B  28 type 6 0.9657529

Note: Although it is widely used, it is better not to name your data frame df, because df() is an R function for the density (i.e. the height of the curve at point x) of the F distribution, and you might get a clash with it.

If you have a specific reason, or data that would be too difficult to construct an example from, you could provide a small subset of your original data, preferably by using dput().

Why use dput()?

dput() prints to the console all the information needed to reproduce your data exactly. You can simply copy the output and paste it into your question.

Simply printing dat (from above) and pasting that output into your question loses information about variable classes and other features. Furthermore, the spaces in the type column make the output difficult to read back in. Even if we set out to use the data, we will not manage to get important features of your data right.

  id       date group age   type         x
1  1 2020-12-26     A  27 type 1 0.0356312
2  2 2020-12-27     B  19 type 2 1.3149588
3  3 2020-12-28     A  20 type 3 0.9781675
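To make the contrast with printed output concrete, here is a minimal sketch of a dput() round trip. It builds a three-row data frame similar to dat above, captures the text that dput() would print, and rebuilds the object from that text alone. The names txt and rebuilt are illustrative.

```r
set.seed(42)
dat <- data.frame(id = 1:3,
                  date = as.Date("2020-12-26") + 0:2,
                  type = factor(paste("type", 1:3)),
                  x = rnorm(3))

# Capture what dput() would print to the console ...
txt <- capture.output(dput(dat))

# ... and rebuild the object from that text, as a reader of the question would.
rebuilt <- eval(parse(text = txt))

# Column classes (integer, Date, factor, numeric) survive the round trip.
sapply(rebuilt, function(col) class(col)[1])

# Compare with a numeric tolerance, since dput() does not necessarily
# print doubles to full precision by default.
all.equal(dat, rebuilt)
```

The Date and factor columns come back with their classes intact, which is exactly the information a pasted printout cannot carry.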

Subset your data

To share a subset, use head(), subset() or indexing, e.g. iris[1:4, ]. Then wrap it in dput() to give others something that can be put into R immediately. Example:

dput(iris[1:4, ]) # first four rows of the iris data set

Console output to share in your question:

structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6), Sepal.Width = c(3.5, 
3, 3.2, 3.1), Petal.Length = c(1.4, 1.4, 1.3, 1.5), Petal.Width = c(0.2, 
0.2, 0.2, 0.2), Species = structure(c(1L, 1L, 1L, 1L), .Label = c("setosa", 
"versicolor", "virginica"), class = "factor")), row.names = c(NA, 
4L), class = "data.frame")

When using dput(), you may also want to include only the relevant columns, e.g. dput(mtcars[1:3, c(2, 5, 6)]).

Note: If your data frame has a factor with many levels, the dput() output can be unwieldy because it still lists all possible factor levels, even those not present in the subset of your data. To solve this issue, you can use the droplevels() function, e.g. dput(droplevels(iris[1:4, ])), after which Species is a factor with only one level. One other caveat for dput() is that it may not work for keyed data.table objects or for grouped tbl_df objects (class grouped_df) from the tidyverse. In these cases you can convert back to a regular data frame before sharing: dput(as.data.frame(my_data)).
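The effect of droplevels() on a subset can be checked directly. This short sketch uses the built-in iris data; the name sub is illustrative.

```r
sub <- iris[1:4, ]

# The subset still carries all three species levels ...
nlevels(sub$Species)

# ... while droplevels() keeps only the level that actually occurs.
nlevels(droplevels(sub)$Species)

# The dput() output of the cleaned factor is correspondingly shorter.
dput(droplevels(sub)$Species)
```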

Combined with the minimal data (see above), your code should reproduce the problem exactly on another machine when it is simply copied and pasted.
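As a rough sketch of how data and code might fit together in a question, the example below invents a small problem (taking the mean of a column that is accidentally a factor). The scenario and the names dat, result and expected are hypothetical and only serve to show the layout.

```r
# Hypothetical question: "Why does mean() return NA for my column?"

# 1. Minimal data, written so it can be pasted straight into R
dat <- data.frame(id = 1:3,
                  value = factor(c("1.5", "2.5", "3.5")))

# 2. Minimal code that triggers the problem
#    (mean() warns that the argument is not numeric or logical;
#    the warning is suppressed here only to keep the output short)
result <- suppressWarnings(mean(dat$value))
result

# 3. What was expected instead
expected <- mean(as.numeric(as.character(dat$value)))
expected

# 4. Environment information
R.version.string
```

Stating the expected result next to the actual one saves readers from guessing what "wrong" means in your case.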

This should be the easy part, but often it isn't. What you should not do:

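  • Show all kinds of data conversions. Make sure the provided data is already in the correct format (unless that is the problem, of course).
  • Copy and paste a whole script that gives an error somewhere. Try to locate exactly which lines cause the error. More often than not, you will find out what the problem is yourself.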

What you should do:

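  • If you use any packages, state which ones (using library()).
  • Test-run your code in a fresh R session to make sure it is runnable. People should be able to copy and paste your data and code into the console and get the same result as you.
  • If you open connections or create files, add some code to close them or delete the files (using unlink()).
  • If you change options, make sure the code contains a statement to revert them to the original ones (e.g. op <- par(mfrow=c(1,2)) ...some code... par(op)).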
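The clean-up advice in the list above could look like the following sketch. It uses options() rather than par() so that it runs without a graphics device; the file name and the variables op, tmp, con and first_line are illustrative.

```r
# Change an option and remember the previous value.
op <- options(digits = 3)
print(pi)          # printed with 3 significant digits
options(op)        # restore the previous setting

# Create a temporary file, use it, then clean up.
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars, 2), tmp)
con <- file(tmp, open = "r")
first_line <- readLines(con, n = 1)
close(con)         # close the connection you opened
unlink(tmp)        # delete the file you created
file.exists(tmp)
```

Writing to tempfile() instead of the working directory also avoids overwriting anything on the reader's machine.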

In most cases, just the R version and the operating system will suffice. When conflicts arise between packages, giving the output of sessionInfo() can really help. When talking about connections to other applications (be it through ODBC or anything else), you should also provide version numbers for those and, if possible, the necessary information on the setup.

If you are running R in RStudio, rstudioapi::versionInfo() can help you report your RStudio version.

If you have a problem with a specific package, you may want to provide the package version by giving the output of packageVersion("name of the package").
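A minimal sketch of collecting this information with base R only follows. The stats package is used as a stand-in for whichever package your question concerns, and info is an illustrative name.

```r
R.version.string                        # R version
Sys.info()[["sysname"]]                 # operating system name
as.character(packageVersion("stats"))   # version of one specific package

# sessionInfo() gathers the R version, platform and attached packages at once.
info <- sessionInfo()
info$R.version$version.string
```

Pasting the full sessionInfo() output is usually only worthwhile when package conflicts are suspected; otherwise the first two lines tend to be enough.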

Using set.seed() you may specify a seed¹, i.e. fix the specific state of R's random number generator. This makes it possible for random functions, such as sample(), rnorm(), runif() and lots of others, to always return the same result. Example:

set.seed(42)
rnorm(3)
# [1]  1.3709584 -0.5646982  0.3631284

set.seed(42)
rnorm(3)
# [1]  1.3709584 -0.5646982  0.3631284

¹ Note: Results obtained after set.seed() can differ between R 3.6.0 and later and earlier versions, because the default sampling method used by sample() changed in R 3.6.0. Specify which R version you used for the random process, and don't be surprised if you get slightly different results when following old questions. To get the same result in such cases, you can call RNGversion() before set.seed() (e.g. RNGversion("3.5.2")).
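A hedged sketch of how RNGversion() might be used when following an old question is given below. It assumes R 3.6.0 or later, where RNGkind() returns three settings including the sampling method, and the variable names are illustrative.

```r
old_kind <- RNGkind()            # remember the current generator settings

set.seed(42)
current <- sample(1:100, 3)      # draws under the current defaults

# Emulate the generator defaults of R 3.5.2. R warns about the older,
# slightly non-uniform sampler, so the warning is suppressed here.
suppressWarnings(RNGversion("3.5.2"))
set.seed(42)
legacy_a <- sample(1:100, 3)
set.seed(42)
legacy_b <- sample(1:100, 3)
identical(legacy_a, legacy_b)    # same seed, same settings, same draws

# Restore the settings saved above.
suppressWarnings(RNGkind(old_kind[1], old_kind[2], old_kind[3]))
```

Restoring the saved settings at the end keeps the rest of the session on the current defaults.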
