How do you combine "Revision Control" with "Workflow" for R?

Question

I remember coming across R users writing that they use "Revision control" (e.g., "Source control"), and I am curious to know: How do you combine "Revision control" with your statistical analysis workflow?

Two (very) interesting discussions talk about how to deal with the workflow. But neither of them refers to the revision control element:

A Long Update To The Question: Following some of the people's answers, and Dirk's question in the comment, I would like to direct my question a bit more.

After reading the Wiki article about "revision control" (which I was previously not familiar with), it was clear to me that when using revision control, what one does is to build a development structure of his code. This structure either leads to a "final product" or to several branches.

When building something like, let's say, a website, there is usually one end product you work towards (the website), with some prototypes along the way.

But when doing a statistical analysis, the work (in my view) is different. Sometimes you know where you want to get to. But more often, you explore. Explore cleaning the dataset. Explore different methods for statistical analysis, and ask various questions of your data (and I am writing this knowing how Frank Harrell and other experienced statisticians feel about data dredging).

That is why the workflow question with statistical programming is (in my view) a serious and deep question, raising many issues. The simpler ones are technical:

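  • What revision control software do you use (and why)?
  • What IDE do you use (and why)?

The more interesting questions are about workflow:

  • How do you structure your files?
  • What should be a separate file and what should be a revision? Or, asked differently: what should be a "branch" and what should be a "sub-project" in your code? For example: when starting to explore your data, should a plot be created and then deleted because it didn't lead anywhere (but kept as a revision), or should there be a backup file of that path?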

How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow so as to avoid common pitfalls when doing statistical programming with version control?

My intuition is that statistical programming is inherently different from software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's why I am unsure which of the lessons I have read here about version control would be applicable.

Many thanks, Tal

Recommended answer

My workflow is not that different than Bernd's. I usually have a main directory where I put all my *.R code files. As soon as I have more than about 5 lines in a text file I start version control, in my case git. Most of my work is not in a team context meaning that I'm the only one changing my code. As soon as I make a substantive change (yes that is subjective) I do a check in. I agree with Dirk that this process is orthogonal to the workflow.

I use Eclipse + StatET and while there is a plugin for git in Eclipse (EGit and probably others), I don't use it. I'm on Windows and just use git-gui for Windows. Here are some more options.

There's a lot of room for personal idiosyncrasies in version control, but I recommend this one tip as a best practice: if you report results to others (e.g. in a journal article, to your team, or to management in your firm), ALWAYS do a version control check-in right before running results that go out to others. Invariably, 3 months later someone will look at your results and ask some question about the code which you can't answer unless you know the EXACT state of the code when you produced those results. So make it a practice and put in the commit comment "this is the version of the code that I used for 4th quarter financials" or whatever your use case is.
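
One way to make that habit mechanical is to script it. The sketch below is not part of the original answer: it is written in Python purely for illustration (the same git commands can be typed by hand or issued from R), and the function names `checkin_commands` and `run_checkin`, the file name, and the tag name are all hypothetical. It assumes git is installed and that the files already live inside a git repository. It commits the listed files with a descriptive message and then adds an annotated tag pointing at that commit.

```python
import subprocess


def checkin_commands(files, message, tag):
    """Build the git commands for a pre-release check-in (nothing is run here)."""
    if not files:
        raise ValueError("list at least one file to check in")
    return [
        # Stage the analysis files; "--" separates options from file names.
        ["git", "add", "--", *files],
        # Record the snapshot with a message saying what it was used for.
        ["git", "commit", "-m", message],
        # An annotated tag gives that commit a name that is easy to find later.
        ["git", "tag", "-a", tag, "-m", message],
    ]


def run_checkin(files, message, tag, repo_dir="."):
    """Run the commands in an existing repository, stopping at the first failure.

    Note: "git commit" exits with an error when there is nothing new to commit,
    in which case subprocess raises CalledProcessError and no tag is created.
    """
    for command in checkin_commands(files, message, tag):
        subprocess.run(command, cwd=repo_dir, check=True)
```

For example, calling `run_checkin(["analysis.R"], "Code used for 4th quarter financials", "q4-financials")` right before producing the numbers means that `git checkout q4-financials` can later restore the code exactly as it was when those results were generated.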

Also keep in mind that version control is no replacement for a good backup plan. My motto is: "3 copies. 2 geographies. 1 mind at peace."

EDIT (Feb 24, 2010): Joel Spolsky, one of the founders of Stack Overflow, just released a highly visual and very cool intro to Mercurial. This tutorial alone may be reason to adopt Mercurial if you have not already chosen a revision control system. I think when it comes to Git vs. Mercurial the most important advice is to choose one and use it. Maybe use what your friends/coworkers use or use the one with the best tutorial. But just use one already! ;)
