R and version control for the solo data analyst


Question

Many data analysts that I respect use version control. For example:

However, I'm evaluating whether adopting a version control system such as git would be worthwhile.

A brief overview: I'm a social scientist who uses R to analyse data for research publications. I don't currently produce R packages. My R code for a project typically includes a few thousand lines of code for data input, cleaning, manipulation, analyses, and output generation. Publications are typically written using LaTeX.

With regard to version control, there are many benefits that I have read about, yet they seem to be less relevant to the solo data analyst.

  • Backup: I have a backup system already in place.
  • Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc)
  • Collaboration: Most of the time I am analysing data myself, thus, I wouldn't get the collaboration benefits of version control.

There are also several potential costs involved with adopting version control:

  • Time to evaluate and learn a version control system
  • A possible increase in complexity over my current file management system

However, I still have the feeling that I'm missing something. General guides on version control seem to be addressed more towards computer scientists than data analysts.

Thus, specifically in relation to data analysts in circumstances similar to those listed above:

  1. Is version control worth the effort?
  2. What are the main pros and cons of adopting version control?
  3. What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?

Answer

I feel the answer to your question is a resounding yes: the benefits of managing your files with a version control system far outweigh the costs of implementing such a system.

I will try to respond in detail to some of the points you raised:

  • Backup: I have a backup system already in place.

Yes, and so do I. However, there are some questions to consider regarding the appropriateness of relying on a general purpose backup system to adequately track important and active files relating to your work. On the performance side:

  • At what interval does your backup system take snapshots?
  • How long does it take to build a snapshot?
  • Does it have to image your entire hard drive when taking a snapshot, or could it be easily told to just back up two files that just received critical updates?
  • Can your backup system show you, with pinpoint accuracy, what changed in your text files from one backup to the next?

And most importantly:

  • How many locations are the backups saved in? Are they in the same physical location as your computer?
  • How easy is it to restore a given version of a single file from your backup system?

For example, I have a Mac and use Time Machine to back up to another hard drive in my computer. Time Machine is great for recovering the odd file or restoring my system if things get messed up. However, it simply doesn't have what it takes to be trusted with my important work:

  • When backing up, Time Machine has to image the whole hard drive, which takes a considerable amount of time. If I continue working, there is no guarantee that my file will be captured in the state it was in when I initiated the backup. I may also reach another point I would like to save before the first backup finishes.

  • The hard drive to which my Time Machine backups are saved is located in my machine, which makes my data vulnerable to theft, fire and other disasters.

With a version control system like Git, I can initiate a backup of specific files with no more effort than requesting a save in a text editor, and the file is imaged and stored instantaneously. Furthermore, Git is distributed, so each computer that I work at has a full copy of the repository.
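To make that concrete, here is a minimal sketch of snapshotting one specific file and then asking Git exactly what changed afterwards. The file names, the placeholder identity and the throwaway directory are all assumptions for illustration, not part of anyone's real project.

```shell
# Work in a throwaway directory so nothing real is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"  # placeholder identity

# Two hypothetical files: an analysis script and some scratch notes.
printf 'dat <- read.csv("survey.csv")\n' > analysis.R
printf 'scratch notes\n' > notes.txt

# Snapshot only the analysis script; notes.txt stays untracked.
git add analysis.R
git commit -q -m "Read in the survey data"

# Edit the script, then show line by line what changed since the snapshot.
printf 'summary(dat)\n' >> analysis.R
git --no-pager diff -- analysis.R

# List what Git is tracking so far.
tracked_files="$(git ls-files)"
echo "Tracked: $tracked_files"
```

The `git diff` step is the "pinpoint accuracy" question from the backup checklist: it shows only the added `summary(dat)` line rather than a whole new copy of the file.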

This amounts to having my work mirrored across four different computers. Nothing short of an act of God could destroy my files and data, at which point I probably wouldn't care too much anyway.

  • Forking and rewinding: I've never felt the need to do this, but I can see how it could be useful (e.g., you are preparing multiple journal articles based on the same dataset; you are preparing a report that is updated monthly, etc)

As a soloist, I don't fork that much either. However, the time I have saved by having the option to rewind has single-handedly paid back my investment in learning a version control system many, many times. You say you have never felt the need to do this, but has rewinding any file under your current backup system really been a painless, feasible option?

Sometimes the report just looked better 45 minutes, an hour or two days ago.
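As a rough illustration of rewinding, the sketch below commits two drafts of a report and then restores the earlier one. The file name `report.Rnw`, its contents and the placeholder identity are made up for the example; only the Git commands themselves are standard.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"  # placeholder identity

# First draft: the version we will later want back.
printf 'Draft 1: clear and short\n' > report.Rnw
git add report.Rnw
git commit -q -m "First draft of report"
good_commit="$(git rev-parse HEAD)"

# Second draft: an experiment that did not work out.
printf 'Draft 2: an experiment that did not work out\n' > report.Rnw
git commit -q -am "Rework the report"

# Browse the history of this one file.
git --no-pager log --oneline -- report.Rnw

# Bring back the file exactly as it was in the earlier commit.
git checkout "$good_commit" -- report.Rnw
restored="$(cat report.Rnw)"
echo "$restored"
```

Nothing is lost by rewinding this way: the second draft is still in the history and can be restored with the same command if you change your mind again.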

  • Collaboration: Most of the time I am analysing data myself, thus, I wouldn't get the collaboration benefits of version control.

Yes, but you would learn a tool that may prove to be indispensable if you do end up collaborating with others on a project.

  • Time to evaluate and learn a version control system

Don't worry too much about this. Version control systems are like programming languages: they have a few key concepts that need to be learned, and the rest is just syntactic sugar. Basically, the first version control system you learn will require investing the most time; switching to another one just requires learning how the new system expresses the key concepts.

Pick a popular system and go for it!

  • A possible increase in complexity over my current file management system

Do you have one folder, say Projects, that contains all the folders and files related to your data analysis activities? If so, then slapping version control on it is going to increase the complexity of your file system by exactly 0. If your projects are strewn about your computer, then you should centralize them before applying version control, and this will end up decreasing the complexity of managing your files. That's why we have a Documents folder, after all.
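A small sketch of what "exactly 0" means in practice, assuming a hypothetical Projects folder with two made-up analyses in it: initialising a repository adds one hidden `.git` directory and leaves your own files and folders as they were.

```shell
# Hypothetical layout: a Projects folder holding two analyses.
projects_dir="$(mktemp -d)/Projects"
mkdir -p "$projects_dir/paper-one" "$projects_dir/monthly-report"
printf 'dat <- read.csv("raw.csv")\n' > "$projects_dir/paper-one/analysis.R"
printf 'source("update.R")\n' > "$projects_dir/monthly-report/run.R"
cd "$projects_dir"

# List everything except Git's own bookkeeping directory.
list_files() { find . -path ./.git -prune -o -print | sort; }

before="$(list_files)"
git init -q
after="$(list_files)"

# The only addition is the hidden .git directory; your files are untouched.
if [ "$before" = "$after" ] && [ -d .git ]; then
  echo "Layout unchanged; history lives in .git"
fi
```

Whether to use one repository for the whole folder or one per project is a judgment call; the sketch uses a single repository only because that matches the paragraph above.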

  1. Is version control worth the effort?

Yes! It gives you a huge undo button and allows you to easily transfer work from machine to machine without worrying about things like losing your USB drive.
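Moving work between machines might look something like the following sketch. Two directories stand in for two computers and a bare repository stands in for whatever shared location you use (a server, a hosted service, an external drive); all paths, file names and the placeholder identity are assumptions for illustration.

```shell
work_dir="$(mktemp -d)"

# "Machine A": start the project and record a first snapshot.
mkdir "$work_dir/machine_a"
cd "$work_dir/machine_a"
git init -q
git config user.name "Solo Analyst"          # placeholder identity
git config user.email "analyst@example.com"  # placeholder identity
printf 'fit <- lm(y ~ x, data = dat)\n' > model.R
git add model.R
git commit -q -m "Add regression model"

# A bare repository stands in for the shared location both machines can reach.
git clone -q --bare "$work_dir/machine_a" "$work_dir/hub.git"
git remote add origin "$work_dir/hub.git"

# "Machine B": get a full copy of the project and its history.
git clone -q "$work_dir/hub.git" "$work_dir/machine_b"

# More work on machine A, published to the shared location.
printf 'summary(fit)\n' >> model.R
git commit -q -am "Summarise the model"
git push -q origin HEAD

# Machine B catches up with a single command.
cd "$work_dir/machine_b"
git pull -q --ff-only
synced="$(cat model.R)"
echo "$synced"
```

After the pull, both copies hold the same file and the same history, with no USB drive involved.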

  2. What are the main pros and cons of adopting version control?

The only con I can think of is a slight increase in file size, but modern version control systems can do absolutely amazing things with compression and selective saving, so this is pretty much a moot point.

  3. What is a good strategy for getting started with version control for data analysis with R (e.g., examples, workflow ideas, software, links to guides)?

Keep the files that generate data or reports under version control, but be selective. If you are using something like Sweave, store your .Rnw files and not the .tex files that get produced from them. Store raw data if it would be a pain to re-acquire. If possible, write and store a script that acquires your data and another that cleans or modifies it, rather than storing changes to the raw data.
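One way to be selective is a `.gitignore` file, sketched below for a made-up Sweave project. The file names are hypothetical, and the ignore rules name the generated files specifically rather than ignoring every `.tex` file, on the assumption that you may also have hand-written LaTeX you do want to track.

```shell
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Hypothetical project: source files plus one generated file.
printf '\\documentclass{article}\n' > report.Rnw
printf 'download.file(url, "raw.csv")\n' > get_data.R
printf 'dat <- na.omit(read.csv("raw.csv"))\n' > clean_data.R
printf 'generated by Sweave\n' > report.tex   # stands in for Sweave output

# Tell Git to skip the generated outputs.
printf 'report.tex\nreport.pdf\n' > .gitignore

# Stage everything; the ignored files are left out automatically.
git add .
git ls-files
```

With this in place, `git add .` picks up the `.Rnw` file and the two scripts but never the generated `report.tex`.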

As for learning a version control system, I highly recommend Git and this guide to it.

These websites also have some nice tips and tricks related to performing specific actions with Git:
