Automatic documentation of datasets


Problem description

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.

Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.

This seems like something that ProjectTemplate would have already, but I can't seem to find that functionality there.

Does such a tool exist?

If it does not, what considerations should be taken into account to provide maximum flexibility? Some preliminary thoughts:

  1. A markup language should be used (YAML?)
  2. All sub-directories should be scanned
  3. To facilitate (2), a standard extension for a dataset descriptor should be used
  4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.
  5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.
  6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.

Solution

This is a very good question: people should be very concerned about all of the sequences of data collection, aggregation, transformation, etc., that form the basis for statistical results. Unfortunately, this is not widely practiced.

Before addressing your questions, I want to emphasize that this appears quite related to the general aim of managing data provenance. I might as well give you a Google link to read more. :) There are a bunch of resources that you'll find, such as the surveys, software tools (e.g. some listed in the Wikipedia entry), various research projects (e.g. the Provenance Challenge), and more.

That's a conceptual start, now to address practical issues:

I'm working on a project right now where I have been slowly accumulating a bunch of different variables from a bunch of different sources. Being a somewhat clever person, I created a different sub-directory for each under a main "original_data" directory, and included a .txt file with the URL and other descriptors of where I got the data from. Being an insufficiently clever person, these .txt files have no structure.

Welcome to everyone's nightmare. :)

Now I am faced with the task of compiling a methods section which documents all the different data sources. I am willing to go through and add structure to the data, but then I would need to find or build a reporting tool to scan through the directories and extract the information.

No problem. list.files(...,recursive = TRUE) might become a good friend; see also listDirectory() in R.utils.

It's worth noting that filling in a methods section on data sources is a narrow application within data provenance. In fact, it's rather unfortunate that the CRAN Task View on Reproducible Research focuses only on documentation. The aims of data provenance are, in my experience, a subset of reproducible research, and documentation of data manipulation and results is a subset of data provenance. Thus, this task view is still in its infancy regarding reproducible research. It might be useful for your aims, but you'll eventually outgrow it. :)

Does such a tool exist?

Yes. What are such tools? Mon dieu... it is very application-centric in general. Within R, I think that these tools are not given much attention (* see below). That's rather unfortunate - either I'm missing something, or else the R community is missing something that we should be using.

For the basic process that you've described, I typically use JSON (see this answer and this answer for comments on what I'm up to). For much of my work, I represent this as a "data flow model" (that term can be ambiguous, by the way, especially in the context of computing, but I mean it from a statistical analysis perspective). In many cases, this flow is described via JSON, so it is not hard to extract the sequence from the JSON to see how particular results arose.

For more complex or regulated projects, JSON is not enough, and I use databases to define how data was collected, transformed, etc. For regulated projects, the database may have lots of authentication, logging, and more in it, to ensure that data provenance is well documented. I suspect that that kind of DB is well beyond your interest, so let's move on...

1. A markup language should be used (YAML?)

Frankly, whatever you need to describe your data flow will be adequate. Most of the time, I find it adequate to have good JSON, good data directory layouts, and good sequencing of scripts.
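As a rough illustration of what such a JSON descriptor could look like, here is a minimal sketch. It assumes the jsonlite package is installed (base R has no JSON parser), and the field names (`name`, `url`, `retrieved`, `variables`, `rename`) and the example URL are hypothetical, not any kind of standard:

```r
library(jsonlite)  # assumption: jsonlite is installed

# Hypothetical descriptor for one sub-directory of original_data.
# All field names and values here are invented for illustration.
descriptor_text <- '{
  "name": "census_income",
  "url": "http://example.org/census/income.csv",
  "retrieved": "2012-01-15",
  "variables": {
    "INC_TOT": {"rename": "income", "description": "Total household income"}
  }
}'

# Nested JSON objects come back as nested named lists.
descriptor <- fromJSON(descriptor_text)

descriptor$name                      # dataset name
descriptor$variables$INC_TOT$rename  # name the variable takes after cleaning
```

Keeping the rename inside the descriptor is one way to get the "simpler hybrid" from point 4 without parsing any cleaning code.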

2. All sub-directories should be scanned

Done: listDirectory()

3. To facilitate (2), a standard extension for a dataset descriptor should be used

Trivial: ".json". ;-) Or ".SecretSauce" works, too.
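The scanning step from points 2 and 3 can be sketched with base R alone. The `.json` extension and the `original_data` layout follow the discussion above; the helper name `find_descriptors` and the demo file names are made up for illustration:

```r
# Hypothetical helper: return every descriptor file below `root`.
find_descriptors <- function(root, ext = "json") {
  pattern <- paste0("\\.", ext, "$")  # match the extension at the end of the name
  list.files(root, pattern = pattern, recursive = TRUE, full.names = TRUE)
}

# Demo on a throwaway directory tree.
root <- file.path(tempdir(), "original_data")
dir.create(file.path(root, "source_a"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "source_b"), recursive = TRUE, showWarnings = FALSE)
writeLines('{"name": "a"}', file.path(root, "source_a", "dataset.json"))
writeLines('{"name": "b"}', file.path(root, "source_b", "dataset.json"))
writeLines("unstructured notes", file.path(root, "source_b", "notes.txt"))

descriptor_files <- find_descriptors(root)
basename(dirname(descriptor_files))  # sub-directories that have a descriptor
```

Comparing that last result against `list.dirs(root, recursive = FALSE)` would show which source directories still lack a descriptor.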

4. Critically, to make this most useful there needs to be some way to match variable descriptors with the name that they ultimately take on. Therefore either all renaming of variables has to be done in the source files rather than in a cleaning step (less than ideal), some code-parsing has to be done by the documentation engine to track variable name changes (ugh!), or some simpler hybrid such as allowing the variable renames to be specified in the markup file should be used.

As stated, this doesn't quite make sense. Suppose that I take var1 and var2, and create var3 and var4. Perhaps var4 is just a mapping of var2 to its quantiles and var3 is the observation-wise maximum of var1 and var2; or I might create var4 from var2 by truncating extreme values. If I do so, do I retain the name of var2? On the other hand, if you're referring to simply matching "long names" with "simple names" (i.e. text descriptors to R variables), then this is something only you can do. If you have very structured data, it's not hard to create a list of text names matching variable names; alternatively, you could create tokens upon which string substitution could be performed. I don't think it's hard to create a CSV (or, better yet, JSON ;-)) that matches variable name to descriptor. Simply keep checking that all variables have matching descriptor strings, and stop once that's done.
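The descriptor check described above might look like the following base R sketch. The column names `variable` and `descriptor`, the inline CSV, and the `cleaned` data frame are all hypothetical stand-ins:

```r
# Hypothetical mapping of final variable names to text descriptors.
# In practice this would live in a CSV file next to the data.
mapping <- read.csv(text = "variable,descriptor
income,Total household income
age,Age of respondent in years",
  stringsAsFactors = FALSE)

# Stand-in for the cleaned data set; `region` has no descriptor yet.
cleaned <- data.frame(income = c(100, 200),
                      age    = c(30, 40),
                      region = c("N", "S"))

# Variables present in the data but absent from the mapping.
undocumented <- setdiff(names(cleaned), mapping$variable)
if (length(undocumented) > 0) {
  message("Missing descriptors: ", paste(undocumented, collapse = ", "))
}
```

Running this after every cleaning step, until `undocumented` is empty, is one way to keep the mapping in sync with the data.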

5. Ideally the report would be templated as well (e.g. "We pulled the [var] variable from [dset] dataset on [date]."), and possibly linked to Sweave.

That's where others' suggestions of roxygen and roxygen2 can apply.
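roxygen is not shown here. As a simpler stand-in, the bracketed template from the question can be filled by plain string substitution in base R; the function name `fill_template` and the example values are hypothetical:

```r
# Hypothetical helper: replace each [key] token with its value.
fill_template <- function(template, values) {
  out <- template
  for (key in names(values)) {
    # fixed = TRUE treats "[var]" literally rather than as a regex character class
    out <- gsub(paste0("[", key, "]"), values[[key]], out, fixed = TRUE)
  }
  out
}

template <- "We pulled the [var] variable from [dset] dataset on [date]."
sentence <- fill_template(
  template,
  list(var = "income", dset = "census_income", date = "2012-01-15")
)
sentence
```

In a Sweave document, the result could then be placed in running text with `\Sexpr{sentence}`.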

6. The tool should be flexible enough to not be overly burdensome. This means that minimal documentation would simply be a dataset name.

Hmm, I'm stumped here. :)

(*) By the way, if you want one FOSS project that relates to this, check out Taverna. It has been integrated with R as documented in several places. This may be overkill for your needs at this time, but it's worth investigating as an example of a decently mature workflow system.


Note 1: Because I frequently use bigmemory for large data sets, I have to name the columns of each matrix. These are stored in a descriptor file for each binary file. That process encourages the creation of descriptors matching variable names (and matrices) to descriptors. If you store your data in a database or other external files supporting random access and multiple R/W access (e.g. memory mapped files, HDF5 files, anything but .rdat files), you will likely find that adding descriptors becomes second nature.
