How to create, structure, maintain and update data codebooks in R?
Problem description
In the interest of replication, I like to keep a codebook with metadata for each data frame. A data codebook is:
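A written or computerized list that provides a clear and comprehensive description of the variables that will be included in the database. Marczyk et al. (2010)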
I like to document the following attributes of a variable:
- name
- description (label, format, scale, etc)
- source (e.g. World bank)
- source media (url and date accessed, CD and ISBN, or whatever)
- file name of the source data on disk (helps when merging codebooks)
- notes
For example, this is what I am implementing to document the variables in data frame mydata1 with 8 variables:
code.book.mydata1 <- data.frame(
  variable.name = names(mydata1),
  label = c("Label 1",
            "State name",
            "Personal identifier",
            "Income per capita, thousands of US$, constant year 2000 prices",
            "Unique id",
            "Calendar year",
            "blah",
            "bah"),
  # ncol() gives the number of variables, one codebook row per variable
  source       = rep("unknown", ncol(mydata1)),
  source_media = rep("unknown", ncol(mydata1)),
  filename     = rep("unknown", ncol(mydata1)),
  notes        = rep("unknown", ncol(mydata1))
)
I write a different codebook for each data set I read. When I merge data frames, I also merge the relevant parts of their associated codebooks to document the final database. I do this by essentially copy-pasting the code above and changing the arguments.
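The copy-paste step described above could be reduced with a small constructor function. The following is only a minimal sketch: `make_codebook`, the toy data frames `income` and `states`, and the file names are hypothetical and stand in for real sources.

```r
# Hypothetical helper: build a codebook skeleton for any data frame.
# Length-1 arguments are recycled by data.frame() to one row per variable.
make_codebook <- function(df, label = NA_character_, source = "unknown",
                          source_media = "unknown", filename = "unknown",
                          notes = "") {
  data.frame(variable.name = names(df),
             label = label,
             source = source,
             source_media = source_media,
             filename = filename,
             notes = notes,
             stringsAsFactors = FALSE)
}

# Two toy data frames standing in for real sources (made-up data).
income <- data.frame(id = 1:3, income = c(10.5, 20.1, 15.3))
states <- data.frame(id = 1:3, state = c("AL", "AK", "AZ"))

cb_income <- make_codebook(income, label = c("Unique id", "Income per capita"),
                           filename = "income.csv")
cb_states <- make_codebook(states, label = c("Unique id", "State name"),
                           filename = "states.csv")

merged <- merge(income, states, by = "id")

# Stack the codebooks, keep one row per variable, and order the rows
# to match the columns of the merged data frame.
cb_merged <- rbind(cb_income, cb_states)
cb_merged <- cb_merged[!duplicated(cb_merged$variable.name), ]
cb_merged <- cb_merged[match(names(merged), cb_merged$variable.name), ]
rownames(cb_merged) <- NULL
```

Note that for a key shared by both sources (here `id`), this sketch keeps the codebook row from whichever codebook comes first in `rbind()`; whether that is the right rule depends on the data.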
Recommended answer
You could add any special attribute to any R object with the attr function. E.g.:
x <- cars
attr(x,"source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
And see the given attribute in the structure of the object:
> str(x)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
- attr(*, "source")= chr "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
You can also retrieve the attribute with the same attr function:
> attr(x, "source")
[1] "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
If you only add new cases to your data frame, the given attribute will not be affected (see: str(rbind(x,x))), while altering the structure will erase the given attributes (see: str(cbind(x,x))).
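One way to cope with attributes being lost after a structural change is to copy them back explicitly. This is a minimal sketch, not part of base R: `copy_custom_attrs` is a hypothetical helper, and the `ratio` column is invented for illustration.

```r
# Hypothetical helper: copy every non-standard attribute from `from` to `to`.
copy_custom_attrs <- function(from, to,
                              standard = c("names", "row.names", "class")) {
  for (nm in setdiff(names(attributes(from)), standard)) {
    attr(to, nm) <- attr(from, nm, exact = TRUE)
  }
  to
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."

# A structural change: add a derived column via cbind().
y <- cbind(x, ratio = x$dist / x$speed)

# Re-attach the metadata from the original object.
y <- copy_custom_attrs(x, y)
attr(y, "source")
```

Calling the helper right after each cbind() or similar step keeps the metadata attached regardless of whether that particular operation preserves attributes.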
Update: based on comments
If you want to list all non-standard attributes, check the following:
setdiff(names(attributes(x)),c("names","row.names","class"))
This will list all non-standard attributes (standard are: names, row.names, class in data frames).
Based on that, you could write a short function to list all non-standard attributes together with their values. The following works, though not in a neat way. You could improve it and wrap it in a function :)
First, define the unique (i.e. non-standard) attributes:
uniqueattrs <- setdiff(names(attributes(x)),c("names","row.names","class"))
And make a matrix which will hold the names and values:
attribs <- matrix(0,0,2)
Loop through the non-standard attributes and save in the matrix the names and values:
# seq_along() avoids looping over c(1, 0) when there are no custom attributes
for (i in seq_along(uniqueattrs)) {
  attribs <- rbind(attribs, c(uniqueattrs[i], attr(x, uniqueattrs[i])))
}
Convert the matrix to a data frame and name the columns:
attribs <- as.data.frame(attribs)
names(attribs) <- c('name', 'value')
And save in any format, e.g.:
write.csv(attribs, 'foo.csv')
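The steps above could be folded into one function. The sketch below is one possible version, not a definitive one: `custom_attrs` is a hypothetical name, the `accessed` attribute is invented for illustration, and collapsing vector-valued attributes into a single string is an assumption about what is convenient in a CSV.

```r
# Hypothetical helper: return the custom attributes as a two-column data frame.
custom_attrs <- function(obj, standard = c("names", "row.names", "class")) {
  nms <- as.character(setdiff(names(attributes(obj)), standard))
  # Collapse each value to one string so vector-valued attributes fit in one cell.
  vals <- vapply(nms, function(nm) {
    paste(attr(obj, nm, exact = TRUE), collapse = "; ")
  }, character(1))
  data.frame(name = nms, value = unname(vals), stringsAsFactors = FALSE)
}

x <- cars
attr(x, "source") <- "Ezekiel, M. (1930) _Methods of Correlation Analysis_. Wiley."
attr(x, "accessed") <- "unknown"

attribs <- custom_attrs(x)

# Written to a temporary directory here; use any path you like.
out_file <- file.path(tempdir(), "foo.csv")
write.csv(attribs, out_file, row.names = FALSE)
```

Compared with the loop, this avoids growing a matrix row by row and does not fail when an attribute holds more than one value.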
To your question about variable labels: check the read.spss function from the package foreign, as it does what you need: it saves the variable labels in an attribute. The main idea is that an attribute can itself be a data frame or another object, so you do not need a separate attribute for every variable. Create just one (e.g. named "variable.labels") and store all the information there. You can then look up a label with attr(x, "variable.labels")['foo'], where 'foo' stands for the required variable name. Check the function cited above and the attributes of the imported data frames for more details.
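As a minimal sketch of the single-attribute idea, the labels can be stored as one named character vector and turned into a codebook-style data frame on demand. The attribute is set by hand here rather than by read.spss, and the label texts follow the descriptions in the help page of the built-in cars data set.

```r
x <- cars

# One named character vector holds the labels for all variables.
attr(x, "variable.labels") <- c(speed = "Speed (mph)",
                                dist  = "Stopping distance (ft)")

# Look up a single label by variable name.
attr(x, "variable.labels")["dist"]

# Turn the attribute into a codebook-style data frame,
# ordered like the columns of x.
labels <- attr(x, "variable.labels")
codebook <- data.frame(variable.name = names(x),
                       label = unname(labels[names(x)]),
                       stringsAsFactors = FALSE)
```

Indexing `labels` by `names(x)` keeps the codebook aligned with the data frame even if the labels were stored in a different order; variables without a label would get NA.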
I hope these help you write the required functions in a much neater way than I tried above! :)