Reshape messy longitudinal survey data containing multiple different variables, wide to long
Question
I hope I'm not reinventing the wheel, and I don't think the following can be answered using reshape().
I have messy longitudinal survey data that I want to convert from wide to long format. By messy I mean:
- I have multiple variable types (numeric, factor, logical).
- Not every variable was collected at every time point.
For example:
data <- read.table(header=T, text='
id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
1 TRUE FALSE 87717.76 82281.25 happy happy filler
2 TRUE TRUE 70795.53 54995.19 so-so happy filler
3 FALSE FALSE 48012.77 47650.47 sad so-so filler
')
I could not work out how to reshape the data using reshape(), and kept getting the error message 'times' is wrong length. I assume this is because not every variable is recorded on every occasion. I also don't think melt() and cast() from reshape2 will work, as they require all measured variables to be of the same type.
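The failure described above can be reproduced with a call along these lines. This is only a sketch: the asker's exact call is not shown, so the arguments here are an assumption, and the wording of the error message differs between R versions.

```r
# Same toy data as above, minus the time-constant `random` column.
data <- read.table(header = TRUE, text = "
id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3
1 TRUE FALSE 87717.76 82281.25 happy happy
2 TRUE TRUE 70795.53 54995.19 so-so happy
3 FALSE FALSE 48012.77 47650.47 sad so-so
")

# reshape() guesses three time points (1, 2, 3) from the name suffixes,
# but each variable only has two columns, so the call stops with an error.
res <- tryCatch(
  reshape(data, idvar = "id", direction = "long", sep = ".",
          varying = names(data)[-1]),
  error = function(e) e
)

# Print whatever message this R version produces.
if (inherits(res, "error")) message(conditionMessage(res))
```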
I came up with the following solution, which may help others. It selects variables by time point, renames them, and then uses rbind.fill() from plyr to concatenate them. But I wonder whether I'm missing something with reshape(), or whether this can be done more easily with tidyr or another package?
reshapeLong2 <- function(data, varying = NULL, timevar = "time", idvar = "id", sep = ".", patterns = NULL) {
  # rbind.fill() comes from plyr; fail early if it is not installed
  stopifnot(requireNamespace("plyr", quietly = TRUE))
  # Last n characters of each string
  substrRight <- function(x, n) {
    substr(x, nchar(x) - n + 1, nchar(x))
  }
  if (is.null(varying))
    varying <- names(data)[!names(data) %in% idvar]
  # Create patterns if not specified: guess the time points from the single
  # digit at the end of each variable name (non-numeric endings give NA and a warning)
  if (is.null(patterns)) {
    times <- sort(unique(na.omit(as.numeric(substrRight(varying, 1)))))
    patterns <- paste0(sep, times)
  }
  # Create a list of data sets, one per study time
  ls.df <- lapply(patterns, function(pattern) {
    # fixed = TRUE treats the pattern literally, so "." is not a regex wildcard
    var.old <- grep(pattern, x = varying, value = TRUE, fixed = TRUE)
    var.new <- gsub(pattern, "", x = var.old, fixed = TRUE)
    df <- data[, c(idvar, var.old), drop = FALSE]
    names(df) <- c(idvar, var.new)
    df[, timevar] <- match(pattern, patterns)
    df
  })
  # Concatenate the data sets; columns missing at a time point are filled with NA
  plyr::rbind.fill(ls.df)
}
> reshapeLong2(df.test)
id inlove mood time income
1 1 FALSE sad 1 NA
2 2 TRUE so-so 1 NA
3 3 TRUE sad 1 NA
4 1 TRUE <NA> 2 27766.13
5 2 FALSE <NA> 2 74395.30
6 3 TRUE <NA> 2 89004.95
7 1 NA sad 3 27270.07
8 2 NA so-so 3 36971.64
9 3 NA so-so 3 85986.96
Warning message:
In na.omit(as.numeric(substrRight(varying, 1))) :
NAs introduced by coercion
Note: the warning message indicates that some variables were dropped (in this case "random"). It comes from as.numeric() being applied to a variable name that does not end in a digit. The warning is not shown if every variable is listed as either idvar or varying.
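On the tidyr part of the question, one possible route is pivot_longer() with the special ".value" name. The sketch below assumes tidyr 1.1.0 or later is installed and that every time-varying column is named stem.TIME; the object name long is chosen only for illustration.

```r
library(tidyr)

data <- read.table(header = TRUE, text = "
id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
1 TRUE FALSE 87717.76 82281.25 happy happy filler
2 TRUE TRUE 70795.53 54995.19 so-so happy filler
3 FALSE FALSE 48012.77 47650.47 sad so-so filler
")

# ".value" tells pivot_longer() that the part of each name before the dot
# is the output column, so inlove, income and mood keep their own types.
# Stem/time combinations that were never collected come out as NA.
long <- pivot_longer(
  data,
  cols = -c(id, random),
  names_to = c(".value", "time"),
  names_sep = "\\.",
  names_transform = list(time = as.integer)
)
long
```

Because each stem becomes its own column, the mixed logical, numeric and character variables never have to share a single value column, which was the concern raised about melt().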
Recommended answer
If you fill in the varname.TIME columns with NA for all the missing times, you can then simply call reshape(), like this:
uniqnames <- c("inlove","income","mood")
allnames <- make.unique(rep(uniqnames,4))[-(seq_along(uniqnames))]
#[1] "inlove.1" "income.1" "mood.1" "inlove.2" "income.2" "mood.2" ...
data[setdiff(allnames, names(data)[-1])] <- NA
# id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random income.1 mood.2 inlove.3
#1 1 TRUE FALSE 87717.76 82281.25 happy happy filler NA NA NA
#2 2 TRUE TRUE 70795.53 54995.19 so-so happy filler NA NA NA
#3 3 FALSE FALSE 48012.77 47650.47 sad so-so filler NA NA NA
reshape(data, idvar="id", direction="long", sep=".", varying=allnames)
# id random time inlove income mood
#1.1 1 filler 1 TRUE NA happy
#2.1 2 filler 1 TRUE NA so-so
#3.1 3 filler 1 FALSE NA sad
#1.2 1 filler 2 FALSE 87717.76 <NA>
#2.2 2 filler 2 TRUE 70795.53 <NA>
#3.2 3 filler 2 FALSE 48012.77 <NA>
#1.3 1 filler 3 NA 82281.25 happy
#2.3 2 filler 3 NA 54995.19 happy
#3.3 3 filler 3 NA 47650.47 so-so
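The same padding step can be derived from the column names instead of being typed out by hand. The helper below is a hypothetical sketch, not part of the original answer: the name pad_and_reshape is invented here, and it assumes every time-varying column is named stem.TIME with an integer TIME.

```r
data <- read.table(header = TRUE, text = "
id inlove.1 inlove.2 income.2 income.3 mood.1 mood.3 random
1 TRUE FALSE 87717.76 82281.25 happy happy filler
2 TRUE TRUE 70795.53 54995.19 so-so happy filler
3 FALSE FALSE 48012.77 47650.47 sad so-so filler
")

# Hypothetical helper: add an NA column for every missing stem/time
# combination, then hand the padded data frame to reshape().
pad_and_reshape <- function(df, idvar = "id") {
  rx <- "^(.+)\\.([0-9]+)$"                       # <stem>.<digits>
  wide_cols <- grep(rx, names(df), value = TRUE)
  stems <- unique(sub(rx, "\\1", wide_cols))
  times <- sort(unique(as.integer(sub(rx, "\\2", wide_cols))))
  # Full grid of names, ordered time point by time point
  allnames <- as.vector(outer(stems, times, paste, sep = "."))
  df[setdiff(allnames, names(df))] <- NA
  reshape(df, idvar = idvar, direction = "long", sep = ".", varying = allnames)
}

long <- pad_and_reshape(data)
long
```

Columns that do not match the stem.TIME pattern, such as random, are left alone and carried along as time-constant variables, as in the output above.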