从 Wide 转换为 Long 而不对列进行排序 [英] Transform from Wide to Long without sorting columns

查看:25
本文介绍了从 Wide 转换为 Long 而不对列进行排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想将数据帧从宽格式转换为长格式.

I want to convert a dataframe from wide format to long format.

这是一个玩具示例:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

有些变量会保持原样(这里只有 ID),有些会被转换为长格式(这里是所有其他变量,都以 _1、_2 或 _7 结尾)

There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7)

为了将其转换为长格式,我使用了 data.table melt 和 dcast,这是一种能够自动检测变量的通用方法.也欢迎其他解决方案.

In order to transform it to long format I'm using data.table melt and dcast, a generic way able to detect the variables automatically. Other solutions are welcome too.

library(data.table)
setDT(mydata)
idvars =  grep("_[1-7]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
  measure = sub('.*_', '', variable), variable = NULL)],  
  ... ~ var, value.var='value') 



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

如您所见,列按字母顺序重新排序,但我希望尽可能保持原始顺序,例如考虑变量首次出现的顺序.

As you can see the columns are reoredered alphabetically, but I would prefer to keep the original order as far as possible, for example taking into account the order of the first appearance of the variable.

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

应该是

ID ZA BB CC

我不介意 idvars 列在开始时是一起出现的,还是它们也保持在原来的位置.

I don't mind if the idvars columns come alltogether at the beginning or if they also stay in their original position.

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

ID ZA TEMP BB CC

ID TEMP ZA BB CC

我更喜欢最后一个选项.

I prefer the last option.

另一个问题是一切都变成了角色.

Another problem is that everything gets transformed to character.

推荐答案

OP 更新了他对他自己的问题的回答,抱怨中间 melt() 步骤的内存消耗,当一半列是 id.vars.他要求 data.table 需要一种直接的方式来做到这一点,而不需要创建巨大的中间步骤.

The OP has updated his answer to his own question complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He requested that data.table needs a direct way to do it without creating giant middle steps.

嗯,data.table 已经有这个能力了,叫做join.

Well, data.table already does have that ability, it's called join.

给定来自 Q 的样本数据,整个操作可以通过仅使用一个 id.var 进行整形,然后将整形后的结果与原始 data.table 连接,从而以更少的内存消耗方式实现:

Given the sample data from the Q, the whole operation can be implemented in a less memory consuming way by reshaping with only one id.var and later joining the reshaped result with the original data.table:

setDT(mydata)

# add unique row number to join on later 
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]

# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\d$")

# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split column names of measure.vars
# Note that "variable" is reused to save memory 
molten[, c("variable", "measure") := tstrsplit(variable, "_")]

# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]

# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]

# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
#    ID rn measure ZA BB CC
# 1:  1  1       1  1  3 NA
# 2:  1  1       2  5  6 NA
# 3:  1  1       7 NA NA  6
# 4:  2  2       1  2  3 NA
# 5:  2  2       2  4  6 NA
# 6:  2  2       7 NA NA  5
# 7:  3  3       1  3  3 NA
# 8:  3  3       2  3  6 NA
# 9:  3  3       7 NA NA  4
#10:  4  4       1  4  3 NA
#11:  4  4       2  2  6 NA
#12:  4  4       7 NA NA  3
#13:  5  5       1  5  3 NA
#14:  5  5       2  1  6 NA
#15:  5  5       7 NA NA  2

最后,如果 result[, rn := NULL] 不再需要,您可以删除行号.

Finally, you may remove the row number if no longer needed by result[, rn := NULL].

此外,您可以通过 rm(molten) 删除中间的 molten.

Furthermore, you can remove the intermediate molten by rm(molten).

我们从 data.table 开始,其中包含 1 个 id 列、5 个度量值列和 5 行.重塑后的结果有 1 个 id 列、3 个度量值列和 15 行.因此,存储在 id 列中的数据量有效地增加了两倍.但是,中间步骤只需要一个 id.var rn.

We have started with a data.table consisting of 1 id column, 5 measure cols and 5 rows. The reshaped result has 1 id column, 3 measure cols, and 15 rows. So, the data volume stored in id columns effectively has tripled. However, the intermediate step needed only one id.var rn.

编辑如果内存消耗至关重要,可能值得考虑将 id.vars 和 measure.vars 保存在两个单独的 data.tables 中并加入只有必要的 id.var 列和 measure.vars 按需.

EDIT If memory consumption is crucial, it might be worthwhile to consider to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.

请注意,melt()measure.vars 参数允许使用特殊函数 patterns().有了这个,对 melt() 的调用也可以写成

Note that the measure.vars parameter to melt()allows for a special function patterns(). With this the call to melt() could have been written as well as

molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\d$"))

这篇关于从 Wide 转换为 Long 而不对列进行排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆