从宽转换为长而无需对列进行排序 [英] Transform from Wide to Long without sorting columns

查看:75
本文介绍了从宽转换为长而无需对列进行排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想将数据帧从宽格式转换为长格式。

I want to convert a dataframe from wide format to long format.

这里是一个玩具示例:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

有些变量将保持不变(此处仅是ID),有些变量将转换为长格式(此处的所有其他变量均以_1,_2或_7结尾)

There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7)

为了将其转换为长格式,我使用了data.tablemelt和dcast,这是一种能够自动检测变量的通用方法。也欢迎使用其他解决方案。

In order to transform it to long format I'm using data.table melt and dcast, a generic way able to detect the variables automatically. Other solutions are welcome too.

library(data.table)
setDT(mydata)
idvars =  grep("_[1-7]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
  measure = sub('.*_', '', variable), variable = NULL)],  
  ... ~ var, value.var='value') 



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

如您所见,列按字母顺序重新排列,但我希望尽可能保持原始顺序,例如例子中考虑了变量首次出现的顺序。

As you can see the columns are reoredered alphabetically, but I would prefer to keep the original order as far as possible, for example taking into account the order of the first appearance of the variable.


ID ZA_1 ZA_2 BB_1 BB_2 CC_7

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

应该

ID ZA BB CC

我不请注意,如果idvars列开始时全部在一起,或者它们是否也保持原始位置。

I don't mind if the idvars columns come alltogether at the beginning or if they also stay in their original position.


ID ZA_1 > ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

将是

ID ZA TEMP BB CC

ID TEMP ZA BB CC

我更喜欢最后一个选择。

I prefer the last option.

另一个问题是,所有内容都转换为字符。

Another problem is that everything gets transformed to character.

推荐答案

OP更新了他对自己的问题的回答,抱怨内存消耗在中间的 melt()步骤中,当一半的列为 id.vars 时。他要求 data.table 需要一种直接的方法来做到这一点,而又不会产生巨大的中间步骤

The OP has updated his answer to his own question complaining about the memory consumption of the intermediate melt() step when half of the columns are id.vars. He requested that data.table needs a direct way to do it without creating giant middle steps.

好吧, data.table 已经具备了此功能,称为 join

Well, data.table already does have that ability, it's called join.

考虑到来自Q的样本数据,可以通过仅使用一个id.var进行整形,然后将整形后的结果与原始数据结合起来,从而以较少的内存消耗方式实现整个操作。

Given the sample data from the Q, the whole operation can be implemented in a less memory consuming way by reshaping with only one id.var and later joining the reshaped result with the original data.table:

setDT(mydata)

# add unique row number to join on later 
# (leave `ID` col as placeholder for all other id.vars)
mydata[, rn := seq_len(.N)]

# define columns to be reshaped
measure_cols <- stringr::str_subset(names(mydata), "_\\d$")

# melt with only one id.vars column
molten <- melt(mydata, id.vars = "rn", measure.vars = measure_cols)

# split column names of measure.vars
# Note that "variable" is reused to save memory 
molten[, c("variable", "measure") := tstrsplit(variable, "_")]

# coerce names to factors in the same order as the columns appeared in mydata
molten[, variable := forcats::fct_inorder(variable)]

# remove columns no longer needed in mydata _before_ joining to save memory
mydata[, (measure_cols) := NULL]

# final dcast and right join
result <- mydata[dcast(molten, ... ~ variable), on = "rn"]
result
#    ID rn measure ZA BB CC
# 1:  1  1       1  1  3 NA
# 2:  1  1       2  5  6 NA
# 3:  1  1       7 NA NA  6
# 4:  2  2       1  2  3 NA
# 5:  2  2       2  4  6 NA
# 6:  2  2       7 NA NA  5
# 7:  3  3       1  3  3 NA
# 8:  3  3       2  3  6 NA
# 9:  3  3       7 NA NA  4
#10:  4  4       1  4  3 NA
#11:  4  4       2  2  6 NA
#12:  4  4       7 NA NA  3
#13:  5  5       1  5  3 NA
#14:  5  5       2  1  6 NA
#15:  5  5       7 NA NA  2

Finall y,如果 result [,rn:= NULL] 不再需要,则可以删除行号。

Finally, you may remove the row number if no longer needed by result[, rn := NULL].

此外,您可以通过 rm(molten)除去中间 molten

Furthermore, you can remove the intermediate molten by rm(molten).

我们从 data.table 开始,它由1个id列,5个测量列和5行组成。调整后的结果具有1个id列,3个度量cols和15行。因此,存储在id列中的数据量实际上增加了两倍。但是,中间步骤仅需要一个id.var rn

We have started with a data.table consisting of 1 id column, 5 measure cols and 5 rows. The reshaped result has 1 id column, 3 measure cols, and 15 rows. So, the data volume stored in id columns effectively has tripled. However, the intermediate step needed only one id.var rn.

EDIT 内存消耗至关重要,可能有必要考虑将id.vars和measure.vars保留在两个单独的data.tables中,并仅将必要的id.var列与measure.vars连接起来

EDIT If memory consumption is crucial, it might be worthwhile to consider to keep the id.vars and the measure.vars in two separate data.tables and to join only the necessary id.var columns with the measure.vars on demand.

请注意, melt()的 measure.vars 参数允许使用特殊功能 patterns()。这样,可以编写对 melt()的调用以及

Note that the measure.vars parameter to melt()allows for a special function patterns(). With this the call to melt() could have been written as well as

molten <- melt(mydata, id.vars = "rn", measure.vars = patterns("_\\d$"))

这篇关于从宽转换为长而无需对列进行排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆