R:从宽到长转换,不排序列 [英] R: Transform from Wide to Long without sorting columns

查看:110
本文介绍了R:从宽到长转换,不排序列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想将数据格式从宽格式转换为长格式。



这里是一个玩具示例:

  mydata<  -  data.frame(ID = 1:5,ZA_1 = 1:5,
ZA_2 = 5:1,BB_1 = rep ,BB_2 = rep(6,5),CC_7 = 6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1 1 5 3 6 6
2 2 4 3 6 5
3 3 3 3 6 4
4 4 2 3 6 3
5 5 1 3 6 2

有一些变量将保持原样(这里只有ID),有些变量将被转换为长格式(这里所有其他变量,以_1,_2或_7结尾)



为了将其转换为长格式我使用data.table熔化和dcast,一个通用的方法能够自动检测变量。

  library(data.table)
setDT(mydata)
idvars = grep(_ [1-7] $,names(mydata),invert = TRUE)
temp< - melt(mydata,id.vars = idvars)
nuevo < b $ b temp [,`:=`(var = sub(_ [1-7] $,'',variable),
measure = sub('。* _','' ,variable = NULL)],
...〜var,value.var ='value')



ID测量BB CC ZA
1 1 3 NA 1
1 2 6 NA 5
1 7 NA 6 NA
2 1 3 NA 2
2 2 6 NA 4
2 7 NA 5 NA
3 1 3 NA 3
3 2 6 NA 3
3 7 NA 4 NA
4 1 3 NA 4
4 2 6 NA 2
4 7 NA 3 NA
5 1 3 NA 5
5 2 6 NA 1
5 7 NA 2 NA

正如你所看到的,列按字母顺序排列,但我希望尽可能保持原始顺序,例如考虑变量第一次出现的顺序。 / p>


ID ZA_1 ZA_2 BB_1 BB_2 CC_7 p>

应为

  ID ZA BB CC 

我不介意如果idvars列在开头全部一起, ZA_2 TEMP BB_1 BB_2 CC_2 CC_1 p>

将是

  ID ZA TEMP BB CC 

  ID TEMP ZA BB CC 

我更喜欢最后一个选项。


解决方案

另外一个问题是,一切都变成了字符。找到方式,修改我的初始解决方案

  mydata<  -  data.table(ID = 1:5,ZA_2001 = 5,ZA_2002 = 5:1,
BB_2001 = rep(3,5),BB_2002 = rep(6,5),CC_2007 = 6:2)

idvars = grep [0-9] [0-9] $,names(mydata),invert = TRUE)
temp < - melt(mydata,id.vars = idvars)
temp [,`:= `(var = sub(_ 20 [0-9] [0-9] $,'',variable),
measure = sub('。* _','',variable) )]
temp [,var:= factor(var,levels = unique(var))]
dcast(temp,...〜var,value.var ='value')

它给你正确的度量值。
无论如何,这个解决方案需要大量的内存。



诀窍是将var变量转换为factor,如mtoto所示。
mtoto解决方案很好,因为它不需要转换和融化,只有融化,但不能在我更新的示例中工作,只有当每个单词的数字变体数相同时才有效。


I want to convert a dataframe from wide format to long format.

Here it is a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
1    1    5    3    6    6
2    2    4    3    6    5
3    3    3    3    6    4
4    4    2    3    6    3
5    5    1    3    6    2

There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7)

In order to transform it to long format I'm using data.table melt and dcast, a generic way able to detect the variables automatically. Other solutions are welcome too.

library(data.table)
setDT(mydata)
idvars =  grep("_[1-7]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
  measure = sub('.*_', '', variable), variable = NULL)],  
  ... ~ var, value.var='value') 



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

As you can see the columns are reoredered alphabetically, but I would prefer to keep the original order as far as possible, for example taking into account the order of the first appearance of the variable.

ID ZA_1 ZA_2 BB_1 BB_2 CC_7

Should be

ID ZA BB CC

I don't mind if the idvars columns come alltogether at the beginning or if they also stay in their original position.

ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

would be

ID ZA TEMP BB CC

or

ID TEMP ZA BB CC

I prefer the last option.

Another problem is that everything gets transformed to character.

解决方案

Finally I've found the way, modifying my initial solution

mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1,
BB_2001=rep(3,5),BB_2002=rep(6,5),CC_2007=6:2)

idvars =  grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable), 
measure = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )

And it gives you the proper measure values. Anyway this solution needs a lot of memory.

The trick was converting the var variable to factor specifying the order I want with levels, as mtoto did. mtoto solution is nice because it doesn't need to cast and melt, only melt, but doesn't work in my updated example, only works when there are the same number of number variations for each word.

这篇关于R:从宽到长转换,不排序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆