R: Transform from Wide to Long without sorting columns

Problem description

I want to convert a dataframe from wide format to long format.

Here is a toy example:

mydata <- data.frame(ID=1:5, ZA_1=1:5, 
            ZA_2=5:1,BB_1=rep(3,5),BB_2=rep(6,5),CC_7=6:2)

ID ZA_1 ZA_2 BB_1 BB_2 CC_7
 1    1    5    3    6    6
 2    2    4    3    6    5
 3    3    3    3    6    4
 4    4    2    3    6    3
 5    5    1    3    6    2

There are some variables that will remain as is (here only ID) and some that will be transformed to long format (here all other variables, all ending with _1, _2 or _7)

In order to transform it to long format I'm using data.table melt and dcast, a generic way able to detect the variables automatically. Other solutions are welcome too.

library(data.table)
setDT(mydata)
idvars =  grep("_[1-7]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
nuevo <- dcast(
  temp[, `:=`(var = sub("_[1-7]$", '', variable),
  measure = sub('.*_', '', variable), variable = NULL)],  
  ... ~ var, value.var='value') 



ID measure BB  CC  ZA
 1      1   3  NA   1
 1      2   6  NA   5
 1      7  NA   6  NA
 2      1   3  NA   2
 2      2   6  NA   4
 2      7  NA   5  NA
 3      1   3  NA   3
 3      2   6  NA   3
 3      7  NA   4  NA
 4      1   3  NA   4
 4      2   6  NA   2
 4      7  NA   3  NA
 5      1   3  NA   5
 5      2   6  NA   1
 5      7  NA   2  NA

As you can see, the columns are reordered alphabetically, but I would prefer to keep the original order as far as possible, for example by taking into account the order in which each variable first appears.


ID ZA_1 ZA_2 BB_1 BB_2 CC_7

should become

ID ZA BB CC

I don't mind whether the idvars columns all come together at the beginning or stay in their original positions.


ID ZA_1 ZA_2 TEMP BB_1 BB_2 CC_2 CC_1

would become

ID ZA TEMP BB CC

or

ID TEMP ZA BB CC

I prefer the last option.

Another problem is that everything gets transformed to character.

Recommended answer

I finally found a way by modifying my initial solution:

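library(data.table)

mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

# Columns without a _20xx suffix are treated as id variables.
idvars <- grep("_20[0-9][0-9]$", names(mydata), invert = TRUE)
temp <- melt(mydata, id.vars = idvars)
# Split each column name into its stem (var) and its suffix (measure).
temp[, `:=`(var = sub("_20[0-9][0-9]$", "", variable),
            measure = sub(".*_", "", variable), variable = NULL)]
# Levels in order of first appearance keep dcast from sorting alphabetically.
temp[, var := factor(var, levels = unique(var))]
dcast(temp, ... ~ var, value.var = "value")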

This gives you the proper measure values. However, this solution needs a lot of memory.
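The question also mentioned columns ending up as character. The following is only a minimal sketch of one place where that can happen, assuming the same toy data as above: the suffix extracted with sub() is text, and it can be converted explicitly once the suffixes are known to be numeric. The name `long` is chosen here for illustration.

```r
library(data.table)

mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

# The measure columns mix integer and double, so melt() coerces the
# value column to double (data.table may emit a warning about this).
long <- melt(mydata, id.vars = "ID", variable.factor = FALSE)

# sub() always returns character, so the suffix starts out as text.
long[, measure := sub(".*_", "", variable)]
is.character(long$measure)  # TRUE

# Convert explicitly; this assumes every suffix is a whole number.
long[, measure := as.integer(measure)]
```

If some suffixes are not numeric, as.integer() would produce NA with a warning, so the conversion is only appropriate when the naming scheme guarantees numeric suffixes.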

The trick was converting the var variable to a factor and specifying the order I want with levels, as mtoto did. mtoto's solution is nice because it only needs melt, with no dcast, but it doesn't work in my updated example: it only works when every stem has the same number of suffix variations.
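As a possible alternative to the factor trick (a sketch, not the approach used in the answer), the stems can be left as character, so that dcast sorts them alphabetically, and the result can then be reordered in place with setcolorder. The names `long` and `wide` are chosen here for illustration, and the toy data is the same as above.

```r
library(data.table)

mydata <- data.table(ID = 1:5, ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

# melt() stacks the measure columns in their original left-to-right order.
long <- melt(mydata, id.vars = "ID", variable.factor = FALSE)

# Split "ZA_2001" into stem "ZA" and suffix "2001".
long[, c("var", "measure") := tstrsplit(variable, "_", fixed = TRUE)]

# var is character here, so dcast orders the new columns alphabetically.
wide <- dcast(long, ID + measure ~ var, value.var = "value")

# Restore the order of first appearance without copying the table.
setcolorder(wide, c("ID", "measure", unique(long$var)))
names(wide)
```

This relies on each column name containing exactly one underscore; names with more underscores would need a different split.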

P.S.: I've been going through each step and found that the melt step can be a big problem when working with large data.tables. If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars, the output is approx. 50000000 x 500, just too much to continue with the next step. data.table needs a direct way to do this without creating giant intermediate results.
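One way to shrink that intermediate table, sketched here under assumptions rather than offered as a definitive fix, is to keep the id columns out of the melt: carry a single row key through melt and dcast, then join the id columns back at the end. The `row_id` column and the `TEMP` id column are hypothetical names added for illustration. The melted table still has one row per cell, but only a handful of columns instead of hundreds.

```r
library(data.table)

mydata <- data.table(ID = 1:5, TEMP = letters[1:5],
                     ZA_2001 = 1:5, ZA_2002 = 5:1,
                     BB_2001 = rep(3, 5), BB_2002 = rep(6, 5), CC_2007 = 6:2)

measure_cols <- grep("_20[0-9][0-9]$", names(mydata), value = TRUE)
id_cols <- setdiff(names(mydata), measure_cols)

# Hypothetical row key: the only id column that goes through melt.
mydata[, row_id := .I]

long <- melt(mydata[, c("row_id", measure_cols), with = FALSE],
             id.vars = "row_id", variable.factor = FALSE)
long[, c("var", "measure") := tstrsplit(variable, "_", fixed = TRUE)]
long[, var := factor(var, levels = unique(var))]  # keep first-appearance order

wide <- dcast(long, row_id + measure ~ var, value.var = "value")

# Attach the id columns once, after reshaping, then drop the key.
result <- merge(mydata[, c("row_id", id_cols), with = FALSE], wide, by = "row_id")
result[, row_id := NULL]
names(result)
```

The result puts the id columns first, followed by measure and the stems in their original order. Whether this is enough for a 100000 x 1000 table depends on available memory, since the row count of the melted table is unchanged.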
