Convert data from long format to wide format with multiple measure columns


Question

I am having trouble figuring out the most elegant and flexible way to switch data from long format to wide format when I have more than one measure variable I want to bring along.

For example, here's a simple data frame in long format. ID is the subject, TIME is a time variable, and X and Y are measurements made of ID at TIME:

> my.df <- data.frame(ID=rep(c("A","B","C"), 5), TIME=rep(1:5, each=3), X=1:15, Y=16:30)
> my.df

   ID TIME  X  Y
1   A    1  1 16
2   B    1  2 17
3   C    1  3 18
4   A    2  4 19
5   B    2  5 20
6   C    2  6 21
7   A    3  7 22
8   B    3  8 23
9   C    3  9 24
10  A    4 10 25
11  B    4 11 26
12  C    4 12 27
13  A    5 13 28
14  B    5 14 29
15  C    5 15 30

If I just wanted to turn the values of TIME into column headers containing the values of X, I know I can use cast() from the reshape package (or dcast() from reshape2):

> cast(my.df, ID ~ TIME, value="X")
  ID 1 2 3  4  5
1  A 1 4 7 10 13
2  B 2 5 8 11 14
3  C 3 6 9 12 15

But what I really want to do is also bring along Y as another measure variable, and have the column names reflect both the measure variable name and the time value:

  ID X_1 X_2 X_3 X_4 X_5 Y_1 Y_2 Y_3 Y_4 Y_5
1  A   1   4   7  10  13  16  19  22  25  28
2  B   2   5   8  11  14  17  20  23  26  29
3  C   3   6   9  12  15  18  21  24  27  30

(FWIW, I don't really care if all the X's are first followed by the Y's, or if they are interleaved as X_1, Y_1, X_2, Y_2, etc.)

I can get close to this by cast-ing the long data twice and merging the results, though the column names need some work, and I would need to tweak it if I needed to add a 3rd or 4th variable in addition to X and Y:

merge(
  cast(my.df, ID ~ TIME, value="X"),
  cast(my.df, ID ~ TIME, value="Y"),
  by="ID", suffixes=c("_X","_Y")
)

It seems like some combination of functions in reshape2 and/or plyr should be able to do this more elegantly than my attempt, and handle multiple measure variables more cleanly. Something like cast(my.df, ID ~ TIME, value=c("X","Y")), which isn't valid. But I haven't been able to figure it out.

Recommended answer

In order to handle multiple variables like you want, you need to melt the data you have before casting it.

library("reshape2")

dcast(melt(my.df, id.vars=c("ID", "TIME")), ID~variable+TIME)

This gives:

  ID X_1 X_2 X_3 X_4 X_5 Y_1 Y_2 Y_3 Y_4 Y_5
1  A   1   4   7  10  13  16  19  22  25  28
2  B   2   5   8  11  14  17  20  23  26  29
3  C   3   6   9  12  15  18  21  24  27  30
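
For comparison, base R's reshape() can build a similar wide table without reshape2. This is a swapped-in alternative, not the melt/dcast approach shown in this answer: a minimal sketch, assuming the same my.df as in the question. It interleaves the columns (X_1, Y_1, X_2, Y_2, ...), which the question says is acceptable.

```r
# Rebuild the example data from the question.
my.df <- data.frame(ID = rep(c("A", "B", "C"), 5),
                    TIME = rep(1:5, each = 3),
                    X = 1:15,
                    Y = 16:30)

# Base R alternative (not reshape2): every column other than idvar and
# timevar is spread across the values of TIME, and sep = "_" joins the
# variable name and the time value in each new column name.
wide <- reshape(my.df, idvar = "ID", timevar = "TIME",
                direction = "wide", sep = "_")

names(wide)  # inspect the generated column names
wide
```

Because reshape() picks up all remaining columns by default, a third or fourth measure variable would be carried along without changing the call.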

EDIT based on comment:

The data frame

num.id <- 10
num.time <- 10
my.df <- data.frame(ID=rep(LETTERS[1:num.id], num.time), 
                    TIME=rep(1:num.time, each=num.id), 
                    X=1:(num.id*num.time), 
                    Y=(num.id*num.time)+1:(2*length(1:(num.id*num.time))))

gives a different result (all entries are 2) because the ID/TIME combination does not identify a unique row. In fact, there are two rows for each ID/TIME combination. reshape2 assumes a single value for each possible combination of the variables and will apply a summary function to reduce multiple entries to a single value. That is why there is the warning

Aggregation function missing: defaulting to length
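
One way to spot this situation before casting is to count how many rows share each ID/TIME pair. The following is a minimal sketch in base R using the data frame from this edit; key.counts is a name introduced here for illustration only.

```r
num.id <- 10
num.time <- 10
my.df <- data.frame(ID = rep(LETTERS[1:num.id], num.time),
                    TIME = rep(1:num.time, each = num.id),
                    X = 1:(num.id * num.time),
                    Y = (num.id * num.time) + 1:(2 * length(1:(num.id * num.time))))

# Y is twice as long as the other columns, so data.frame() recycles
# ID, TIME and X to match it, doubling the number of rows.
nrow(my.df)

# Cross-tabulate the two key columns: each cell is the number of rows
# that share that ID/TIME pair.
key.counts <- table(my.df$ID, my.df$TIME)
all(key.counts == 2)

# anyDuplicated() returns 0 only when every key combination is unique.
anyDuplicated(my.df[c("ID", "TIME")]) > 0
```

Any count above 1 means dcast() has to aggregate, which is when it falls back to length.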

You can get something that works if you add another variable which breaks that redundancy.

my.df$cycle <- rep(1:2, each=num.id*num.time)
dcast(melt(my.df, id.vars=c("cycle", "ID", "TIME")), cycle+ID~variable+TIME)

This works because cycle/ID/time now uniquely defines a row in my.df.
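
The uniqueness claim can be checked directly. This is a small sketch in base R, assuming the same construction of my.df and cycle as in this edit; dup.before and dup.after are illustrative names, not part of the original answer.

```r
num.id <- 10
num.time <- 10
my.df <- data.frame(ID = rep(LETTERS[1:num.id], num.time),
                    TIME = rep(1:num.time, each = num.id),
                    X = 1:(num.id * num.time),
                    Y = (num.id * num.time) + 1:(2 * length(1:(num.id * num.time))))
my.df$cycle <- rep(1:2, each = num.id * num.time)

# anyDuplicated() returns the index of the first repeated key, or 0 if none.
dup.before <- anyDuplicated(my.df[c("ID", "TIME")])           # key without cycle
dup.after  <- anyDuplicated(my.df[c("cycle", "ID", "TIME")])  # key with cycle

dup.before > 0   # ID/TIME alone repeats
dup.after == 0   # adding cycle makes every row's key unique
```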
