Spreading a two column data frame with tidyr
Problem description
I have a data frame that looks like this:
  a b
1 x 8
2 x 6
3 y 3
4 y 4
5 z 5
6 z 6
and I want to turn it into this:
  x y z
1 8 3 5
2 6 4 6
But calling
library(tidyr)
df <- data.frame(
  a = c("x", "x", "y", "y", "z", "z"),
  b = c(8, 6, 3, 4, 5, 6)
)
df %>% spread(a, b)
returns
   x  y  z
1  8 NA NA
2  6 NA NA
3 NA  3 NA
4 NA  4 NA
5 NA NA  5
6 NA NA  6
What am I doing wrong?
Solution
While I'm aware you're after tidyr, base has a solution in this case:
unstack(df, b~a)
It's also a little bit faster:
Unit: microseconds
expr                    min      lq     mean   median       uq      max neval
df %>% spread(a, b) 657.699 679.508 717.7725  690.484 724.9795 1648.381   100
unstack(df, b ~ a)  309.891 335.264 349.4812 341.9635 351.6565  639.738   100
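For readers who want to stay with tidyr: a likely reason for the NA-filled result is that the data frame has no column telling spread() which rows belong together, so each input row ends up on its own output row (newer tidyr versions may stop with an error about duplicate identifiers instead). The following is only a sketch under that assumption. The helper column `id` is a hypothetical name that is not part of the original question; it numbers the rows within each level of `a`.

```r
library(tidyr)

df <- data.frame(
  a = c("x", "x", "y", "y", "z", "z"),
  b = c(8, 6, 3, 4, 5, 6)
)

# Hypothetical helper column: position of each row within its level of `a`
# (1, 2, 1, 2, 1, 2), so spread() can tell which values share an output row.
df$id <- ave(seq_along(df$a), df$a, FUN = seq_along)

wide <- spread(df, a, b)  # columns: id, x, y, z
wide$id <- NULL           # drop the helper column again
wide
```

With the helper column in place, the first value of each level lands on one row and the second value on the next, which is the shape asked for in the question.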
By popular demand, with something bigger
I haven't included the data.table solution as I'm not sure if pass by reference would be a problem for microbenchmark.
library(microbenchmark)
library(tidyr)
library(magrittr)
nlevels <- 3
# Ensure that all levels have the same number of elements
nrow <- 1e6 - 1e6 %% nlevels
df <- data.frame(a = sample(rep(c("x", "y", "z"), length.out = nrow)),
                 b = sample.int(9, nrow, replace = TRUE))
microbenchmark(
  df %>% spread(a, b),
  unstack(df, b ~ a),
  data.frame(split(df$b, df$a)),
  do.call(cbind, split(df$b, df$a))
)
Even on 1 million rows, unstack is faster than spread. Notably, the split solutions are also very fast, with median times of roughly 20 to 28 ms against about 53 ms for unstack.
Unit: milliseconds
expr                                    min        lq      mean    median       uq       max neval
df %>% spread(a, b)               366.24426 414.46913 450.78504 453.75258 486.1113 542.03722   100
unstack(df, b ~ a)                 47.07663  51.17663  61.24411  53.05315  56.1114 102.71562   100
data.frame(split(df$b, df$a))      19.44173  19.74379  22.28060  20.18726  22.1372  67.53844   100
do.call(cbind, split(df$b, df$a))  26.99798  27.41594  31.27944  27.93225  31.2565  79.93624   100
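The two split-based calls from the benchmark can also be tried on the small data frame from the question. This is a minimal sketch that assumes, as the benchmark setup does, that every level of `a` has the same number of rows; the names `parts`, `wide_df` and `wide_mat` are illustrative and do not appear in the original answer.

```r
df <- data.frame(
  a = c("x", "x", "y", "y", "z", "z"),
  b = c(8, 6, 3, 4, 5, 6)
)

# split() returns a named list with one numeric vector per level of `a`
parts <- split(df$b, df$a)

wide_df  <- data.frame(parts)      # data frame with columns x, y, z
wide_mat <- do.call(cbind, parts)  # numeric matrix with the same columns
```

The data.frame() variant keeps the result a data frame, while the cbind variant returns a matrix, which may matter for whatever consumes the result afterwards.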