How to melt and cast dataframes using dplyr?
Question
Recently I have been doing all my data manipulation with dplyr, and it is an excellent tool for that. However, I have not found a way to melt or cast a data frame using dplyr. Is there any way to do that? For now I am using reshape2 for this purpose.
I would like a dplyr solution for the following:
require(reshape2)
data(iris)
dat <- melt(iris,id.vars="Species")
Recommended answer
The successor to reshape2 is tidyr. The equivalents of melt() and dcast() are gather() and spread(), respectively. The equivalent of your code would then be
library(tidyr)
data(iris)
dat <- gather(iris, variable, value, -Species)
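The question also asks about casting, which the snippet above does not show. Here is a minimal sketch of the round trip with spread(). It assumes a helper column named id (not part of iris, added here purely for illustration), since spread() needs each combination of identifier columns and key to be unique, and iris has many rows per species.

```r
library(tidyr)

# Hypothetical row identifier: makes every (id, Species, variable) combination unique
iris_id <- cbind(id = seq_len(nrow(iris)), iris)

# Wide -> long: gather every column except the identifiers
long <- gather(iris_id, variable, value, -id, -Species)

# Long -> wide: one column per distinct value of `variable`
wide <- spread(long, variable, value)
```

Without such an identifier, spread() has no way of knowing which measurements belong on the same row and stops with an error about duplicate identifiers.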
If you have magrittr imported, you can use the pipe operator as in dplyr, i.e. write
dat <- iris %>% gather(variable, value, -Species)
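For readers on tidyr 1.0.0 or later, gather() and spread() still work but are marked as superseded in favour of pivot_longer() and pivot_wider(). The following is a sketch of the same reshaping with the newer function, assuming such a tidyr version is installed; the name dat_new is only illustrative.

```r
library(tidyr)
library(magrittr)  # provides %>%

# Same reshaping as the gather() call: keep Species, stack every other column
dat_new <- iris %>%
  pivot_longer(cols = -Species, names_to = "variable", values_to = "value")
```

Note that pivot_longer() returns a tibble and orders the result row by row of the input, whereas gather() stacks the columns one after another, so the two results contain the same values in a different row order.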
Note that, unlike in melt(), you need to specify the variable and value names explicitly. I find the syntax of gather() quite convenient, because you can either specify the columns you want converted to long format, or specify the ones you want to remain as they are in the new data frame by prefixing them with '-' (just like Species above), which is a bit faster to type than in melt(). However, I've noticed that, on my machine at least, tidyr can be noticeably slower than reshape2.
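To illustrate the two ways of selecting columns described above, here is a short sketch; the object names by_exclusion and by_inclusion are only illustrative.

```r
library(tidyr)

# Exclude the column that should stay as it is
by_exclusion <- gather(iris, variable, value, -Species)

# Or list the columns to gather, here as a range of adjacent columns
by_inclusion <- gather(iris, variable, value, Sepal.Length:Petal.Width)
```

Both calls gather the same four measurement columns and keep Species as the identifier, so they should produce the same long data frame.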
Edit: In reply to @hadley's comment below, I'm posting some timing information comparing the two functions on my PC.
library(microbenchmark)
microbenchmark(
melt = melt(iris,id.vars="Species"),
gather = gather(iris, variable, value, -Species)
)
# Unit: microseconds
#    expr     min       lq  median       uq      max neval
#    melt 278.829 290.7420 295.797 320.5730  389.626   100
#  gather 536.974 552.2515 567.395 683.2515 1488.229   100
set.seed(1)
iris1 <- iris[sample(1:nrow(iris), 1e6, replace = TRUE), ]
system.time(melt(iris1,id.vars="Species"))
# user system elapsed
# 0.012 0.024 0.036
system.time(gather(iris1, variable, value, -Species))
# user system elapsed
# 0.364 0.024 0.387
sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-pc-linux-gnu (64-bit)
#
# locale:
# [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
# [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
# [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] reshape2_1.4 microbenchmark_1.3-0 magrittr_1.0.1
# [4] tidyr_0.1
#
# loaded via a namespace (and not attached):
# [1] assertthat_0.1 dplyr_0.2 parallel_3.1.1 plyr_1.8.1 Rcpp_0.11.2
# [6] stringr_0.6.2 tools_3.1.1