如何使用dplyr融合和投射数据帧? [英] How to melt and cast dataframes using dplyr?

查看:80
本文介绍了如何使用dplyr融合和投射数据帧?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

最近,我正在使用dplyr进行所有数据操作,这是一个非常好的工具。但是,我无法使用dplyr融化或投射数据帧。有什么办法吗?现在,我正在为此目的使用reshape2。

Recently I am doing all my data manipulations using dplyr and it is an excellent tool for that. However I am unable to melt or cast a data frame using dplyr. Is there any way to do that? Right now I am using reshape2 for this purpose.

我想要 dplyr解决方案:

I want 'dplyr' solution for:

require(reshape2)
data(iris)
dat <- melt(iris,id.vars="Species")


推荐答案

reshape2 的后继者是 tidyr melt() dcast()的等效值为 gather() spread()。这样,与您的代码等效的就是

The successor to reshape2 is tidyr. The equivalent of melt() and dcast() are gather() and spread() respectively. The equivalent to your code would then be

library(tidyr)
data(iris)
dat <- gather(iris, variable, value, -Species)

如果您有 magrittr 导入后,您可以像 dplyr 中那样使用管道运算符,即,写

If you have magrittr imported you can use the pipe operator like in dplyr, i.e. write

dat <- iris %>% gather(variable, value, -Species)

请注意,与 melt()不同,您需要显式指定变量和值名称。我发现 gather()的语法非常方便,因为您可以只指定要转换为长格式的列,也可以指定要保留在列中的列。新数据框的前缀是-(就像上面的物种一样),其键入速度比 melt()快一点。但是,我注意到至少在我的计算机上, tidyr 可能比 reshape2 慢得多。

Note that you need to specify the variable and value names explicitly, unlike in melt(). I find the syntax of gather() quite convenient, because you can just specify the columns you want to be converted to long format, or specify the ones you want to remain in the new data frame by prefixing them with '-' (just like for Species above), which is a bit faster to type than in melt(). However, I've noticed that on my machine at least, tidyr can be noticeably slower than reshape2.

编辑为回复@hadley在下面的评论,我在PC上发布了一些比较这两个功能的计时信息。

Edit In reply to @hadley 's comment below, I'm posting some timing info comparing the two functions on my PC.

library(microbenchmark)
microbenchmark(
    melt = melt(iris,id.vars="Species"), 
    gather = gather(iris, variable, value, -Species)
)
# Unit: microseconds
#    expr     min       lq  median       uq      max neval
#    melt 278.829 290.7420 295.797 320.5730  389.626   100
#  gather 536.974 552.2515 567.395 683.2515 1488.229   100

set.seed(1)
iris1 <- iris[sample(1:nrow(iris), 1e6, replace = T), ] 
system.time(melt(iris1,id.vars="Species"))
#    user  system elapsed 
#   0.012   0.024   0.036 
system.time(gather(iris1, variable, value, -Species))
#    user  system elapsed 
#   0.364   0.024   0.387 

sessionInfo()
# R version 3.1.1 (2014-07-10)
# Platform: x86_64-pc-linux-gnu (64-bit)
# 
# locale:
#  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] reshape2_1.4         microbenchmark_1.3-0 magrittr_1.0.1      
# [4] tidyr_0.1           
# 
# loaded via a namespace (and not attached):
# [1] assertthat_0.1 dplyr_0.2      parallel_3.1.1 plyr_1.8.1     Rcpp_0.11.2   
# [6] stringr_0.6.2  tools_3.1.1   

这篇关于如何使用dplyr融合和投射数据帧?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆