R中“选择”和“ $”之间的区别 [英] Difference between 'select' and '$' in R

查看:93
本文介绍了R中“选择”和“ $”之间的区别的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想了解 select $ 到R中子集列之间的速度差异(这很感谢它们不要返回完全相同的东西,而是都执行概念性的 get-me-a-column 操作)。我想了解哪种情况最合适。

I want to understand the speed difference between select and $ to subset columns in R (whilst appreciating that they do not return exactly the same things, rather both perform the conceptual get-me-a-column operation). I would like to understand when either is most appropriate.

具体来说,以下 select 语句在什么条件下比相应的 $ 语句快吗?

Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?

语法是:

select(df, colName1, colName2, ...)
df$colName


推荐答案

总的来说,当开发速度快,易于理解或易用时,应使用 dplyr 维护是最重要的。

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.


  • 以下基准显示,使用 dplyr 进行操作所需的时间要比基本R等价。

  • dplyr 返回一个不同的对象(更复杂)。

  • Base R $ 和类似的操作可以更快地执行,但会带来额外的风险(例如,部分匹配的行为);可能难以阅读和/或维护;返回一个(最小的)矢量对象,该对象可能缺少数据框的某些上下文丰富性。

  • Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
  • dplyr returns a different (more complex) object.
  • Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/to maintain; return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.

这也可能有助于梳理(如果不想避免查看程序包的源代码),则 dplyr 正在做 alot 的工作来定位列。这也是一个不公平的测试,因为我们得到了不同的结果,但是所有操作都是给我本专栏文章操作,因此请在这种情况下阅读:

This might also help tease out (if one is wont to avoid looking at source code of packages) that dplyr is doing alot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:

library(dplyr)

microbenchmark::microbenchmark(
  base1 = mtcars$cyl, # returns a vector
  base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
  base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
  base3 = mtcars[,"cyl"], # returns a vector
  base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
  dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
##    expr     min       lq       mean   median        uq      max neval
##   base1   4.682   6.3860    9.23727   7.7125   10.6050   25.397   100
##   base2   4.224   5.9905    9.53136   7.7590   11.1095   27.329   100
##  base2a   3.710   5.5380    7.92479   7.0845   10.1045   16.026   100
##   base3   6.312  10.9935   13.99914  13.1740   16.2715   37.765   100
##   base4  51.084  70.3740   92.03134  76.7350   95.9365  662.395   100
##  dplyr1 698.954 742.9615  978.71306 784.8050 1154.6750 3568.188   100
##  dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388   100
##  dplyr3  64.299  78.3745  126.97205  85.3110  112.1000 2383.731   100
##  dplyr4  63.235  73.0450   99.28021  85.1080  114.8465  263.219   100

但是,如果我们有 alot 列:

# Make a wider version of mtcars
do.call(
  cbind.data.frame,
  lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols

# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
  base1 = mtcars_manycols$cyl_4, # returns a vector
  base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
  base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
  base3 = mtcars_manycols[,"cyl_4"], # returns a vector
  base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
  dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
  dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
  dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
  dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
##    expr      min        lq       mean    median        uq       max neval
##   base1    4.534    6.8535   12.15802    8.7865   13.1775    75.095   100
##   base2    4.150    6.5390   11.59937    9.3005   13.2220    73.332   100
##  base2a    3.904    5.9755   10.73095    7.5820   11.2715    61.687   100
##   base3    6.255   11.5270   16.42439   13.6385   18.6910    70.106   100
##   base4   66.175   89.8560  118.37694   99.6480  122.9650   340.653   100
##  dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705  9354.698   100
##  dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716   100
##  dplyr3  124.295  142.9535  216.89692  166.7115  209.1550  1138.368   100
##  dplyr4  127.280  150.0575  195.21398  169.5285  209.0480   488.199   100

对于大量项目, dplyr 是个不错的选择。但是,执行速度通常不是 tidyverse的属性,但是开发速度和表达速度通常会超过速度差异。

For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.

注意: dplyr 动词可能比 subset()和-更可取,而我懒惰地使用 $ 由于默认的部分匹配行为也很危险,如 [[]] 而没有 exact = TRUE 。进入其中的一个好习惯(IMO)是在所有您不故意依赖此行为的项目中设置 options(warnPartialMatchDollar = TRUE)

NOTE: dplyr verbs are likely better candidates than subset() and — while I lazily use $ it's also a tad dangerous due to default partial matching behaviour as is [[]] without exact=TRUE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.

这篇关于R中“选择”和“ $”之间的区别的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆