Difference between 'select' and '$' in R
Question
I want to understand the speed difference between select and $ for subsetting columns in R (while appreciating that they do not return exactly the same thing; both perform the conceptual "get me a column" operation). I would like to understand when each is most appropriate.
Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?
The syntax is:
select(df, colName1, colName2, ...)
df$colName
Recommended answer
In summary, you should use dplyr when speed of development, ease of understanding, or ease of maintenance is most important.
- The benchmarks below show that the operation takes longer with dplyr than with the base R equivalents (on mtcars, a median of roughly 785 microseconds for dplyr::select versus about 8 microseconds for $).
- dplyr returns a different (more complex) object.
- Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/or maintain; and return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.
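To make the "different object" point concrete, here is a minimal sketch using only base R and the built-in mtcars data set. The variable names (v, d1, d2) are illustrative and not part of the original answer; dplyr is left out on purpose so the snippet runs without extra packages.

```r
# Extracting one column three ways and checking what comes back.
v  <- mtcars$cyl                    # `$` extracts the column itself
d1 <- mtcars["cyl"]                 # single-bracket indexing keeps the data frame
d2 <- subset(mtcars, select = cyl)  # subset() also keeps the data frame

is.numeric(v)                # TRUE: a bare numeric vector
is.data.frame(d1)            # TRUE: a one-column data frame
is.data.frame(d2)            # TRUE: a one-column data frame
identical(d1[["cyl"]], v)    # TRUE: the column inside d1 is the same vector
```

The vector has lost its column name and row names, which is the "contextual richness" that a one-column data frame still carries.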
This might also help tease out (if one is wont to avoid looking at the source code of packages) that dplyr is doing a lot of work under the hood to target columns. It is also an unfair test, since we get back different things, but all the ops are "give me this column" ops, so read it with that context:
library(dplyr)
microbenchmark::microbenchmark(
base1 = mtcars$cyl, # returns a vector
base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
base3 = mtcars[,"cyl"], # returns a vector
base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.682 6.3860 9.23727 7.7125 10.6050 25.397 100
## base2 4.224 5.9905 9.53136 7.7590 11.1095 27.329 100
## base2a 3.710 5.5380 7.92479 7.0845 10.1045 16.026 100
## base3 6.312 10.9935 13.99914 13.1740 16.2715 37.765 100
## base4 51.084 70.3740 92.03134 76.7350 95.9365 662.395 100
## dplyr1 698.954 742.9615 978.71306 784.8050 1154.6750 3568.188 100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388 100
## dplyr3 64.299 78.3745 126.97205 85.3110 112.1000 2383.731 100
## dplyr4 63.235 73.0450 99.28021 85.1080 114.8465 263.219 100
But what if we have a lot of columns:
# Make a wider version of mtcars
do.call(
cbind.data.frame,
lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols
# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
base1 = mtcars_manycols$cyl_4, # returns a vector
base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
base3 = mtcars_manycols[,"cyl_4"], # returns a vector
base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.534 6.8535 12.15802 8.7865 13.1775 75.095 100
## base2 4.150 6.5390 11.59937 9.3005 13.2220 73.332 100
## base2a 3.904 5.9755 10.73095 7.5820 11.2715 61.687 100
## base3 6.255 11.5270 16.42439 13.6385 18.6910 70.106 100
## base4 66.175 89.8560 118.37694 99.6480 122.9650 340.653 100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705 9354.698 100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716 100
## dplyr3 124.295 142.9535 216.89692 166.7115 209.1550 1138.368 100
## dplyr4 127.280 150.0575 195.21398 169.5285 209.0480 488.199 100
For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse", but the speed of development and the expressiveness usually outweigh the speed difference.
NOTE: dplyr verbs are likely better candidates than subset(), and while I lazily use $, it is also a tad dangerous due to its default partial matching behaviour, as is [[ ]] when called with exact = FALSE (its default, exact = TRUE, matches exactly). A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.
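The partial matching risk can be illustrated with a small sketch. The data frame df and its cylinders column are made up for this example; the exact wording of the warning differs between R versions, so the code only records whether a warning was raised.

```r
# Hypothetical data frame with a long column name and no column called "cyl".
df <- data.frame(cylinders = c(4, 6, 8))

a <- df$cyl                       # partial match: silently returns df$cylinders
b <- df[["cyl"]]                  # NULL: `[[` matches exactly by default
c2 <- df[["cyl", exact = FALSE]]  # partial matching only when requested

# With the option set, the partial match via `$` raises a warning instead of
# passing silently. We capture the warning message rather than printing it.
old <- options(warnPartialMatchDollar = TRUE)
res <- tryCatch(df$cyl, warning = function(w) conditionMessage(w))
options(old)                      # restore the previous setting

is.character(res)                 # TRUE when a warning was raised
```

Setting the option once at the top of a project (for example in a project-level .Rprofile) is one way to apply it everywhere; that placement is a suggestion, not something the original answer prescribes.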