Indexing sequence to use for addressing an element of a data frame
Question
There are several ways to access a specific element in a data frame, using various combinations of brackets ([ ]) and dollar signs ($). In time-sensitive functions, the choice of which one to use can be important.
Benchmarking some of the possible combinations:
library(microbenchmark)
df <- data.frame(a=1:6,b=1:6,c=1:6,d=1:6,e=1:6,f=1:6)
microbenchmark(df$c[3],
df[3,]$c,
df[3,3],
df[3,][3],
df[3,][[3]],
df[,3][3],
times=1e3)
This produces the following timings:
Unit: microseconds
expr min lq mean median uq max neval
df$c[3] 9.836 11.4505 14.03068 12.2015 12.9280 1252.854 1000
df[3, ]$c 77.204 89.5750 100.18752 92.2445 98.6395 1351.521 1000
df[3, 3] 15.719 18.9850 21.04074 19.6010 20.7400 82.519 1000
df[3, ][3] 88.599 100.5920 110.59009 104.0415 110.5435 409.050 1000
df[3, ][[3]] 75.856 87.2200 98.67104 89.9360 96.1695 1391.299 1000
df[, 3][3] 11.639 13.4225 14.77493 13.9510 14.6905 55.172 1000
Here we see that df$c[3] is fastest, closely followed by df[,3][3]. The others are much slower.
In time-sensitive applications, I often use data tables rather than data frames, because sorting and subsetting operations are typically much faster. However, addressing operations can be much slower, as we see if we repeat the above for a data.table:
library(data.table)
dt <- as.data.table(df)
microbenchmark(dt$c[3],
dt[3,]$c,
dt[3,3],
dt[3,][[3]],
times=1e3)
Unit: microseconds
expr min lq mean median uq max neval
dt$c[3] 9.503 11.4020 14.90066 12.6820 13.8950 1336.407 1000
dt[3, ]$c 417.756 437.0495 480.26532 448.8625 463.6350 2909.038 1000
dt[3, 3] 205.115 218.9590 238.78000 227.9575 239.1265 1554.503 1000
dt[3, ][[3]] 414.378 435.2115 470.76853 447.1505 461.3310 1906.432 1000
My question is this: is $[ ] guaranteed to always be the fastest addressing method, or can this depend on factors such as the types of data in the data frame (or table), the platform (OS), or the build version? If anyone can explain the reasons underlying the differences in timing, and/or the pros and cons of the various approaches, that would also be useful.
UPDATE
Following the suggestions in the answer from 42-, the test is repeated here using more rows and with the additional syntax options from 42- and from the comment by A.Webb, who suggested df[[3,3]] as the fastest. (Note: I also tried the same test while accessing higher row numbers, but the timing seems to be independent of which row is selected.)
df <- data.frame(a=1:1000,b=1:1000,c=1:1000,d=1:1000,e=1:1000,f=1:1000)
Unit: microseconds
expr min lq mean median uq max neval
df$c[3] 8.314 9.7610 12.870667 10.6260 12.0950 1250.339 1000
df[["c"]][3] 6.932 8.0670 9.652672 8.7075 9.9445 26.512 1000
(df[3, ])$c 72.395 77.2390 90.893724 79.8320 95.8540 256.082 1000
df[3, 3] 14.871 16.2625 19.377482 17.1180 20.1720 47.720 1000
df[3, ][3] 82.446 86.7680 102.462603 89.9660 107.7965 232.685 1000
df[3, ][[3]] 70.559 75.2140 93.581394 78.3385 93.4235 1507.933 1000
df[, 3][3] 9.933 11.4770 13.430309 12.1090 14.0900 38.213 1000
df[[3, 3]] 6.465 7.8355 9.236773 8.4500 9.6355 29.833 1000
So it looks like df[[i,j]] is fastest, followed extremely closely by df[["colname"]][i]. Which of these to use probably depends on whether you need to refer to columns by name or by number.
It remains an open question whether we can assume that this is always the case on all platforms and for all data types.
Recommended answer
As stated in my comments, df$c[3] is actually parsed to '[['(df, 'c')[3], so it's not surprising that skipping the parsing process results in faster execution. The data.table comparisons are mostly non-equivalent, except when using $, which is not really a data.table function.
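As a minimal sketch of why these expressions are not all interchangeable, the base R snippet below rebuilds the six-row data frame from the question and checks what each form returns. The variable names (v1 to v5, same_value, is_df_row, is_df_cell) are illustrative only. It makes no claim about timing. It only shows that the column-first forms yield the same scalar, whereas the row-first forms first build a one-row data frame, which is presumably where the extra work goes.

```r
# Small data frame matching the one used in the question.
df <- data.frame(a = 1:6, b = 1:6, c = 1:6, d = 1:6, e = 1:6, f = 1:6)

# Column-first (and direct [i, j]) forms all return a plain integer of length one.
v1 <- df$c[3]
v2 <- df[["c"]][3]
v3 <- df[[3, 3]]
v4 <- df[3, 3]
v5 <- df[, 3][3]
same_value <- identical(v1, v2) && identical(v1, v3) &&
  identical(v1, v4) && identical(v1, v5)

# Row-first forms build a one-row data frame before extracting anything.
row_first <- df[3, ]
is_df_row <- is.data.frame(row_first)     # the intermediate object is a data frame
is_df_cell <- is.data.frame(df[3, ][3])   # single [ keeps the data frame wrapper

print(same_value)
print(is_df_row)
print(is_df_cell)
```

Note that df[3, ][3] is the odd one out in the benchmark: it returns a 1x1 data frame rather than a bare value, so it is not a drop-in replacement for the others.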
Unit: microseconds
expr min lq mean median uq max neval cld
df$c[3] 16.035 16.8245 17.63600 17.3090 17.9400 31.158 1000 ab
df[["c"]][3] 13.008 13.9090 14.60883 14.2775 14.8355 121.634 1000 a
(df[3, ])$c 137.376 140.4895 143.57778 141.6055 143.8310 175.180 1000 d
df[3, 3] 29.316 30.5715 31.25617 30.9040 31.3165 49.764 1000 c
df[3, ][3] 156.524 159.4180 167.99243 160.3910 162.3120 2636.693 1000 e
df[3, ][[3]] 134.975 137.3945 142.92265 138.3810 140.2370 2675.090 1000 d
df[, 3][3] 20.108 21.2860 21.94357 21.5810 21.8640 59.057 1000 b
I admit to being surprised that the code I wrote, '[['(df, 'c')[3], was deparsed as df[["c"]][3], and I am rather puzzled by some of the results. The general rule, though, is that selecting the column first and then the position in the resulting vector is generally much faster.
Also, this needs to be tested with larger objects, that is, ones with rows >> cols.
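One way such a test might be set up is sketched below, assuming a hypothetical data frame `big` with 1e5 rows and three columns of different types, and a hypothetical helper `time_expr` that wraps base R's system.time. It uses system.time rather than microbenchmark only to avoid a package dependency, so the resolution is coarser. No results are shown because they will vary by machine, R version, and data.

```r
# Hypothetical test object with many more rows than columns and mixed column types.
n <- 1e5
big <- data.frame(a = seq_len(n), b = rnorm(n), c = as.character(seq_len(n)),
                  stringsAsFactors = FALSE)

# Hypothetical helper: total elapsed seconds for `reps` calls of a zero-argument function.
time_expr <- function(f, reps = 1000) {
  system.time(for (k in seq_len(reps)) f())[["elapsed"]]
}

# Time several addressing forms for the same cell (row 3, column "b").
timings <- c(
  dollar    = time_expr(function() big$b[3]),
  dbl_name  = time_expr(function() big[["b"]][3]),
  dbl_ij    = time_expr(function() big[[3, 2]]),
  single_ij = time_expr(function() big[3, 2]),
  row_first = time_expr(function() big[3, ]$b)
)
print(timings)
```

Changing n, the column types, or the row index in this sketch would be a simple way to probe whether the ranking seen in the small benchmarks holds up.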