Why does R's duplicated perform better on sorted data?
Question
While comparing the efficiency of two functions in an answer to Check if list contains another list in R, I stumbled upon an interesting result. Sorting greatly increases the efficiency of duplicated when the vector is large. This came as a surprise, as I had never noticed a considerable difference in my own work using duplicated. Indeed, for sizes that I work with every day, there isn't a difference. Observe:
set.seed(1007)
s1 <- sample(10^2, 10^3, replace = TRUE)
s1_sort <- sort(s1)
library(microbenchmark)
microbenchmark(dp=duplicated(s1), dp_sort=duplicated(s1_sort), times=1000)
Unit: microseconds
expr min lq mean median uq max neval cld
dp 16.459 16.9425 22.06371 17.2965 22.5050 1541.137 1000 a
dp_sort 17.007 17.5005 25.54953 17.8200 23.3655 1549.198 1000 a
As you can see, there is no noticeable difference in timings when the vector is sorted. However, on very large vectors, the results are much different. Observe:
s2 <- sample(10^6, 10^7, replace = TRUE)
s2_sort <- sort(s2)
microbenchmark(dp=duplicated(s2), dp_sort=duplicated(s2_sort), times=100)
Unit: milliseconds
expr min lq mean median uq max neval cld
dp 816.6883 847.9231 869.6829 861.8210 882.3978 1019.6339 100 b
dp_sort 287.6779 305.4779 322.8830 315.1198 324.9249 449.1734 100 a
Almost 3x faster!!! This led me down the rabbit hole, which began here: r-source.../duplicated.R. From here we see that duplicated makes a call to .Internal(duplicated(x, ...)). Then, using the function pryr::show_c_source(.Internal(duplicated(x))) and the workaround suggested by @joran (show_c_source is currently giving issues; see Is 'show_c_source()' broken?), we see that duplicated makes a call to do_duplicated. Finally, the heart of duplicated is revealed (it starts at line 667 and ends at 988). It appears that the entire vector is looped over and then some hashing occurs:
724 /* count unique entries */
725 k = 0;
726 for (i = 0; i < n; i++)
727 if (LOGICAL(dup)[i] == 0)
728 k++;
776 /* Build a hash table, ignoring information on duplication */
777 static void DoHashing(SEXP table, HashData *d)
I don't fully understand all of the code, but it seems like sorting shouldn't matter. We loop over the entire vector in either case (sorted vs. non-sorted) and ultimately call an assortment of hash functions, which shouldn't depend on whether a vector is sorted or not. My initial thought was that some sort of branch prediction was going on (see this question), but from the update to this answer, it seems that these things shouldn't matter any more.
What is going on?
The gap seems to increase as both the size of the vector and the number of duplicates increase.
set.seed(496)
s3 <- sample(10^6, 10^8, replace = TRUE)
s3_sort <- sort(s3)
microbenchmark(dp=duplicated(s3), dp_sort=duplicated(s3_sort), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dp 12.149932 12.175665 12.848843 12.495599 12.719861 15.589190 10 b
dp_sort 2.395636 2.401837 2.706674 2.551375 2.677556 4.373653 10 a
As @alexis_laz pointed out, if there are no duplicates, the impact of sorting is greatly reduced.
s4 <- sample(10^8)
s4_sort <- sort(s4)
microbenchmark(dp=duplicated(s4), dp_sort=duplicated(s4_sort), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dp 8.013995 8.130565 8.593626 8.197501 8.438703 10.639452 10 b
dp_sort 6.135788 6.158140 6.751101 6.256739 7.241381 8.913507 10 a
Accepted Answer
The major factor is the rate of CPU cache misses, and as size scales, more expensive page faults. Duplication is checked by reference to a simple hash table. If the portion of the hash table being queried is already in the high speed memory cache, then these lookups are much faster. For small vectors, the corresponding hash table will entirely fit into the high speed memory cache, so the order of access is not significant, which is what you saw in your first benchmark.
对于较大的向量,在任何给定时间,仅哈希表的某些块将适合高速缓存。如果重复是连续的,则查找所需的哈希表部分将已经在高速缓存中,用于后续查找。这就是为什么对于较大向量,性能会随着重复项数量的增加而提高的原因。对于极大的向量,哈希表甚至可能无法完全放入可用的物理内存中,也无法分页到磁盘上,从而使差异更加明显。
For larger vectors, only some blocks of the hash table will fit into the cache at any given time. If duplicates are consecutive, then the portion of the hash table needed for lookup will already be in the cache for the subsequent lookups. This is why performance improves with the number of duplicates for larger vectors. For extremely large vectors, the hash table may not even fit entirely into available physical memory and may be paged out to disk, making the difference even more noticeable.
To test this out, let's use the original post's s2 vector and its sorted version, but also test out just having the duplicates next to each other but otherwise unsorted.
# samples as in original post
s2 <- sample(10^6, 10^7, replace = TRUE)
s2_sort <- sort(s2)
# in the same order as s2, but with duplicates brought together
u2 <- unique(s2)
t2 <- rle(s2_sort)
s2_chunked <- rep(u2, times = t2$lengths[match(u2, t2$values)])
Let's also consider just sorting by hash value. I'll approximate the hash coding in R, but we are dealing with double-precision values here rather than being able to use unsigned longs, so we won't be able to use bitwise ops.
# in the order of hash value
K <- ceiling(log2(length(s2)*2))
M <- 2^K
h <- ((3141592653 * s2) %% 2^32)/2^(32-K)
ho <- order(h)
s2_hashordered <- s2[ho]
What we expect to see is that performance is similar for s2_sort and s2_chunked, and even better for s2_hashordered. In each of these cases we've attempted to minimize cache misses.
microbenchmark(
duplicated(s2),
duplicated(s2_sort),
duplicated(s2_chunked),
duplicated(s2_hashordered),
times=10)
Unit: milliseconds
expr min lq mean median uq max neval cld
duplicated(s2) 664.5652 677.9340 690.0001 692.3104 703.8312 711.1538 10 c
duplicated(s2_sort) 245.6511 251.3861 268.7433 276.2330 279.2518 284.6589 10 b
duplicated(s2_chunked) 240.0688 243.0151 255.3857 248.1327 276.3141 283.4298 10 b
duplicated(s2_hashordered) 166.8814 169.9423 185.9345 185.1822 202.7478 209.0383 10 a