为什么我的 Rcpp 实现用于查找唯一项的数量比基本 R 慢? [英] Why is my Rcpp implementation for finding the number of unique items slower than base R?
问题描述
我正在尝试编写一个函数来计算字符串向量中唯一项的数量(我的问题稍微复杂一些,但这是可重现的.我根据我为 C++ 找到的答案进行了此操作.这是我的代码:
I'm trying to write a function to count the number of unique items in a string vector (my problem is slightly more complex, but this is reproducible. I did this based on answers I found for C++. Here is my code:
C++
int unique_sort(vector<string> x) {
sort(x.begin(), x.end());
return unique(x.begin(), x.end()) - x.begin();
}
int unique_set(vector<string> x) {
unordered_set<string> tab(x.begin(), x.end());
return tab.size();
}
R:
x <- paste0("x", sample(1:1e5, 1e7, replace=T))
microbenchmark(length(unique(x)),unique_sort(x), unique_set(x), times=3)
结果:
Unit: milliseconds
expr min lq mean median uq
length(unique(x)) 365.0213 373.4018 406.0209 381.7823 426.5206
unique_sort(x) 10732.1918 10847.0532 10907.6882 10961.9146 10995.4363
unique_set(x) 1948.6517 2230.3383 2334.4040 2512.0249 2527.2802
查看 unique
函数的 R 源代码(有点难以理解),它似乎在数组上使用循环将唯一元素添加到哈希并检查该哈希如果元素已经存在.
Looking at the R source code for the unique
function (it is a bit hard to follow), it seems to use a loop over the array to add unique elements to a hash and checks that hash if element already exist.
因此,我认为它应该等同于 unordered_set 方法.我不明白为什么 unordered_set 方法要慢 5 倍.
Therefore, I thought it should be equivalent to the unordered_set approach. I don't understand why the unordered_set approach is 5 times slower.
TLDR:为什么我的 C++ 代码很慢?
TLDR: why is my C++ code slow?
推荐答案
首先,请让示例具有可重现性.上面缺少 Rcpp 属性、C++11 插件和必要的头文件导入.
First, please make examples reproducible. The above lacks Rcpp Attributes, C++11 plugin, and necessary header imports.
第二,这里显示的问题是将R中的数据深度复制到C++的成本结构体.基准测试中的大部分时间都花在复制过程中.这个过程是通过选择使用 std::vector
而不是 Rcpp::CharacterVector
来触发的,它保存了 SEXP
、s-expression 或指向数据的指针.通过否定 Rcpp 对象提供的代理模型,它只执行浅拷贝,将数据导入 C++ 的直接成本将远大于其中描述的微秒为什么这个简单的 cpp 函数版本更慢?.
Second, the issue that is being displayed here is the cost of performing a deep copy of the data out of R to a C++ structure. The majority of time in your benchmark is being spent in the copy process. This process is triggered by the choice of using an std::vector<std::string>
instead of an Rcpp::CharacterVector
, which holds the SEXP
, s-expression, or a pointer to the data. By negating the proxy model offered by Rcpp objects, which only performs shallow copies, there will be an immediate cost to data importation into C++ that is much larger than mere microseconds described within Why is this simplistic cpp function version slower?.
说了这么多,我们来说说如何修改上面的例子来使用Rcpp对象.首先,请注意 Rcpp 对象有一个名为 .sort()
的成员函数,它可以准确地对具有缺失值的 Rcpp::CharacterVector
进行排序(请参阅 Rcpp FAQ 第 5.5 节:在 CharacterVector 上使用 STL 排序会产生有问题的结果 有关详细信息,这假定没有大写或特殊语言环境).其次,即使数据作为 Rcpp::CharacterVector
导入,SEXP
类型也可以用作构造 std::unordered_set
的方法>.这些修改可以在声明中带有native"的 C++ 函数中找到.
Having said that, let's talk how to modify the above example to use Rcpp objects. First, note that Rcpp objects have a member function called .sort()
that accurately sorts an Rcpp::CharacterVector
with missing values (see Rcpp FAQ Section 5.5: Sorting with STL on a CharacterVector produces problematic results for details this assumes no capitalize or special locale). Secondly, the SEXP
type can be used as a way to construct the std::unordered_set
even with the data being imported as an Rcpp::CharacterVector
. These modifications can be found in the C++ functions with "native" in their declaration.
#include <Rcpp.h>
#include <unordered_set>
#include <algorithm>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
sort(x.begin(), x.end());
return unique(x.begin(), x.end()) - x.begin();
}
// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
std::unordered_set<std::string> tab(x.begin(), x.end());
return tab.size();
}
// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
x.sort();
return std::unique(x.begin(), x.end()) - x.begin();
}
// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
std::unordered_set<SEXP> tab(x.begin(), x.end());
return tab.size();
}
测试代码:
# install.packages(c("microbenchmark"))
# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace=T))
# Run a microbenchmark
microbenchmark::microbenchmark(
length(unique(x)),
length(unique.default(x)),
unique_sort(x),
unique_set(x),
unique_sort_native(x),
unique_set_native(x),
times = 10
)
输出:
Unit: milliseconds
expr min lq mean median uq max neval
length(unique(x)) 208.0 235.3 235.7 237.2 240.2 247.4 10
length(unique.default(x)) 230.9 232.8 238.8 233.7 241.8 266.6 10
unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7 10
unique_set(x) 2528.1 2545.3 2590.1 2590.3 2631.3 2670.1 10
unique_sort_native(x) 7452.6 7482.4 7568.5 7509.0 7563.6 7917.8 10
unique_set_native(x) 175.8 176.9 179.2 178.3 182.3 183.4 10
因此,当使用 Rcpp 对象避免深拷贝时,unique_set_native
函数击败了 length(unique())
调用大约 30 毫秒.
So, when avoiding a deep copy by using Rcpp objects, the unique_set_native
function beats out the length(unique())
call by around 30 milliseconds.
这篇关于为什么我的 Rcpp 实现用于查找唯一项的数量比基本 R 慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!