Why aren't std::count and std::find optimised to use memchr?
Question
I was reading sehe's answer to this question and was surprised to see sehe found a hand-written loop using std::memchr to be over 3 times faster than using std::count (see comments). The code using std::count can be seen in edit 2, but it basically boils down to:
const auto num_lines = std::count(f, l, '\n');
vs.
uintmax_t num_lines = 0;
while (f && f != l)
    if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
        num_lines++, f++;
I would have expected the std::count version to be at least as fast as the std::memchr one - for a similar reason to why using std::copy should be at least as fast as std::memcpy.
I checked my standard library's (libc++) implementation of std::count, and there is no attempt to optimise for char input types (ditto for std::find).
Why is this? Could the implementations not dispatch to std::memchr if provided with char* iterators and a char value?
Answer
Using an actual function call to memchr is only a win if the average distance between matches is not small.
Especially for count, calling memchr could be a lot slower if you were counting t characters when they appear on average every 2 or maybe every 4 bytes (e.g. with DNA base-pairs using an alphabet of ACGT).
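To make the dense-match case concrete, here is a hypothetical side-by-side of the two loop shapes (the memchr loop is the one from the question; the function names are just for illustration):

```cpp
#include <cstring>
#include <cstdint>

// The memchr-based loop from the question, wrapped for comparison.
// With dense matches (e.g. 'T' in ACGT data, a hit every ~4 bytes),
// each memchr call scans only a few bytes before returning, so the
// per-call overhead dominates.
uintmax_t count_memchr(const char* f, const char* l, char c) {
    uintmax_t n = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, c, l - f))))
            n++, f++;
    return n;
}

// A plain byte-at-a-time loop: no call overhead, branchless body,
// and an easy target for compiler auto-vectorization.
uintmax_t count_simple(const char* f, const char* l, char c) {
    uintmax_t n = 0;
    for (; f != l; ++f)
        n += (*f == c);
    return n;
}
```

On sparse data (one newline per long line) the memchr version amortizes its call overhead over many skipped bytes; on ACGT-style data it pays that overhead every few bytes.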
I'd be skeptical of using a memchr loop as the default implementation for std::count over char arrays. More likely, there are other ways to tweak the source so it compiles to better asm.
For find it would make more sense, even though it does potentially increase the overhead significantly vs. a simple byte-at-a-time loop if there's a hit in the first couple of bytes.
You could also look at this as a compiler missed-optimization. If compilers made better code for the loops in std::count and std::find, there'd be less to gain from calling hand-written asm library functions.
gcc and clang never auto-vectorize loops when the trip-count isn't known before entering the loop. (i.e. they don't do search loops, which is a major missed optimization for element sizes as small as bytes). ICC doesn't have this limitation, and can vectorize search loops. I haven't looked at how it does with libc++'s std::count or find, though.
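A search loop is "vectorizable" in the sense that you can compare 16 bytes at a time and bail out on the first hit, even though the trip count isn't known up front. A minimal SSE2 sketch of that shape (not from the answer; `__builtin_ctz` assumes gcc/clang):

```cpp
#include <immintrin.h>

// SSE2 search loop: pcmpeqb compares 16 bytes at once, pmovmskb turns
// the 0/-1 byte results into a 16-bit mask, and the lowest set bit
// gives the position of the first match. The trip count is unknown
// before entry, which is exactly the loop shape gcc/clang won't
// auto-vectorize from scalar source.
const char* find_byte_sse2(const char* f, const char* l, char c) {
    const __m128i needle = _mm_set1_epi8(c);
    while (l - f >= 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(f));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        if (mask)
            return f + __builtin_ctz(mask);  // lowest set bit = first match
        f += 16;
    }
    for (; f != l; ++f)  // scalar tail for the last < 16 bytes
        if (*f == c)
            return f;
    return l;
}
```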
std::count has to check every element, so it should auto-vectorize. But if gcc or clang don't, even with -O3, then that's unfortunate. It should vectorize very well on x86 with pcmpeqb (packed compare bytes), and then paddb the 0/-1 compare results. (Every 255 iterations at least, psadbw against zero to horizontally sum the byte elements.)
Library function call overhead is at least an indirect call with a function pointer from memory (which can cache miss). On Linux with dynamic linking there's usually an extra jmp through the PLT as well (unless you compiled with -fno-plt). memchr is easier to optimize with low startup overhead than strchr, because you can quickly check whether a 16B vector load can go past the end (vs. aligning the pointer for strchr or strlen to avoid crossing a page or cache-line boundary).
If calling memchr is the best way to implement something in asm, then in theory that's what the compiler should emit. gcc/clang already optimize large copy loops to calls to libc memcpy, depending on target options (-march=), e.g. when the copy is large enough that the libc version may decide to use NT stores on x86.