Why aren't std::count and std::find optimised to use memchr?


Question

I was reading sehe's answer to this question and was surprised to see that sehe found a hand-written loop using std::memchr to be over 3 times faster than using std::count (see comments). The code using std::count can be seen in edit 2, but it basically boils down to:

const auto num_lines = std::count(f, l, '\n');

vs.

uintmax_t num_lines = 0;
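// memchr finds the next '\n'; count it, step past it, and repeat until the range is exhausted.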
while (f && f != l)
    if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
        num_lines++, f++;

I would have expected the std::count version to be at least as fast as the std::memchr one - for a similar reason to why using std::copy should be at least as fast as std::memcpy.

I checked my standard library's (libc++) implementation of std::count, and there is no attempt to optimise for char input types (ditto for std::find).

Why is this? Could the implementations not dispatch to std::memchr if provided with char* iterators, and a char value?
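As a rough illustration of the idea (only a sketch, not libc++'s actual code; my_find is just a placeholder name), a non-template overload for contiguous char ranges could forward to memchr, while everything else falls back to the generic loop:

#include <algorithm>
#include <cstring>

// Generic fallback: the ordinary element-by-element search.
template <class It, class T>
It my_find(It first, It last, const T& value)
{
    return std::find(first, last, value);
}

// Overload chosen for contiguous char ranges: forward to the libc routine.
inline const char* my_find(const char* first, const char* last, char value)
{
    const void* p = std::memchr(first, static_cast<unsigned char>(value),
                                static_cast<std::size_t>(last - first));
    return p ? static_cast<const char*>(p) : last;
}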

Answer

Using an actual function call to memchr is only a win if the average distance between matches is not small.

Especially for count, calling memchr could be a lot slower if you were counting t characters that appear on average every 2 or maybe every 4 bytes (e.g. with DNA base-pairs using an alphabet of ACGT).

I'd be skeptical of using a memchr loop as the default implementation for std::count over char arrays. More likely, there are other ways to tweak the source so it compiles to better asm.

For find it would make more sense, even though it does potentially increase the overhead significantly vs. a simple byte-at-a-time loop if there's a hit in the first couple bytes.

You could also look at this as a compiler missed-optimization. If compilers made better code for the loops in std::count and std::find, there'd be less to gain from calling hand-written asm library functions.

gcc and clang never auto-vectorize loops when the trip-count isn't known before entering the loop. (i.e. they don't do search loops, which is a major missed optimization for element sizes as small as bytes). ICC doesn't have this limitation, and can vectorize search loops. I haven't looked at how it does with libc++'s std::count or find, though.

std::count has to check every element, so it should auto-vectorize. But if gcc or clang don't, even with -O3, then that's unfortunate. It should vectorize very well on x86 with pcmpeqb (packed compare bytes), and then paddb the 0/-1 compare results. (At least every 255 iterations, psadbw against zero to horizontally sum the byte elements before they can overflow.)
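A minimal sketch of that approach with SSE2 intrinsics (my own illustration, not library code; it only counts whole 16-byte chunks and leaves out the scalar tail): pcmpeqb yields 0/-1 per byte, subtracting that from a byte accumulator adds 1 per match, and psadbw against zero flushes the byte counters into 64-bit totals before they can wrap.

#include <emmintrin.h>   // SSE2
#include <cstddef>
#include <cstdint>

uint64_t count_bytes_sse2(const char* p, std::size_t len, char target)
{
    const __m128i needle = _mm_set1_epi8(target);
    __m128i total = _mm_setzero_si128();   // two 64-bit partial sums from psadbw
    __m128i acc   = _mm_setzero_si128();   // 16 per-byte match counters
    int inner = 0;

    for (std::size_t i = 0; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        __m128i eq = _mm_cmpeq_epi8(chunk, needle);   // 0xFF where equal, 0x00 elsewhere
        acc = _mm_sub_epi8(acc, eq);                  // subtracting -1 adds 1 per match
        if (++inner == 255) {                         // flush before a byte counter can wrap
            total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));
            acc = _mm_setzero_si128();
            inner = 0;
        }
    }
    total = _mm_add_epi64(total, _mm_sad_epu8(acc, _mm_setzero_si128()));

    // Add the two 64-bit halves of 'total'.
    return static_cast<uint64_t>(_mm_cvtsi128_si64(total)) +
           static_cast<uint64_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total)));
}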

Library function call overhead is at least an indirect call with a function pointer from memory (which can cache miss). On Linux with dynamic linking there's usually an extra jmp through the PLT as well (unless you compiled with -fno-plt). memchr is easier to optimize with low startup overhead than strchr, because you can quickly check whether a 16B vector load can go past the end (vs. aligning the pointer for strchr or strlen to avoid crossing a page or cache-line boundary).

If calling memchr is the best way to implement something in asm, then in theory that's what the compiler should emit. gcc/clang already optimize large copy loops to calls to libc memcpy, depending on target options (-march=). e.g. when the copy is large enough that the libc version may decide to use NT stores on x86.
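To illustrate the copy-loop case (my own example; whether the idiom is recognized depends on compiler version and flags), a plain byte-copy loop like the one below is typically lowered by gcc and clang at higher optimization levels to a call to memcpy instead of being left as a byte-at-a-time loop:

#include <cstddef>

void copy_bytes(char* __restrict dst, const char* __restrict src, std::size_t n)
{
    // Simple copy idiom; with optimization enabled, gcc/clang can turn this into a memcpy call.
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}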

