perf report shows that the function "__memset_avx2_unaligned_erms" has overhead. Does this mean memory is unaligned?
Problem description
I am trying to profile my C++ code using the perf tool. The implementation contains code with SSE/AVX/AVX2 instructions, and it is compiled with the -O3 -mavx2 -march=native flags. I believe the __memset_avx2_unaligned_erms function is a libc implementation of memset. perf shows that this function has considerable overhead. The function name suggests that memory is unaligned, but in the code I explicitly align the memory using the GCC attribute __attribute__((aligned (x))).
What might be the reason for this function to have significant overhead, and why is the unaligned version called even though the memory is explicitly aligned?
I have attached a sample report as a picture.
Recommended answer
No, it doesn't. It means that the memset strategy glibc chose for that hardware is one that doesn't try to avoid unaligned accesses entirely in the small-size cases. (glibc selects a memset implementation at dynamic-linker symbol-resolution time, so it gets runtime dispatching with no extra overhead after the first call.)
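To illustrate the "pick once, then call directly" idea, here is a minimal sketch in ordinary C++. It is an analogy, not glibc's mechanism: glibc resolves the symbol in the dynamic linker, while this sketch caches a function pointer in a function-local static. All names (fill, fill_generic, fill_avx2_stand_in, resolve_fill) are hypothetical, and the "AVX2" variant is only a stand-in that forwards to std::memset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names throughout; an analogy for runtime dispatch, not glibc code.
using fill_fn = void (*)(unsigned char*, unsigned char, std::size_t);

static void fill_generic(unsigned char* dst, unsigned char value, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = value;
}

// Stand-in for a CPU-specific variant; a real one would use wider stores.
static void fill_avx2_stand_in(unsigned char* dst, unsigned char value, std::size_t n) {
    std::memset(dst, value, n);
}

// Runs once: inspects the CPU and returns the variant to use from then on.
static fill_fn resolve_fill() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return fill_avx2_stand_in;
#endif
    return fill_generic;
}

// The resolver runs on the first call only; later calls reuse the cached pointer.
void fill(unsigned char* dst, unsigned char value, std::size_t n) {
    assert(dst != nullptr || n == 0);
    static const fill_fn chosen = resolve_fill();
    chosen(dst, value, n);
}
```

Which variant was chosen says something about the CPU, not about the arguments, which is why the "unaligned" name shows up in perf no matter how the buffer is aligned.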
If your buffer is in fact aligned and the size is a multiple of the vector width, all the accesses will be aligned and there's essentially no overhead. (Using vmovdqu with a pointer that happens to be aligned at runtime is exactly equivalent to vmovdqa on all CPUs that support AVX.)
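If you want to confirm that a particular buffer meets both conditions, a small check like the following may help. It is a sketch under assumptions: kVecSize = 32 matches a 256-bit AVX2 vector, and the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVecSize = 32;  // assumed: one 256-bit AVX2 vector

// True if `p` is a multiple of `alignment` (which must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// True if [p, p + n) can be covered entirely by aligned kVecSize-wide stores.
inline bool all_vector_stores_aligned(const void* p, std::size_t n) {
    return is_aligned(p, kVecSize) && n % kVecSize == 0;
}
```

For example, a buffer declared as alignas(32) unsigned char buf[256]; passes all_vector_stores_aligned(buf, sizeof buf), while buf + 1 or a length of 100 does not.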
For large buffers, it still aligns the pointer before the main loop in case it isn't aligned, at the cost of a couple of extra instructions compared with an implementation that only worked for 32-byte-aligned pointers. (But if it is going to use rep stosb at all, it looks like it does so without aligning the pointer first.)
gcc+glibc doesn't have a special version of memset that's only called with aligned pointers (or multiple special versions for different alignment guarantees). glibc's AVX2-unaligned implementation works nicely for both aligned and unaligned inputs.
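It's defined in glibc/sysdeps/x86_64/multiarch/memset-avx2-unaligned-erms.S, which just defines a couple of macros (such as setting the vector size to 32) and then #includes the file containing the actual implementation.
The comment in the source code says: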
/* memset is implemented as:
1. Use overlapping store to avoid branch.
2. If size is less than VEC, use integer register stores.
3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
5. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
4 VEC stores and store 4 * VEC at a time until done. */
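The overlapping-store idea in that comment can be sketched in portable C++. This is an illustration, not the glibc code: sketch_memset and store_vec are hypothetical names, each "vector store" is a fixed-size memcpy standing in for vmovdqu, kVec = 32 is assumed, and the small-size case uses a plain byte loop instead of integer-register stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kVec = 32;  // assumed VEC_SIZE for the AVX2 variant

// Stand-in for one unaligned vector store; a fixed-size memcpy keeps it portable.
static void store_vec(unsigned char* dst, const unsigned char (&pattern)[kVec]) {
    std::memcpy(dst, pattern, kVec);
}

void* sketch_memset(void* dest, int value, std::size_t n) {
    assert(dest != nullptr || n == 0);
    auto* d = static_cast<unsigned char*>(dest);
    unsigned char pattern[kVec];
    for (unsigned char& b : pattern) b = static_cast<unsigned char>(value);

    if (n < kVec) {
        // Simplification: glibc uses integer-register stores here.
        for (std::size_t i = 0; i < n; ++i) d[i] = pattern[0];
        return dest;
    }
    if (n <= 2 * kVec) {
        // One store at the start and one ending exactly at the end. They
        // overlap unless n == 2 * kVec, so no branch on the exact size.
        store_vec(d, pattern);
        store_vec(d + n - kVec, pattern);
        return dest;
    }
    if (n <= 4 * kVec) {
        store_vec(d, pattern);
        store_vec(d + kVec, pattern);
        store_vec(d + n - 2 * kVec, pattern);
        store_vec(d + n - kVec, pattern);
        return dest;
    }
    // Large case: unaligned stores cover the first and last 4 vectors...
    for (std::size_t i = 0; i < 4; ++i) {
        store_vec(d + i * kVec, pattern);
        store_vec(d + n - (i + 1) * kVec, pattern);
    }
    // ...then whole 4-vector blocks are filled between two rounded-down addresses.
    const std::uintptr_t block = 4 * kVec;
    const auto start = reinterpret_cast<std::uintptr_t>(d);
    const auto first = static_cast<std::size_t>(((start + block) & ~(block - 1)) - start);
    const auto last = static_cast<std::size_t>(((start + n) & ~(block - 1)) - start);
    for (std::size_t off = first; off < last; off += 4 * kVec) {
        for (std::size_t i = 0; i < 4; ++i) store_vec(d + off + i * kVec, pattern);
    }
    return dest;
}
```

Some bytes are written twice, which is the price of avoiding a branch for every possible length. That trade only works for stores that tolerate any address, which is where the "unaligned" in the function name comes from.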
The actual alignment before the main loop is done after some vmovdqu vector stores (which have no penalty if used on data that is in fact aligned: https://agner.org/optimize/):
L(loop_start):
leaq (VEC_SIZE * 4)(%rdi), %rcx # rcx = input pointer + 4*VEC_SIZE
VMOVU %VEC(0), (%rdi) # store the first vector
andq $-(VEC_SIZE * 4), %rcx # align the pointer
... some more vector stores
... and stuff, including storing the last few vectors I think
addq %rdi, %rdx # size += start, giving an end-pointer
andq $-(VEC_SIZE * 4), %rdx # align the end-pointer
L(loop): # THE MAIN LOOP
VMOVA %VEC(0), (%rcx) # vmovdqa = alignment required
VMOVA %VEC(0), VEC_SIZE(%rcx)
VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx)
VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx)
addq $(VEC_SIZE * 4), %rcx
cmpq %rcx, %rdx
jne L(loop)
So with VEC_SIZE = 32, it aligns the pointer by 128. This is overkill; cache lines are 64 bytes, and really just aligning to the vector width should be fine.
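The two andq instructions are plain round-down operations, which is easier to see on integers. The sketch below mirrors the leaq/andq and addq/andq pairs using hypothetical names (LoopBounds, loop_bounds) and the assumed VEC_SIZE of 32. Because both results are multiples of 128 and the loop advances by 128, an exact-equality test is enough to end the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t kVecSize = 32;          // assumed VEC_SIZE
constexpr std::uintptr_t kBlock = 4 * kVecSize;  // 128 bytes per loop iteration

// ANDing with this mask clears the low 7 bits, i.e. rounds down to 128.
constexpr std::uintptr_t kMask = ~(kBlock - 1);
static_assert(kMask == std::uintptr_t{0} - kBlock, "same bit pattern as -(VEC_SIZE * 4)");

struct LoopBounds {
    std::uintptr_t first;  // where the aligned main loop starts (rcx)
    std::uintptr_t end;    // where it stops (rdx)
};

// Mirrors the pointer arithmetic of the assembly on plain integers.
constexpr LoopBounds loop_bounds(std::uintptr_t start, std::size_t size) {
    return LoopBounds{(start + kBlock) & kMask, (start + size) & kMask};
}
```

For a buffer starting at 0x1008 with 1000 bytes, this gives first = 0x1080 and end = 0x1380. The bytes before first and after end are the ones covered by the unaligned stores outside the loop.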
On CPUs with ERMSB, it also has a size threshold for using rep stos: if that path is enabled and the buffer is larger than 2 KiB, it uses rep stos instead. (See also: Enhanced REP MOVSB for memcpy.)
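Putting the size classes and the rep stos threshold together, the decision can be sketched as a small classifier. This is a reading of the description above, not a transcription of the glibc source: the names are hypothetical, kVecSize = 32 and the 2 KiB threshold are taken from the text, and the exact order of the checks is an assumption.

```cpp
#include <cstddef>

constexpr std::size_t kVecSize = 32;              // assumed VEC_SIZE
constexpr std::size_t kRepStosThreshold = 2048;   // 2 KiB, as stated above

enum class Strategy {
    IntegerStores,     // size < VEC_SIZE
    TwoVectorStores,   // VEC_SIZE .. 2 * VEC_SIZE
    FourVectorStores,  // 2 * VEC_SIZE .. 4 * VEC_SIZE
    AlignedLoop,       // larger, filled 4 vectors at a time
    RepStos            // larger than the threshold, on CPUs with ERMSB
};

// Hypothetical helper: which path a given size would take.
constexpr Strategy pick_strategy(std::size_t n, bool erms_enabled) {
    if (n < kVecSize) return Strategy::IntegerStores;
    if (n <= 2 * kVecSize) return Strategy::TwoVectorStores;
    if (erms_enabled && n > kRepStosThreshold) return Strategy::RepStos;
    if (n <= 4 * kVecSize) return Strategy::FourVectorStores;
    return Strategy::AlignedLoop;
}
```

None of the branches depends on the alignment of the destination pointer; only the size and the CPU feature matter.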