What's missing/sub-optimal in this memcpy implementation?


Question

I've become interested in writing a memcpy() as an educational exercise. I won't write a whole treatise of what I did and didn't think about, but here's some guy's implementation:

__forceinline   // Since Size is usually known,
                // most useless code will be optimized out
                // if the function is inlined.

void* myMemcpy(char* Dst, const char* Src, size_t Size)
{
        void* start = Dst;
        for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) )
        {
                __m256i ymm = _mm256_loadu_si256(((const __m256i* &)Src)++);
                _mm256_storeu_si256(((__m256i* &)Dst)++, ymm);
        }

#define CPY_1B *((uint8_t * &)Dst)++ = *((const uint8_t * &)Src)++
#define CPY_2B *((uint16_t* &)Dst)++ = *((const uint16_t* &)Src)++
#define CPY_4B *((uint32_t* &)Dst)++ = *((const uint32_t* &)Src)++
#if defined _M_X64 || defined _M_IA64 || defined __amd64
#define CPY_8B *((uint64_t* &)Dst)++ = *((const uint64_t* &)Src)++
#else
#define CPY_8B _mm_storel_epi64((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const uint64_t* &)Src, ++(uint64_t* &)Dst
#endif
#define CPY16B _mm_storeu_si128((__m128i *)Dst, _mm_loadu_si128((const __m128i *)Src)), ++(const __m128i* &)Src, ++(__m128i* &)Dst

    switch (Size) {
    case 0x00:                                                      break;
    case 0x01:      CPY_1B;                                         break;
    case 0x02:              CPY_2B;                                 break;
    case 0x03:      CPY_1B; CPY_2B;                                 break;
    case 0x04:                      CPY_4B;                         break;
    case 0x05:      CPY_1B;         CPY_4B;                         break;
    case 0x06:              CPY_2B; CPY_4B;                         break;
    case 0x07:      CPY_1B; CPY_2B; CPY_4B;                         break;
    case 0x08:                              CPY_8B;                 break;
    case 0x09:      CPY_1B;                 CPY_8B;                 break;
    case 0x0A:              CPY_2B;         CPY_8B;                 break;
    case 0x0B:      CPY_1B; CPY_2B;         CPY_8B;                 break;
    case 0x0C:                      CPY_4B; CPY_8B;                 break;
    case 0x0D:      CPY_1B;         CPY_4B; CPY_8B;                 break;
    case 0x0E:              CPY_2B; CPY_4B; CPY_8B;                 break;
    case 0x0F:      CPY_1B; CPY_2B; CPY_4B; CPY_8B;                 break;
    case 0x10:                                      CPY16B;         break;
    case 0x11:      CPY_1B;                         CPY16B;         break;
    case 0x12:              CPY_2B;                 CPY16B;         break;
    case 0x13:      CPY_1B; CPY_2B;                 CPY16B;         break;
    case 0x14:                      CPY_4B;         CPY16B;         break;
    case 0x15:      CPY_1B;         CPY_4B;         CPY16B;         break;
    case 0x16:              CPY_2B; CPY_4B;         CPY16B;         break;
    case 0x17:      CPY_1B; CPY_2B; CPY_4B;         CPY16B;         break;
    case 0x18:                              CPY_8B; CPY16B;         break;
    case 0x19:      CPY_1B;                 CPY_8B; CPY16B;         break;
    case 0x1A:              CPY_2B;         CPY_8B; CPY16B;         break;
    case 0x1B:      CPY_1B; CPY_2B;         CPY_8B; CPY16B;         break;
    case 0x1C:                      CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1D:      CPY_1B;         CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1E:              CPY_2B; CPY_4B; CPY_8B; CPY16B;         break;
    case 0x1F:      CPY_1B; CPY_2B; CPY_4B; CPY_8B; CPY16B;         break;
    }
#undef CPY_1B
#undef CPY_2B
#undef CPY_4B
#undef CPY_8B
#undef CPY16B
        return start;
}

The comment in the code means that, since Size is usually known at the call site, the compiler can optimize out most of the useless code when the function is inlined.

I would like to improve, if possible, on this implementation - but maybe there isn't much to improve. I see it uses SSE/AVX for the larger chunks of memory, then instead of a loop over the last < 32 bytes does the equivalent of manual unrolling, with some tweaking. So, here are my questions:

  • Why is the loop for the last few bytes unrolled, rather than partially unrolling the first (and now single) loop?
  • What about alignment issues? Are they unimportant? Should I handle the first few bytes, up to some alignment quantum, differently, and then perform the 256-bit operations on aligned sequences of bytes? If so, how do I determine the appropriate alignment quantum?
  • What's the most significant missing feature in this implementation (if any)?

Points raised so far in the comments and answers:

  • You should __restrict__ your parameters. (@chux)
  • The memory bandwidth is a limiting factor; measure your implementation against it. (@Zboson)
  • For small arrays, you can expect to approach the memory bandwidth; for larger arrays - not as much. (@Zboson)
  • Multiple threads (may be | are) necessary to saturate the memory bandwidth. (@Zboson)
  • It is probably wise to optimize differently for large and small copy sizes. (@Zboson)
  • (Is alignment important? Not explicitly addressed!)
  • The compiler should be made more explicitly aware of "obvious facts" it can use for optimization (such as the fact that Size < 32 after the first loop). (@chux)
  • There are arguments for unrolling your SSE/AVX calls (@BenJackson, here), and arguments against doing so (@PaulR).
  • Non-temporal transfers (with which you tell the CPU you don't need it to cache the target location) should be useful for copying larger buffers. (@Zboson)
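
Two of those points (restrict-qualified parameters and making Size < 32 explicit after the vector loop) can be sketched as follows. This is a hypothetical variant, not the code from the question: it uses the GCC/Clang spellings __restrict__ and __builtin_unreachable() (MSVC would use __restrict and __assume(Size < 32)), and it replaces the 32-way switch with a plain byte tail just to keep the sketch short.

#include <immintrin.h>
#include <stddef.h>

/* Hypothetical sketch, not the questioner's code: restrict-qualified
   pointers plus an explicit hint that Size < 32 after the vector loop. */
static inline void* myMemcpyHinted(char* __restrict__ Dst,
                                   const char* __restrict__ Src,
                                   size_t Size)
{
    void* start = Dst;
    for ( ; Size >= sizeof(__m256i); Size -= sizeof(__m256i) ) {
        __m256i ymm = _mm256_loadu_si256((const __m256i*)Src);
        _mm256_storeu_si256((__m256i*)Dst, ymm);
        Src += sizeof(__m256i);
        Dst += sizeof(__m256i);
    }
    if (Size >= sizeof(__m256i))     /* cannot happen after the loop...        */
        __builtin_unreachable();     /* ...which the optimizer is free to use  */
    while (Size--)                   /* simple byte tail instead of the switch */
        *Dst++ = *Src++;
    return start;
}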

Answer

I have been studying how to measure memory bandwidth for Intel processors with various operations, and one of them is memcpy. I have done this on Core2, Ivy Bridge, and Haswell. I did most of my tests using C/C++ with intrinsics (see the code below - but I'm currently rewriting my tests in assembly).

To write your own efficient memcpy function it's important to know what the absolute best bandwidth possible is. This bandwidth is a function of the size of the arrays which will be copied and therefore an efficient memcpy function needs to optimize differently for small and big (and maybe in between). To keep things simple I have optimized for small arrays of 8192 bytes and large arrays of 1 GB.

For small arrays the maximum read and write bandwidth for each core is:

Core2-Ivy Bridge             32 bytes/cycle
Haswell                      64 bytes/cycle

This is the benchmark you should aim for with small arrays. For my tests I assume the arrays are aligned to 64 bytes and that the array size is a multiple of 8*sizeof(float)*unroll_factor. Here are my current memcpy results for a size of 8192 bytes (Ubuntu 14.04, GCC 4.9, EGLIBC 2.19):

                             GB/s     efficiency
    Core2 (p9600@2.66 GHz)  
        builtin               35.2    41.3%
        eglibc                39.2    46.0%
        asmlib:               76.0    89.3%
        copy_unroll1:         39.1    46.0%
        copy_unroll8:         73.6    86.5%
    Ivy Bridge (E5-1620@3.6 GHz)                        
        builtin              102.2    88.7%
        eglibc:              107.0    92.9%
        asmlib:              107.6    93.4%
        copy_unroll1:        106.9    92.8%
        copy_unroll8:        111.3    96.6%
    Haswell (i5-4250U@1.3 GHz)
        builtin:              68.4    82.2%     
        eglibc:               39.7    47.7%
        asmlib:               73.2    87.6%
        copy_unroll1:         39.6    47.6%
        copy_unroll8:         81.9    98.4%
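
The efficiency column appears to be computed against the per-core peak quoted above, i.e. bytes/cycle multiplied by the clock frequency; a quick check of the copy_unroll8 rows:

    Core2      (2.66 GHz): 32 B/cycle * 2.66 GHz =  85.1 GB/s  ->  73.6 /  85.1 ≈ 86.5%
    Ivy Bridge (3.60 GHz): 32 B/cycle * 3.60 GHz = 115.2 GB/s  -> 111.3 / 115.2 ≈ 96.6%
    Haswell    (1.30 GHz): 64 B/cycle * 1.30 GHz =  83.2 GB/s  ->  81.9 /  83.2 ≈ 98.4%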

The asmlib is Agner Fog's asmlib. The copy_unroll1 and copy_unroll8 functions are defined below.

From this table we can see that the GCC builtin memcpy does not work well on Core2 and that memcpy in EGLIBC does not work well on Core2 or Haswell. I did check out a head version of GLIBC recently and the performance was much better on Haswell. In all cases unrolling gets the best result.

void copy_unroll1(const float *x, float *y, const int n) {
    for(int i=0; i<n/JUMP; i++) {
        VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
    }
}

void copy_unroll8(const float *x, float *y, const int n) {
    for(int i=0; i<n/JUMP; i+=8) {
        VECNF().LOAD(&x[JUMP*(i+0)]).STORE(&y[JUMP*(i+0)]);
        VECNF().LOAD(&x[JUMP*(i+1)]).STORE(&y[JUMP*(i+1)]);
        VECNF().LOAD(&x[JUMP*(i+2)]).STORE(&y[JUMP*(i+2)]);
        VECNF().LOAD(&x[JUMP*(i+3)]).STORE(&y[JUMP*(i+3)]);
        VECNF().LOAD(&x[JUMP*(i+4)]).STORE(&y[JUMP*(i+4)]);
        VECNF().LOAD(&x[JUMP*(i+5)]).STORE(&y[JUMP*(i+5)]);
        VECNF().LOAD(&x[JUMP*(i+6)]).STORE(&y[JUMP*(i+6)]);
        VECNF().LOAD(&x[JUMP*(i+7)]).STORE(&y[JUMP*(i+7)]);
    }
}

Where VECNF().LOAD is _mm_load_ps() for SSE or _mm256_load_ps() for AVX, VECNF().STORE is _mm_store_ps() for SSE or _mm256_store_ps() for AVX, and JUMP is 4 for SSE or 8 for AVX.
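
For concreteness, here is a minimal sketch of what copy_unroll1 could expand to in the AVX case (the actual VECNF vector-class wrapper is not shown in the answer, so this expansion is an assumption):

#include <immintrin.h>

#define JUMP 8   /* 8 floats per 256-bit register in the AVX case */

/* Assumed AVX expansion of copy_unroll1; x and y must be 32-byte aligned. */
void copy_unroll1_avx(const float *x, float *y, const int n) {
    for(int i=0; i<n/JUMP; i++) {
        __m256 v = _mm256_load_ps(&x[JUMP*i]);   /* aligned 32-byte load  */
        _mm256_store_ps(&y[JUMP*i], v);          /* aligned 32-byte store */
    }
}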

For the large size the best result is obtained by using non-temporal store instructions and by using multiple threads. Contrary to what many people may believe, a single thread does NOT usually saturate the memory bandwidth.

void copy_stream(const float *x, float *y, const int n) {
    #pragma omp parallel for        
    for(int i=0; i<n/JUMP; i++) {
        VECNF v = VECNF().load_a(&x[JUMP*i]);
        stream(&y[JUMP*i], v);
    }
}

Where stream is _mm_stream_ps() for SSE or _mm256_stream_ps() for AVX.
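
Under the same assumed AVX expansion as before, copy_stream comes out roughly as follows; the only change is the non-temporal _mm256_stream_ps in place of the regular store:

#include <immintrin.h>

/* Assumed AVX expansion of copy_stream (x, y 32-byte aligned, n a multiple
   of 8); compile with OpenMP enabled for the pragma to take effect. */
void copy_stream_avx(const float *x, float *y, const int n) {
    #pragma omp parallel for
    for(int i=0; i<n/8; i++) {
        __m256 v = _mm256_load_ps(&x[8*i]);   /* aligned load                        */
        _mm256_stream_ps(&y[8*i], v);         /* non-temporal store (bypasses cache) */
    }
}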

Here are the memcpy results on my E5-1620@3.6 GHz with four threads for 1 GB with a maximum main memory bandwidth of 51.2 GB/s.

                         GB/s     efficiency
    eglibc:              23.6     46%
    asmlib:              36.7     72%
    copy_stream:         36.7     72%

Once again EGLIBC performs poorly. This is because it does not use non-temporal stores.

I modified the eglibc and asmlib memcpy functions to run in parallel like this:

void COPY(const float * __restrict x, float * __restrict y, const int n) {
    #pragma omp parallel
    {
        size_t my_start, my_size;
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();
        my_start = (id*n)/num;                 // start of this thread's contiguous slice
        my_size = ((id+1)*n)/num - my_start;   // slice length (last thread's slice ends at n)
        memcpy(y+my_start, x+my_start, sizeof(float)*my_size);  // each thread runs a serial memcpy on its slice
    }
}

A general memcpy function needs to account for arrays which are not aligned to 64 bytes (or even to 32 or to 16 bytes) and where the size is not a multiple of 32 bytes or of the unroll factor. Additionally, a decision has to be made as to when to use non-temporal stores. The general rule of thumb is to only use non-temporal stores for sizes larger than half the largest cache level (usually L3). But these are "second order" details which I think should be dealt with after optimizing for the ideal cases of large and small. There's not much point in worrying about correcting for misalignment or non-ideal size multiples if the ideal case performs poorly as well.
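
As a sketch of how such a dispatch might look (the cache-size query and the two specialized copy routines are hypothetical placeholders, not real library calls):

#include <stddef.h>

/* Hypothetical placeholders - not real library functions. */
size_t l3_cache_bytes(void);
void*  copy_small(void *dst, const void *src, size_t n);      /* regular (temporal) stores */
void*  copy_large_nt(void *dst, const void *src, size_t n);   /* non-temporal, threaded    */

void* memcpy_dispatch(void *dst, const void *src, size_t n) {
    /* Rule of thumb from above: use non-temporal stores only when the copy
       is larger than half the last-level cache. */
    if (n > l3_cache_bytes() / 2)
        return copy_large_nt(dst, src, n);
    return copy_small(dst, src, n);
}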

Update

Based on comments by Stephen Canon I have learned that on Ivy Bridge and Haswell it's more efficient to use rep movsb than movntdqa (a non-temporal store instruction). Intel calls this enhanced rep movsb (ERMSB). This is described in the Intel Optimization manuals in the section 3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB).
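
A minimal GNU C (GCC/Clang, x86-64) sketch of a rep movsb copy, shown only to illustrate the instruction; this is not the benchmark code behind the tables above:

#include <stddef.h>

/* Copy n bytes with a single rep movsb; on CPUs with ERMSB (Ivy Bridge and
   later) this can compete with a hand-written SIMD loop for large sizes. */
static void* rep_movsb_copy(void *dst, const void *src, size_t n) {
    void *ret = dst;
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)  /* rdi, rsi, rcx */
                     :
                     : "memory");
    return ret;
}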

Additionally, in Agner Fog's Optimizing Subroutines in Assembly manual in section 17.9 Moving blocks of data (All processors) he writes:

"There are several ways of moving large blocks of data. The most common methods are:

  1. REP MOVS instruction.
  2. If data are aligned: read and write in a loop with the biggest available register size.
  3. If size is constant: inline move instructions.
  4. If data are misaligned: first move as many bytes as required to make the destination aligned. Then read unaligned and write aligned in a loop with the biggest available register size.
  5. If data are misaligned: read aligned, shift to compensate for misalignment and write aligned.
  6. If the data size is too big for caching, use non-temporal writes to bypass the cache. Shift to compensate for misalignment, if necessary."
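
Method 4, for instance, could be sketched like this (an illustration of the idea, not Agner Fog's code: the destination is aligned first, then the bulk is copied with unaligned loads and aligned stores):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void copy_align_dst(uint8_t *dst, const uint8_t *src, size_t n) {
    while (n && ((uintptr_t)dst & 31)) {   /* peel bytes until dst is 32-byte aligned */
        *dst++ = *src++;
        n--;
    }
    for ( ; n >= 32; n -= 32, src += 32, dst += 32) {
        __m256i v = _mm256_loadu_si256((const __m256i*)src);  /* unaligned read */
        _mm256_store_si256((__m256i*)dst, v);                 /* aligned write  */
    }
    while (n--) *dst++ = *src++;           /* remaining tail bytes */
}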

A general memcpy should consider each of these points. Additionally, with Ivy Bridge and Haswell it seems that point 1 is better than point 6 for large arrays. Different techniques are necessary for Intel and AMD and for each iteration of technology. I think it's clear that writing your own general efficient memcpy function can be quite complicated. But in the special cases I have looked at I have already managed to do better than the GCC builtin memcpy or the one in EGLIBC, so the assumption that you can't do better than the standard libraries is incorrect.
