Memcpy vs Memmove - Debug vs Release


Problem description

I'm seeing really strange behavior in my x64 multithreaded application: the execution time in debug mode is faster than in release mode.

I broke the problem down and found the issue: the debug build (note: optimization is off!) replaces the memcpy with memmove, which performs faster. The release build (note: optimization is on!) still uses memcpy.

This problem slows down my multithreaded app in release mode. :(

Does anyone have any ideas?

#include <time.h>
#include <string.h>   // memcpy, memmove
#include <stdio.h>    // printf
#include <iostream>

#define T_SIZE 1024*1024*2

int main()
{
    clock_t start, end;

    // These buffers total roughly 202 MB, far too large for the default stack, so make them static.
    static char data[T_SIZE];
    static char store[100][T_SIZE];

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memcpy(store[i % 100], data, T_SIZE);
    }
    // Debug > Release Time 1040 < 1620
    printf("memcpy: %d\n", clock() - start);

    start = clock();
    for (int i = 0; i < 4000; i++) {
        memmove(store[i % 100], data, T_SIZE);
    }
    // Debug > Release Time 1040 > 923
    printf("memmove: %d\n", clock() - start);
}


Recommended answer

The following answer is valid for VS2013 ONLY.

What we have here is actually stranger than just memcpy vs. memmove. It's a case of the intrinsic optimization actually slowing things down. The issue stems from the fact that VS2013 inlines memcpy like this:

; 73   :        memcpy(store[i % 100], data, sizeof(data));

    mov eax, 1374389535             ; 51eb851fH
    mul esi
    shr edx, 5
    imul    eax, edx, 100               ; 00000064H
    mov ecx, esi
    sub ecx, eax
    movsxd  rcx, ecx
    shl rcx, 21
    add rcx, r14
    mov rdx, r13
    mov r8d, 16384              ; 00004000H
    npad    12
    $LL413@wmain:
    movups  xmm0, XMMWORD PTR [rdx]
    movups  XMMWORD PTR [rcx], xmm0
    movups  xmm1, XMMWORD PTR [rdx+16]
    movups  XMMWORD PTR [rcx+16], xmm1
    movups  xmm0, XMMWORD PTR [rdx+32]
    movups  XMMWORD PTR [rcx+32], xmm0
    movups  xmm1, XMMWORD PTR [rdx+48]
    movups  XMMWORD PTR [rcx+48], xmm1
    movups  xmm0, XMMWORD PTR [rdx+64]
    movups  XMMWORD PTR [rcx+64], xmm0
    movups  xmm1, XMMWORD PTR [rdx+80]
    movups  XMMWORD PTR [rcx+80], xmm1
    movups  xmm0, XMMWORD PTR [rdx+96]
    movups  XMMWORD PTR [rcx+96], xmm0
    lea rcx, QWORD PTR [rcx+128]
    movups  xmm1, XMMWORD PTR [rdx+112]
    movups  XMMWORD PTR [rcx-16], xmm1
    lea rdx, QWORD PTR [rdx+128]
    dec r8
    jne SHORT $LL413@wmain

The issue with this is that we're doing unaligned SSE loads and stores, which is actually slower than just using standard C code. I verified this by grabbing the CRT's implementation from the source code included with Visual Studio and turning it into a my_memcpy.
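For readers who don't read assembly, the inlined loop above boils down to roughly the following intrinsics sketch (an illustration of the pattern, not the compiler's exact output); the _mm_loadu_si128/_mm_storeu_si128 pairs correspond to the unaligned movups loads and stores:

#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Rough equivalent of the compiler's inlined copy loop above:
// 128 bytes per iteration via unaligned 16-byte loads and stores.
static void inlined_copy_like(char* dst, const char* src, size_t bytes)
{
    size_t iterations = bytes / 128;  // 16384 iterations for the 2 MB buffers
    for (size_t i = 0; i < iterations; ++i) {
        for (int off = 0; off < 128; off += 16) {
            __m128i chunk = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(src + off));
            _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + off), chunk);
        }
        src += 128;
        dst += 128;
    }
}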

As a way of ensuring that the cache was warm during all of this, I pre-initialized all of the data, and the results were telling:


Warm up took 43ms
my_memcpy took 862ms
memmove took 676ms
memcpy took 1329ms

So why is memmove faster? Because it doesn't get the up-front intrinsic optimization, since it must assume the data can overlap.
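That overlap requirement is easy to illustrate: with overlapping buffers memcpy is undefined behavior, while memmove must behave as if the source were first copied to a temporary. A minimal example:

#include <cstring>
#include <cstdio>

int main()
{
    char buf[] = "abcdef";
    // Shift "abcde" one position to the right within the same array.
    // The ranges overlap, so memmove is required here; memcpy would be
    // undefined behavior because it may assume the ranges are disjoint.
    memmove(buf + 1, buf, 5);
    printf("%s\n", buf);  // prints "aabcde"
    return 0;
}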

For those curious, here is my code in full:

#include <cstdlib>
#include <cstring>
#include <chrono>
#include <iostream>
#include <random>
#include <functional>
#include <limits>

namespace {
    const auto t_size = 1024ULL * 1024ULL * 2ULL;
    __declspec(align(16)) char data[t_size];
    __declspec(align(16)) char store[100][t_size];
    void * __cdecl my_memcpy(
        void * dst,
        const void * src,
        size_t count
        )
    {
        void * ret = dst;

        /*
        * copy from lower addresses to higher addresses
        */
        while (count--) {
            *(char *)dst = *(char *)src;
            dst = (char *)dst + 1;
            src = (char *)src + 1;
        }

        return(ret);
    }
}

int wmain(int argc, wchar_t* argv[])
{
    using namespace std::chrono;

    std::mt19937 rd{ std::random_device()() };
    std::uniform_int_distribution<short> dist(std::numeric_limits<char>::min(), std::numeric_limits<char>::max());
    auto random = std::bind(dist, rd);

    auto start = steady_clock::now();
    // warms up the cache and initializes
    for (size_t i = 0; i < t_size; ++i)
            data[i] = static_cast<char>(random());

    auto stop = steady_clock::now();
    std::cout << "Warm up took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        my_memcpy(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();

    std::cout << "my_memcpy took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";

    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        memmove(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();

    std::cout << "memmove took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";


    start = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        memcpy(store[i % 100], data, sizeof(data));

    stop = steady_clock::now();

    std::cout << "memcpy took " << duration_cast<milliseconds>(stop - start).count() << "ms\n";
    std::cin.ignore();
    return 0;
}



Update

While debugging I found that the compiler did detect that the code I copied from the CRT is memcpy, but it links it to the non-intrinsic version in the CRT itself, which uses rep movs instead of the massive SSE loop above. It seems the issue is ONLY with the intrinsic version.
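If the intrinsic expansion is the problem, one workaround to experiment with (a sketch added here, not something the answer itself measured) is MSVC's #pragma function, which forces a real call to the CRT memcpy instead of the inlined intrinsic form in that translation unit; /Oi- does the same for all intrinsics. Whether it actually helps should be verified by measurement:

#include <cstring>

// MSVC-specific: generate a call to the library memcpy instead of
// expanding the intrinsic (inlined SSE) form in this source file.
#pragma function(memcpy)

void copy_block(char* dst, const char* src, size_t bytes)
{
    memcpy(dst, src, bytes);  // now resolves to the CRT's implementation
}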

Per Z boson in the comments, this all appears to be very architecture dependent. On my CPU rep movsb is faster, but on older CPUs an SSE or AVX implementation has the potential to be faster. This is per the Intel Optimization Manual: with unaligned data, rep movsb can incur up to a 25% penalty on older hardware. That said, it appears that for the vast majority of cases and architectures, rep movsb will on average beat the SSE or AVX implementation.
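To see which flavor wins on a particular machine, rep movsb can be timed directly against memcpy via MSVC's __movsb intrinsic; a minimal sketch (the 2 MB size mirrors the buffers above, and results will vary by CPU and CRT version):

#include <intrin.h>   // __movsb (MSVC-specific)
#include <chrono>
#include <cstring>
#include <iostream>
#include <vector>

int main()
{
    using namespace std::chrono;
    const size_t size = 1024 * 1024 * 2;
    std::vector<unsigned char> src(size, 0x5a), dst(size);

    auto t0 = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        __movsb(dst.data(), src.data(), size);  // emits rep movsb
    auto t1 = steady_clock::now();
    for (int i = 0; i < 4000; ++i)
        memcpy(dst.data(), src.data(), size);
    auto t2 = steady_clock::now();

    std::cout << "rep movsb: " << duration_cast<milliseconds>(t1 - t0).count() << "ms\n"
              << "memcpy:    " << duration_cast<milliseconds>(t2 - t1).count() << "ms\n";
    return 0;
}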
