How to increase performance of memcpy


Question

Summary:

memcpy seems unable to transfer over 2GB/sec on my system in a real or test application. What can I do to get faster memory-to-memory copies?

Full details:

As part of a data capture application (using some specialized hardware), I need to copy about 3 GB/sec from temporary buffers into main memory. To acquire data, I provide the hardware driver with a series of buffers (2MB each). The hardware DMAs data to each buffer, and then notifies my program when each buffer is full. My program empties the buffer (memcpy to another, larger block of RAM), and reposts the processed buffer to the card to be filled again. I am having issues with memcpy moving the data fast enough. It seems that the memory-to-memory copy should be fast enough to support 3GB/sec on the hardware that I am running on. Lavalys EVEREST gives me a 9337MB/sec memory copy benchmark result, but I can't get anywhere near those speeds with memcpy, even in a simple test program.

I have isolated the performance issue by adding and removing the memcpy call inside the buffer-processing code. Without the memcpy, I can run at the full data rate, about 3 GB/sec. With the memcpy enabled, I am limited to about 550 MB/sec (using the current compiler).

In order to benchmark memcpy on my system, I've written a separate test program that just calls memcpy on some blocks of data. (I've posted the code below.) I've run this both in the compiler/IDE that I'm using (National Instruments CVI) and in Visual Studio 2010. While I'm not currently using Visual Studio, I am willing to make the switch if it will yield the necessary performance. However, before blindly moving over, I wanted to make sure that it would solve my memcpy performance problems.

Visual C++ 2010: 1900 MB/sec

NI CVI 2009: 550 MB/sec

While I am not surprised that CVI is significantly slower than Visual Studio, I am surprised that the memcpy performance is this low. While I'm not sure if this is directly comparable, this is much lower than the EVEREST benchmark bandwidth. While I don't need quite that level of performance, a minimum of 3GB/sec is necessary. Surely the standard library implementation can't be this much worse than whatever EVEREST is using!

What, if anything, can I do to make memcpy faster in this situation?
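One technique often suggested for large block copies (this is not from the original post; a minimal sketch, assuming an x86 CPU with SSE2) is to use non-temporal streaming stores, which bypass the cache on the write side and can get closer to the bandwidth that synthetic benchmarks like EVEREST report:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Copy 'bytes' from src to dest using non-temporal (streaming) stores.
   Assumes dest is 16-byte aligned and bytes is a multiple of 16.
   Illustrative sketch only, not the poster's code. */
void stream_copy(void *dest, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *) dest;
    const __m128i *s = (const __m128i *) src;
    size_t n = bytes / 16;

    for (size_t i = 0; i < n; i++) {
        __m128i v = _mm_loadu_si128(&s[i]); /* source may be unaligned */
        _mm_stream_si128(&d[i], v);         /* store bypasses the cache */
    }
    _mm_sfence(); /* make the streaming stores globally visible */
}
```

Because the destination lines are not pulled into the cache, this tends to help most when the copied data is not read again soon afterward, which matches the capture-buffer pattern described above.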

Hardware details:
AMD Magny Cours, 4x octal core
128 GB DDR3
Windows Server 2003 Enterprise X64

Test program:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>  /* malloc, rand, free */
#include <string.h>  /* memcpy */

const size_t NUM_ELEMENTS = 2 * 1024 * 1024;
const size_t ITERATIONS = 10000;

int main (int argc, char *argv[])
{
    LARGE_INTEGER start, stop, frequency;

    QueryPerformanceFrequency(&frequency);

    unsigned short * src = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);
    unsigned short * dest = (unsigned short *) malloc(sizeof(unsigned short) * NUM_ELEMENTS);

    for(size_t ctr = 0; ctr < NUM_ELEMENTS; ctr++)
    {
        src[ctr] = rand();
    }

    QueryPerformanceCounter(&start);

    for(size_t iter = 0; iter < ITERATIONS; iter++)
        memcpy(dest, src, NUM_ELEMENTS * sizeof(unsigned short));

    QueryPerformanceCounter(&stop);

    __int64 duration = stop.QuadPart - start.QuadPart;

    double duration_d = (double)duration / (double) frequency.QuadPart;

    /* Each iteration copies NUM_ELEMENTS * sizeof(unsigned short) bytes (4 MB). */
    double mb_sec = (ITERATIONS * (NUM_ELEMENTS / 1024 / 1024) * sizeof(unsigned short)) / duration_d;

    printf("Duration: %.5lfs for %d iterations, %.3lfMB/sec\n", duration_d, (int)ITERATIONS, mb_sec);

    free(src);
    free(dest);

    getchar();

    return 0;
}

EDIT: If you have an extra five minutes and want to contribute, please run the above code on your machine and post your time as a comment.

Recommended answer

I have found a way to increase speed in this situation. I wrote a multi-threaded version of memcpy, splitting the region to be copied between threads. Here are some performance scaling numbers for a fixed block size, using the same timing code as above. I had no idea that the performance, especially for this small block size, would scale to this many threads. I suspect it has something to do with the large number of memory controllers (16) on this machine.

Performance (10000x 4MB block memcpy):

 1 thread :  1826 MB/sec
 2 threads:  3118 MB/sec
 3 threads:  4121 MB/sec
 4 threads: 10020 MB/sec
 5 threads: 12848 MB/sec
 6 threads: 14340 MB/sec
 8 threads: 17892 MB/sec
10 threads: 21781 MB/sec
12 threads: 25721 MB/sec
14 threads: 25318 MB/sec
16 threads: 19965 MB/sec
24 threads: 13158 MB/sec
32 threads: 12497 MB/sec

I don't understand the huge performance jump between 3 and 4 threads. What would cause a jump like this?

I've included the memcpy code that I wrote below for others who may run into this same issue. Please note that there is no error checking in this code; you may need to add it for your application.

#define NUM_CPY_THREADS 4

HANDLE hCopyThreads[NUM_CPY_THREADS] = {0};
HANDLE hCopyStartSemaphores[NUM_CPY_THREADS] = {0};
HANDLE hCopyStopSemaphores[NUM_CPY_THREADS] = {0};

typedef struct
{
    int ct;             /* thread index */
    void * src, * dest; /* this thread's slice of the copy */
    size_t size;        /* bytes in this slice */
} mt_cpy_t;

mt_cpy_t mtParams[NUM_CPY_THREADS] = {0};

/* Worker: wait for a start signal, copy this thread's slice, signal done. */
DWORD WINAPI thread_copy_proc(LPVOID param)
{
    mt_cpy_t * p = (mt_cpy_t *) param;

    while(1)
    {
        WaitForSingleObject(hCopyStartSemaphores[p->ct], INFINITE);
        memcpy(p->dest, p->src, p->size);
        ReleaseSemaphore(hCopyStopSemaphores[p->ct], 1, NULL);
    }

    return 0;
}

int startCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        hCopyStartSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        hCopyStopSemaphores[ctr] = CreateSemaphore(NULL, 0, 1, NULL);
        mtParams[ctr].ct = ctr;
        hCopyThreads[ctr] = CreateThread(0, 0, thread_copy_proc, &mtParams[ctr], 0, NULL);
    }

    return 0;
}

void * mt_memcpy(void * dest, void * src, size_t bytes)
{
    //divide the copy into NUM_CPY_THREADS contiguous slices
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        mtParams[ctr].dest = (char *) dest + ctr * bytes / NUM_CPY_THREADS;
        mtParams[ctr].src = (char *) src + ctr * bytes / NUM_CPY_THREADS;
        mtParams[ctr].size = (ctr + 1) * bytes / NUM_CPY_THREADS - ctr * bytes / NUM_CPY_THREADS;
    }

    //release semaphores to start computation
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
        ReleaseSemaphore(hCopyStartSemaphores[ctr], 1, NULL);

    //wait for all threads to finish
    WaitForMultipleObjects(NUM_CPY_THREADS, hCopyStopSemaphores, TRUE, INFINITE);

    return dest;
}

int stopCopyThreads()
{
    for(int ctr = 0; ctr < NUM_CPY_THREADS; ctr++)
    {
        //TerminateThread is abrupt, but acceptable here since workers hold no resources
        TerminateThread(hCopyThreads[ctr], 0);
        CloseHandle(hCopyThreads[ctr]);
        CloseHandle(hCopyStartSemaphores[ctr]);
        CloseHandle(hCopyStopSemaphores[ctr]);
    }
    return 0;
}
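For non-Windows readers, the same split-the-range scheme can be sketched with POSIX threads. This simplified version (my own illustration, with hypothetical names; it spawns threads per call instead of reusing a persistent pool, so it carries more per-call overhead than the Windows version above):

```c
#include <pthread.h>
#include <string.h>
#include <stddef.h>

#define MT_THREADS 4

typedef struct { void *dest; const void *src; size_t size; } span_t;

/* Worker: copy one contiguous slice of the buffer. */
static void *copy_worker(void *arg)
{
    span_t *s = (span_t *) arg;
    memcpy(s->dest, s->src, s->size);
    return NULL;
}

/* Split [src, src+bytes) into MT_THREADS contiguous slices and
   copy them in parallel. Simplified: no thread pool, no error checks. */
void *mt_memcpy_posix(void *dest, const void *src, size_t bytes)
{
    pthread_t tid[MT_THREADS];
    span_t spans[MT_THREADS];

    for (int i = 0; i < MT_THREADS; i++) {
        size_t begin = (size_t) i * bytes / MT_THREADS;
        size_t end   = (size_t) (i + 1) * bytes / MT_THREADS;
        spans[i].dest = (char *) dest + begin;
        spans[i].src  = (const char *) src + begin;
        spans[i].size = end - begin;
        pthread_create(&tid[i], NULL, copy_worker, &spans[i]);
    }

    for (int i = 0; i < MT_THREADS; i++)
        pthread_join(tid[i], NULL);

    return dest;
}
```

On a NUMA machine like the one in the question, pinning each worker near a different memory controller is what makes this scaling possible; a production version would also keep the threads alive between calls, as the Windows code does.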
