Faster way to zero memory than with memset?


Question

I learned that memset(ptr, 0, nbytes) is really fast, but is there a faster way (at least on x86)?

I assume that memset uses mov; however, when zeroing memory most compilers use xor because it's faster, correct? edit1: Wrong. As GregS pointed out, that only works with registers. What was I thinking?

Also, I asked someone who knows more about assembler than I do to look at the stdlib, and he told me that on x86 memset does not take full advantage of the 32-bit-wide registers. However, I was very tired at the time, so I'm not sure I understood him correctly.

edit2: I revisited this issue and did a little testing. Here is what I tested:

    #include <stdio.h>
    #include <stdlib.h>   /* malloc, free (standard header; replaces <malloc.h>) */
    #include <string.h>
    #include <sys/time.h>

    #define TIME(body) do {                                                     \
        struct timeval t1, t2; double elapsed;                                  \
        gettimeofday(&t1, NULL);                                                \
        body                                                                    \
        gettimeofday(&t2, NULL);                                                \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0 + (t2.tv_usec - t1.tv_usec) / 1000.0; \
        printf("%s\n --- %f ---\n", #body, elapsed); } while(0)


    #define SIZE 0x1000000

    void zero_1(void* buff, size_t size)
    {
        size_t i;
        char* foo = buff;
        for (i = 0; i < size; i++)
            foo[i] = 0;

    }

    /* I foolishly assume size_t has register width */
    void zero_sizet(void* buff, size_t size)
    {
        size_t i;
        char* bar;
        size_t* foo = buff;
        for (i = 0; i < size / sizeof(size_t); i++)
            foo[i] = 0;

        // fixes bug pointed out by tristopia
        bar = (char*)buff + size - size % sizeof(size_t);
        for (i = 0; i < size % sizeof(size_t); i++)
            bar[i] = 0;
    }

    int main(void)
    {
        char* buffer = malloc(SIZE);
        if (buffer == NULL)   /* allocation can fail */
            return 1;
        TIME(
            memset(buffer, 0, SIZE);
        );
        TIME(
            zero_1(buffer, SIZE);
        );
        TIME(
            zero_sizet(buffer, SIZE);
        );
        free(buffer);
        return 0;
    }

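Results:

Earlier results, since retracted: zero_1 was the slowest, except at -O3. zero_sizet was the fastest, with roughly equal performance across -O1, -O2 and -O3. memset was always slower than zero_sizet (twice as slow at -O3). One thing of interest is that at -O3 zero_1 was just as fast as zero_sizet. However, the disassembled function had roughly four times as many instructions (caused by loop unrolling, I think). I also tried optimizing zero_sizet further, but the compiler always outdid me. No surprise there.

For now memset wins; the previous results were distorted by the CPU cache. (All tests were run on Linux.) Further testing is needed. I'll try assembler next :)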
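One way to reduce warm-up effects like the one mentioned above is to touch the buffer once before timing and then keep the best of several runs. The following is only a sketch under stated assumptions: `best_time_ms`, `zero_fn` and `zero_memset` are hypothetical names that do not appear in the original test, and `clock_gettime` with `CLOCK_MONOTONIC` assumes a POSIX system such as the Linux machine used here.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical common signature for the routines under test. */
typedef void (*zero_fn)(void *, size_t);

/* Thin wrapper so memset fits the signature above. */
void zero_memset(void *buff, size_t size)
{
    memset(buff, 0, size);
}

/* Runs fn once untimed (so first-touch and cold-cache costs are paid up
 * front), then returns the fastest of `runs` timed calls in milliseconds. */
double best_time_ms(zero_fn fn, void *buff, size_t size, int runs)
{
    double best = -1.0;
    assert(runs > 0);

    fn(buff, size);                       /* warm-up, not timed */

    for (int r = 0; r < runs; r++) {
        struct timespec t1, t2;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        fn(buff, size);
        clock_gettime(CLOCK_MONOTONIC, &t2);
        double ms = (t2.tv_sec - t1.tv_sec) * 1000.0
                  + (t2.tv_nsec - t1.tv_nsec) / 1e6;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

It could be called as `best_time_ms(zero_memset, buffer, SIZE, 10)`, and likewise for zero_1 and zero_sizet. Keep in mind that an optimizing compiler may drop stores whose result is never read, so the buffer should be inspected after timing.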

edit3: Fixed a bug in the test code; the test results are not affected.

edit4: While poking around the disassembled VS2010 C runtime, I noticed that memset has an SSE-optimized routine for zero. It will be hard to beat this.

Recommended answer

x86 covers a rather broad range of devices.

For a totally generic x86 target, an assembly block with "rep stosd" (with EAX set to zero) could blast zeros out to memory 32 bits at a time. Try to make sure the bulk of this work is DWORD aligned.
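A minimal sketch of the string-instruction approach, assuming GCC or Clang inline assembly on x86 or x86-64. The string instruction that writes a register to memory is stos (movs copies memory to memory), so the sketch uses `rep stosl`, the AT&T spelling of rep stosd. The name `zero_rep_stos` is hypothetical, and on other compilers or architectures the sketch simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: zero `size` bytes with x86 string stores. */
void zero_rep_stos(void *buff, size_t size)
{
    assert(buff != NULL || size == 0);
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *dst = buff;
    size_t dwords = size / 4;   /* number of 32-bit stores */
    size_t tail = size % 4;     /* leftover 0..3 bytes */

    /* rep stosl: store EAX at [EDI/RDI], ECX/RCX times, advancing the
     * destination by 4 each time. The calling convention guarantees the
     * direction flag is clear on function entry. */
    __asm__ volatile("rep stosl"
                     : "+D"(dst), "+c"(dwords)
                     : "a"(0)
                     : "memory");

    /* rep stosb finishes the tail; dst already points just past the dwords. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(tail)
                     : "a"(0)
                     : "memory");
#else
    memset(buff, 0, size);      /* portable fallback */
#endif
}
```

The sketch does not align the destination first. A version following the alignment advice above would store single bytes until the pointer is a multiple of 4 before issuing the dword stores.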

For chips with MMX, an assembly loop with movq could hit 64 bits at a time.

You might be able to get a C/C++ compiler to emit a 64-bit write through a pointer to long long or __m64. The target must be 8-byte aligned for the best performance.
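The 64-bit pointer idea might look roughly like the sketch below. It uses `uint64_t` from `<stdint.h>` rather than long long, and the name `zero_u64` is hypothetical. It also assumes the buffer may legally be written through a `uint64_t` pointer (for example, memory obtained from malloc). Whether the compiler emits a single 64-bit store per iteration depends on the target and the optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: zero `size` bytes using aligned 64-bit stores
 * for the bulk of the buffer. */
void zero_u64(void *buff, size_t size)
{
    unsigned char *p = buff;
    assert(buff != NULL || size == 0);

    /* Head: byte stores until p reaches an 8-byte boundary. */
    while (size > 0 && ((uintptr_t)p % 8) != 0) {
        *p++ = 0;
        size--;
    }

    /* Bulk: one 64-bit store per iteration on aligned addresses. */
    uint64_t *q = (uint64_t *)(void *)p;
    size_t words = size / 8;
    for (size_t i = 0; i < words; i++)
        q[i] = 0;

    /* Tail: the remaining 0..7 bytes. */
    p += words * 8;
    size -= words * 8;
    for (size_t i = 0; i < size; i++)
        p[i] = 0;
}
```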

For chips with SSE, movaps is fast, but it requires a 16-byte-aligned address, so use stosb until the pointer is aligned and then finish the clear with a loop of movaps.
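The align-then-store pattern can be sketched in C with SSE intrinsics instead of hand-written assembly. This is an illustration under assumptions: `zero_sse` is a hypothetical name, the `__SSE__` macro check assumes GCC or Clang, and `_mm_store_ps` is the aligned-store intrinsic that typically compiles to movaps. Without SSE the sketch degrades to a plain memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: zero `size` bytes, using aligned 16-byte SSE
 * stores for the bulk when SSE is available. */
void zero_sse(void *buff, size_t size)
{
    unsigned char *p = buff;
    assert(buff != NULL || size == 0);

#if defined(__SSE__)
    /* Head: byte stores until p is 16-byte aligned. */
    while (size > 0 && ((uintptr_t)p % 16) != 0) {
        *p++ = 0;
        size--;
    }

    /* Bulk: aligned 16-byte stores. _mm_store_ps requires an aligned
     * address, which the loop above has established. */
    const __m128 zero = _mm_setzero_ps();
    while (size >= 16) {
        _mm_store_ps((float *)(void *)p, zero);
        p += 16;
        size -= 16;
    }
#endif

    /* Tail (or the whole buffer when SSE is unavailable). */
    if (size > 0)
        memset(p, 0, size);
}
```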

Win32 has "ZeroMemory()", but I forget whether that's a macro for memset or an actual 'good' implementation.
