Is accessing statically or dynamically allocated memory faster?


Question

There are two ways of allocating a global array in C:


  1. statically

    char data[65536];


  2. dynamically

    char *data;
    …
    data = (char*)malloc(65536);  /* or whatever size */
    


    The question is, which method has better performance? And by how much?

    As I understand it, the first method should be faster.

    Because with the second method, to access the array you have to dereference the element's address each time it is accessed, like this:


    1. read the variable data which contains the pointer to the beginning of the array
    2. calculate the offset to the specific element
    3. access the element

    With the first method, the compiler hard-codes the address of the data variable into the code, so the first step is skipped and we have:

    1. calculate the offset to the specific element from the fixed, compile-time-known address
    2. access the array element
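
    In other words (a rough sketch of the extra step, not actual generated code; data_s, data_p, and the function names here are only illustrative):

    char data_s[65536];   /* static: address fixed at link time */
    char *data_p;         /* dynamic: address held in a pointer variable */

    char read_static(int idx)
    {
        return data_s[idx];      /* offset from a known address, then access */
    }

    char read_dynamic(int idx)
    {
        char *base = data_p;     /* extra step: first load the pointer variable itself */
        return base[idx];        /* then compute the offset and access */
    }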

    Each memory access is equivalent to about 40 CPU clock cycles, so using dynamic allocation, especially for infrequent reads, can cause a significant performance decrease vs. static allocation, because the data variable may be purged from the cache by some more frequently accessed variable. In contrast, the cost of dereferencing a statically allocated global variable is 0, because its address is already hard-coded in the code.

    Is this correct?

    Answer

    One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536).

    If the access is sporadic, the static/global version will be slightly faster:

    // slower because we must fetch data_p and then store
    void
    datasetp(int idx,char val)
    {
        data_p[idx] = val;
    }

    // faster because we can store directly
    void
    datasets(int idx,char val)
    {
        data_s[idx] = val;
    }

    Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.

    However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:

    void
    datasetalls(char val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data_s[idx] = val;
    }
    
    void
    datasetallp(char val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data_p[idx] = val;
    }
    
    void
    datasetallp_optimized(char val)
    {
        int idx;
        register char *reg;
    
        // the optimizer will generate the equivalent code to this
        reg = data_p;
    
        for (idx = 0;  idx < 65536;  ++idx)
            reg[idx] = val;
    }
    


    If the access is so sporadic that data_p gets evicted from the cache, then the performance difference doesn't matter much, because access to [either] array is infrequent. Thus, not a target for code tuning.

    If such eviction occurs, the actual data array will, most likely, be evicted as well.

    A much larger array might have more of an effect (e.g. if instead of 65536 we had 100000000, then mere traversal would evict data_p, and by the time we reached the end of the array, the leftmost entries would already have been evicted).

    But, in that case, the extra fetch of data_p would be 0.000001% overhead.

    So, it helps to either benchmark [or model] the particular use case/access pattern.

    UPDATE:

    Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized for certain conditions, due to strict aliasing considerations.

    Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.
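
    As a rough illustration (int_p and datasetallip below are made-up names, not from the answer): storing an int through the pointer cannot legally modify an object of type int * such as int_p itself, so the compiler may hoist the load of int_p out of the loop, whereas the char stores in datasetallp could, in principle, alias data_p.

    int *int_p;   // hypothetical int analogue of data_p

    void
    datasetallip(int val)
    {
        int idx;

        // int stores cannot alias the int * object int_p (strict aliasing),
        // so the compiler may fetch int_p once and keep it in a register
        for (idx = 0;  idx < 65536;  ++idx)
            int_p[idx] = val;
    }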

    If we change char *data_p to char *restrict data_p we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
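
    For reference, a minimal sketch of that change (datasetallp_restrict is just an illustrative name; it is datasetallp with the modified declaration):

    char *restrict data_p;   // restrict: data_p is promised not to alias anything

    void
    datasetallp_restrict(char val)
    {
        int idx;

        // with restrict, the compiler may load data_p once before the loop,
        // as in datasetallp_optimized, even though the element type is char
        for (idx = 0;  idx < 65536;  ++idx)
            data_p[idx] = val;
    }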

    Personal note: While I understand this, to me, it seems goofy that without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself or that data_p is part of a struct:

    // simplest
    char *data_p = malloc(65536);
    data_p = (void *) &data_p;
    
    // closer to real world
    struct mystruct {
        ...
        char *data_p;
        ...
    };
    struct mystruct mystruct;
    mystruct.data_p = (void *) &mystruct;
    

    These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should allow fetch optimization].

    For myself, I always hand code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".
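
    (That hand-coded form is simply datasetallp_optimized without the register keyword, sketched here under an illustrative name:)

    void
    datasetallp_handcoded(char val)
    {
        char *p = data_p;   // fetch data_p exactly once, by hand
        int idx;

        for (idx = 0;  idx < 65536;  ++idx)
            p[idx] = val;
    }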

    Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, aliasing is not a consideration]:

    // does fetch of data_p once at loop start
    int
    datasumallp(void)
    {
        int idx;
        int sum;
    
        sum = 0;
        for (idx = 0;  idx < 65536;  ++idx)
            sum += data_p[idx];
    
        return sum;
    }
    


    But, while it can be fairly common, "hardwiring" a primitive array manipulation function with an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.

    Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.

    void
    dataincallx(array_t *data,int val)
    {
        int idx;
    
        for (idx = 0;  idx < 65536;  ++idx)
            data[idx] = val + idx;
    }
    

    This does not suffer from the multifetch problem. That is, dataincallx(data_s,17) and dataincallx(data_p,37) work about the same [with the initial extra data_p fetch]. This is more likely to be what one might use in general [better code reuse, etc].

    So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.

    It now comes down more to the architectural/design choice between a fixed-size array and dynamically allocating one. Others have pointed out the tradeoffs.

    That is use-case dependent.

    If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.

    However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better that each function references the global [either data_s or data_p] directly.
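
    A minimal sketch of that second style (all names here are illustrative, not from the answer): many small manipulation functions operating on one global board, rather than taking the array's address as a parameter.

    // one global [2D] board, referenced directly by every board function
    #define ROWS 8
    #define COLS 8

    char board[ROWS][COLS];

    void
    board_clear(void)
    {
        int r, c;

        for (r = 0; r < ROWS; ++r)
            for (c = 0; c < COLS; ++c)
                board[r][c] = 0;
    }

    void
    board_set(int r, int c, char piece)
    {
        board[r][c] = piece;
    }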
