Is accessing statically or dynamically allocated memory faster?
Question
There are two ways of allocating a global array in C:
statically:

char data[65536];
dynamically:

char *data;
…
data = (char*)malloc(65536); /* or whatever size */
The question is: which method has better performance, and by how much?
As I understand it, the first method should be faster.
Because with the second method, to access the array you have to dereference the element's address each time it is accessed, like this:
- read the variable data, which contains the pointer to the beginning of the array
- calculate the offset to the specific element
- access the element
With the first method, the compiler hard-codes the address of the data variable into the code, so the first step is skipped and we have:
- calculate the offset to the specific element from the fixed address defined at compile time
- access the array element
Each memory access is equivalent to about 40 CPU clock cycles, so using dynamic allocation, especially for infrequent reads, can cause a significant performance decrease vs. static allocation, because the data variable may be purged from the cache by some more frequently accessed variable. In contrast, the cost of dereferencing a statically allocated global variable is zero, because its address is already hard-coded in the code.
Is this correct?
Answer
One should always benchmark to be sure. But, ignoring the effects of cache for the moment, the efficiency can depend on how sporadically you access the two. Herein, consider char data_s[65536] and char *data_p = malloc(65536).
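One way to check is a small micro-benchmark. The sketch below is illustrative only (fill_static, fill_dynamic, and bench are names invented here, not from the answer); note that an optimizing compiler may turn these loops into memset calls or hoist work out of them, so inspect the generated code before trusting the numbers:

```c
#include <stdlib.h>
#include <time.h>

#define N 65536

char data_s[N];                 /* statically allocated */
char *data_p;                   /* dynamically allocated (set in main/test) */

/* fill the static array */
static void fill_static(char val)
{
    for (int i = 0; i < N; ++i)
        data_s[i] = val;
}

/* fill the dynamic array */
static void fill_dynamic(char val)
{
    for (int i = 0; i < N; ++i)
        data_p[i] = val;
}

/* time `reps` calls of fn(val); returns elapsed seconds */
static double bench(void (*fn)(char), char val, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        fn(val);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling through a function pointer adds its own small overhead to both variants equally; the point is only to compare the two fills under identical conditions.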
If the access is sporadic, the static/global will be slightly faster:
// slower because we must fetch data_p and then store
void
datasetp(int idx, char val)
{
    data_p[idx] = val;
}

// faster because we can store directly
void
datasets(int idx, char val)
{
    data_s[idx] = val;
}
Now, if we consider caching, datasetp and datasets will be about the same [after the first access], because the fetch of data_p will be fulfilled from cache [no guarantee, but likely], so the time difference is much less.
However, when accessing the data in a tight loop, they will be about the same, because the compiler [optimizer] will prefetch data_p at the start of the loop and put it in a register:
void
datasetalls(char val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data_s[idx] = val;
}

void
datasetallp(char val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data_p[idx] = val;
}

void
datasetallp_optimized(char val)
{
    int idx;
    register char *reg;

    // the optimizer will generate the equivalent of this code
    reg = data_p;
    for (idx = 0; idx < 65536; ++idx)
        reg[idx] = val;
}
If the access is so sporadic that data_p gets evicted from the cache, then the performance difference doesn't matter so much, because access to [either] array is infrequent. Thus, it is not a target for code tuning.
If such eviction occurs, the actual data array will, most likely, be evicted as well.
A much larger array might have more of an effect (e.g. if instead of 65536 we had 100000000, then mere traversal would evict data_p, and by the time we reached the end of the array, the leftmost entries would already have been evicted).
But, in that case, the extra fetch of data_p would be 0.000001% overhead.
So, it helps to either benchmark [or model] the particular use case/access pattern.
Update:
Based on some further experimentation [triggered by a comment by Peter], the datasetallp function does not optimize to the equivalent of datasetallp_optimized under certain conditions, due to strict aliasing considerations.
Because the arrays are char [or unsigned char], the compiler generates a data_p fetch on each loop iteration. Note that if the arrays are not char (e.g. int), the optimization does occur and data_p is fetched only once, because char can alias anything but int is more limited.
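To illustrate, here is a hypothetical int counterpart (idata_p and datasetallp_int are names invented here, not from the answer). Under strict aliasing, a store through int * cannot legally modify a pointer object, so the compiler is free to load idata_p once and keep it in a register:

```c
#include <stdlib.h>

int *idata_p;                  /* dynamically allocated int array */

/* int counterpart of datasetallp: an int store cannot alias the
   pointer object idata_p itself, so the optimizer may hoist the
   fetch of idata_p out of the loop (as in datasetallp_optimized) */
void
datasetallp_int(int val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        idata_p[idx] = val;
}
```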
If we change char *data_p to char *restrict data_p, we get the optimized behavior. Adding restrict tells the compiler that data_p will not alias anything [even itself], so it's safe to optimize the fetch.
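A minimal sketch of the restrict-qualified variant (the function name datasetallp_r is illustrative; only the declaration of data_p changes):

```c
#include <stdlib.h>

/* restrict promises the compiler that stores through data_p cannot
   modify the pointer object data_p itself */
char *restrict data_p;

/* with restrict, the fetch of data_p may be hoisted out of the loop,
   matching the hand-optimized datasetallp_optimized */
void
datasetallp_r(char val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data_p[idx] = val;
}
```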
Personal note: While I understand this, to me it seems goofy that, without restrict, the compiler must assume that data_p can alias back to itself. Although I'm sure there are other [equally contrived] examples, the only ones I can think of are data_p pointing to itself, or data_p being part of a struct:
// simplest
char *data_p = malloc(65536);
data_p = (void *) &data_p;

// closer to real world
struct mystruct {
    ...
    char *data_p;
    ...
};
struct mystruct mystruct;
mystruct.data_p = (void *) &mystruct;
These would be cases where the fetch optimization would be wrong. But, IMO, these are distinguishable from the simple case we've been dealing with. At least, the struct version is. And, if a programmer should do the first one, IMO, they get what they deserve [and the compiler should be allowed the fetch optimization].
For myself, I always hand-code the equivalent of datasetallp_optimized [sans register], so I usually don't see the multifetch "problem" [if you will] too much. I've always believed in "giving the compiler a helpful hint" as to my intent, so I just do this axiomatically. It tells the compiler and another programmer that the intent is "fetch data_p only once".
Also, the multifetch problem does not occur when using data_p for input [because we're not modifying anything, so aliasing is not a consideration]:
// does the fetch of data_p once at loop start
int
datasumallp(void)
{
    int idx;
    int sum;

    sum = 0;
    for (idx = 0; idx < 65536; ++idx)
        sum += data_p[idx];

    return sum;
}
But, while it can be fairly common, "hardwiring" a primitive array manipulation function to an explicit array [either data_s or data_p] is often less useful than passing the array address as an argument.
Side note: clang would optimize some of the functions using data_s into memset calls, so, during experimentation, I modified the example code slightly to prevent this.
// assumes a typedef for the element type, e.g.: typedef char array_t;
void
dataincallx(array_t *data, int val)
{
    int idx;

    for (idx = 0; idx < 65536; ++idx)
        data[idx] = val + idx;
}
This does not suffer from the multifetch problem. That is, dataincallx(data_s, 17) and dataincallx(data_p, 37) work about the same [with an initial extra data_p fetch]. This is more likely to be what one would use in general [better code reuse, etc.].
So, the distinction between data_s and data_p becomes a bit more of a moot point. Coupled with judicious use of restrict [or using types other than char], the data_p fetch overhead can be minimized to the point where it isn't really that much of a consideration.
It now comes down more to the architectural/design choice of using a fixed-size array or dynamically allocating one. Others have pointed out the tradeoffs.
It is use-case dependent.
If we had a limited number of array functions, but a large number of different arrays, passing the array address to the functions is a clear winner.
However, if we had a large number of array manipulation functions and [say] one array (e.g. the [2D] array is a game board or grid), it might be better for each function to reference the global [either data_s or data_p] directly.