Small vs. identical types of loop variables in C/C++ for performance
Question
Say I have a large nested loop of the form
long long i, j, k, j_bgn, i_end, j_end;
...
for (i = 0; i < i_end; i++) {
    j_bgn = get_j_bgn(i);
    for (j = j_bgn; j < j_end; j++) {
        ...
    }
}
with some large i_end and j_end, say i_end = j_end = 10000000000. If I know that j_bgn is always small, perhaps even always either 0 or 1, is it beneficial performance-wise to use a smaller type for this, like signed char j_bgn? Or does this come with a recurring cost due to the implicit conversion to long long each time we begin a new j loop?
I guess this has a pretty minor effect, but I would like to know the "proper"/pedantic way of doing this: either 1) keep all loop variables of the same type (and use the smallest type that can hold the largest integer needed), or 2) choose the type of each loop variable independently to be as small as possible.
From the comments/answers I see I need to supply further information:
- I sometimes want and sometimes do not want to use these variables (e.g. j) for indexing. Why is this relevant (as long as I make sure to use types large enough to cover my available memory)?
- In my actual code I use something like size_t (or ssize_t) for e.g. j, j_end. On modern hardware this is 64 bit.
I take it that using types smaller than 32 bit is not worthwhile, but is it still perhaps beneficial to use a 32 bit type for j_bgn rather than also using a 64 bit type (as I really do need for j and j_end)?
Recommended answer
Many platforms require some additional operations if the integers are wider or smaller than the width of the registers. (Most 64-bit platforms can handle 32-bit integers as efficiently as 64-bit, though.)
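One way to avoid guessing which width suits the registers is to let the implementation choose: C99's <stdint.h> provides the int_fastN_t types, described by the standard as the types that are usually fastest among those with at least N bits. The sketch below only reports what a given toolchain picked. The helper names fast_type_bytes and print_fast_type_sizes are hypothetical, and the sizes differ between platforms, so no output is shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: size in bytes of the "fast" type with at least
   min_bits bits, or 0 for widths this sketch does not cover. */
size_t fast_type_bytes(int min_bits)
{
    switch (min_bits) {
    case 8:  return sizeof(int_fast8_t);
    case 16: return sizeof(int_fast16_t);
    case 32: return sizeof(int_fast32_t);
    case 64: return sizeof(int_fast64_t);
    default: return 0;
    }
}

/* Print the implementation's choices; the output is platform-dependent. */
void print_fast_type_sizes(void)
{
    static const int widths[] = {8, 16, 32, 64};
    for (size_t n = 0; n < sizeof widths / sizeof widths[0]; n++)
        printf("int_fast%d_t: %zu bytes\n", widths[n], fast_type_bytes(widths[n]));
}
```

Calling print_fast_type_sizes() on the target of interest shows whether the implementation considers, say, a 16-bit counter worth widening.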
Example (with empty asm statements to stop the loops from being optimized away):
void lfoo(long long int loops)
{
for(long long int i = 0; i < loops; i++) asm("");
}
void foo(int loops)
{
for(int i = 0; i < loops; i++) asm("");
}
void bar(short int loops)
{
for(short int i = 0; i < loops; i++) asm("");
}
void zoo(char loops)
{
for(char i = 0; i < loops; i++) asm("");
}
and the resulting code for old 32-bit ARM Cortex processors, without the ARMv6 sign-extension instructions that make short slightly less bad (Godbolt compiler explorer, gcc8.2 default options, -O3 without -march= or -mcpu=cortex-...):
lfoo:
cmp r0, #1
sbcs r3, r1, #0
bxlt lr
mov r2, #0
mov r3, #0
.L3:
adds r2, r2, #1
adc r3, r3, #0 @@ long long takes 2 registers, obviously bad
cmp r1, r3
cmpeq r0, r2 @@ and also to compare
bne .L3
bx lr
foo:
cmp r0, #0
bxle lr @ return if loops <= 0 (predicated return)
mov r3, #0 @ i = 0
.L8: @ do {
add r3, r3, #1 @ i++ (32-bit)
cmp r0, r3
bne .L8 @ } while(loops != i);
bx lr @ return
bar:
cmp r0, #0
bxle lr
mov r2, #0
.L12: @ do {
add r2, r2, #1 @ i++ (32-bit)
lsl r3, r2, #16 @ i <<= 16
asr r3, r3, #16 @ i >>= 16 (sign extend i from 16 to 32)
cmp r0, r3
bgt .L12 @ }while(loops > i)
bx lr
@@ gcc -mcpu=cortex-a15 for example uses
@@ sxth r2, r3
zoo:
cmp r0, #0
bxeq lr
mov r3, #0
.L16:
add r3, r3, #1
and r2, r3, #255 @ truncation to unsigned char is cheap
cmp r0, r2 @ but not free
bhi .L16
bx lr
As you can see, the most efficient are 32-bit integers, as they have the same size as the processor registers (function foo).
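To connect this back to the loop in the question, here is a minimal sketch, assuming get_j_bgn only ever returns 0 or 1 (the body shown is a made-up stand-in, and count_iterations is a hypothetical name). It illustrates where the conversion of a narrow j_bgn happens: in the inner loop's initializer, so once per outer iteration rather than once per inner iteration. It does not show that the narrow type saves anything; in the ARM listing above, the narrow counters cost extra instructions.

```c
/* Hypothetical stand-in for the question's get_j_bgn(): always 0 or 1. */
static signed char get_j_bgn(long long i)
{
    return (signed char)(i & 1);
}

/* Counts inner-loop iterations. j_bgn is deliberately narrow, while the
   counters i and j share the wide type needed for i_end and j_end. */
long long count_iterations(long long i_end, long long j_end)
{
    long long total = 0;
    for (long long i = 0; i < i_end; i++) {
        signed char j_bgn = get_j_bgn(i);
        /* j_bgn is widened to long long once here, per outer iteration;
           the inner loop's compare and increment only touch j and j_end. */
        for (long long j = j_bgn; j < j_end; j++)
            total++;
    }
    return total;
}
```

Keeping the counters i and j at the width their bounds require, and only the start value narrow, follows the answer's point that the per-iteration cost comes from counters narrower or wider than the registers.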