Small vs. identical types of loop variables in C/C++ for performance

Question

Say I have a large nested loop of the form

long long i, j, k, i_end, j_end, j_bgn;  /* the type of j_bgn is what the question is about */
...
for (i = 0; i < i_end; i++) {
  j_bgn = get_j_bgn(i);
  for (j = j_bgn; j < j_end; j++) {
    ...
  }
}

with some large i_end and j_end, say i_end = j_end = 10000000000. If I know that j_bgn is always small, perhaps even always either 0 or 1, is it beneficial performance-wise to use a smaller type for it, like signed char j_bgn? Or does this come with a recurring cost due to the implicit conversion to long long each time we begin a new j loop?

I guess this has a pretty minor effect, but I would like to know the "proper"/pedantic way of doing this: either 1) keep all loop variables of the same type (and use the smallest type that can hold the largest integer needed), or 2) choose the type of each loop variable independently to be as small as possible.

From the comments/answers I see I need to supply further information:

  • I sometimes want and sometimes do not want to use these variables (e.g. j) for indexing. Why is this relevant (as long as I make sure to use types large enough to cover my available memory)?
  • In my actual code I use something like size_t (or ssize_t) for e.g. j, j_end. On modern hardware this is 64 bit.

I take it that using types smaller than 32 bit is not worthwhile, but is it still perhaps beneficial to use a 32 bit type for j_bgn rather than also using a 64 bit type (as I really do need for j and j_end)?
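As a minimal sketch of option 2 from the question, the following assumes a hypothetical get_j_bgn_narrow() that returns 0 or 1 as a signed char (the real get_j_bgn is not shown in the question). The initializer of a for loop is evaluated once each time the loop is entered, so the conversion from signed char to long long happens once per outer iteration, not once per inner iteration.

```c
/* Hypothetical stand-in for the question's get_j_bgn(): returns 0 or 1. */
static signed char get_j_bgn_narrow(long long i)
{
    return (signed char)(i & 1);
}

/* Counts how many times the inner loop body runs. */
long long count_inner_iterations(long long i_end, long long j_end)
{
    long long total = 0;
    for (long long i = 0; i < i_end; i++) {
        signed char j_bgn = get_j_bgn_narrow(i);    /* narrow start value */
        for (long long j = j_bgn; j < j_end; j++) { /* j_bgn is widened here, once per entry */
            total++;
        }
    }
    return total;
}
```

With i_end = 4 and j_end = 10 the start values are 0, 1, 0, 1, so the body runs 10 + 9 + 10 + 9 = 38 times. Whether the single widening per outer iteration costs anything measurable depends on the target and the compiler.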

Recommended answer

Many platforms require some additional operations if the integers are wider or narrower than the registers. (Most 64-bit platforms can handle 32-bit integers as efficiently as 64-bit ones, though.)
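One possible way to let the implementation pick a register-friendly width, rather than hard-coding int or long long, is the int_fastN_t family from the standard <stdint.h> header. This is only a sketch: the function names sum_below and fast32_bits are made up for illustration, and the width actually chosen for int_fast32_t varies between platforms.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Sums 0..n-1 with a counter whose type is at least 32 bits wide and is
   whatever the implementation considers fast for that minimum width. */
uint_fast64_t sum_below(int_fast32_t n)
{
    uint_fast64_t sum = 0;
    for (int_fast32_t i = 0; i < n; i++) {
        sum += (uint_fast64_t)i;
    }
    return sum;
}

/* Reports the width in bits that this platform chose for int_fast32_t. */
size_t fast32_bits(void)
{
    return sizeof(int_fast32_t) * CHAR_BIT;
}
```

The standard only guarantees that int_fast32_t has at least 32 bits, so fast32_bits() may report 32 on one platform and 64 on another.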

Example (with empty asm statements to stop the loops optimizing away):

void lfoo(long long int loops)
{
    for(long long int i = 0; i < loops; i++) asm("");
}

void foo(int loops)
{
    for(int i = 0; i < loops; i++) asm("");
}

void bar(short int loops)
{
    for(short int i = 0; i < loops; i++) asm("");
}

void zoo(char loops)
{
    for(char i = 0; i < loops; i++) asm("");
}

and the resulting code for old 32-bit ARM Cortex processors, without the ARMv6 sign-extension instructions that make short slightly less bad (Godbolt Compiler Explorer, gcc 8.2 with default options, -O3 without -march= or -mcpu=cortex-...):

lfoo:
        cmp     r0, #1
        sbcs    r3, r1, #0
        bxlt    lr
        mov     r2, #0
        mov     r3, #0
.L3:
        adds    r2, r2, #1
        adc     r3, r3, #0        @@ long long takes 2 registers, obviously bad
        cmp     r1, r3
        cmpeq   r0, r2            @@ and also to compare
        bne     .L3
        bx      lr

foo:
        cmp     r0, #0
        bxle    lr                @ return if loops <= 0 (predicated instruction)
        mov     r3, #0            @ i = 0
.L8:                              @ do {
        add     r3, r3, #1          @ i++  (32-bit)
        cmp     r0, r3             
        bne     .L8               @ } while(loops != i);
        bx      lr                @ return

bar:
        cmp     r0, #0
        bxle    lr
        mov     r2, #0
.L12:                            @ do {
        add     r2, r2, #1          @ i++ (32-bit)
        lsl     r3, r2, #16         @ i <<= 16
        asr     r3, r3, #16         @ i >>= 16  (sign extend i from 16 to 32)
        cmp     r0, r3
        bgt     .L12             @ }while(loops > i)
        bx      lr
                @@ gcc -mcpu=cortex-a15 for example uses
                @@  sxth    r2, r3

zoo:
        cmp     r0, #0
        bxeq    lr
        mov     r3, #0
.L16:
        add     r3, r3, #1
        and     r2, r3, #255     @ truncation to unsigned char is cheap
        cmp     r0, r2           @ but not free
        bhi     .L16
        bx      lr

As you can see, 32-bit integers (function foo) are the most efficient here because they have the same size as the processor's registers.
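The extra instructions in bar and zoo exist because a narrow counter has to behave as if it really were 16 or 8 bits wide, even though it lives in a 32-bit register. The sketch below illustrates that requirement in C; the function names are made up for illustration, and it assumes an 8-bit unsigned char (CHAR_BIT == 8).

```c
/* Increments an 8-bit counter `steps` times and returns its final value.
   Each c++ is computed in int and then converted back to unsigned char,
   which reduces the value modulo 256. */
unsigned int narrow_counter_after(unsigned int steps)
{
    unsigned char c = 0;
    for (unsigned int n = 0; n < steps; n++) {
        c++;
    }
    return c;
}

/* Same loop with a register-sized counter: no truncation step is needed. */
unsigned int wide_counter_after(unsigned int steps)
{
    unsigned int c = 0;
    for (unsigned int n = 0; n < steps; n++) {
        c++;
    }
    return c;
}
```

After 300 increments the narrow counter holds 300 - 256 = 44 while the wide one holds 300. Preserving that wrap-around is what the and with #255 in zoo does, and the lsl/asr pair in bar plays the same role for the signed 16-bit case.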
