我们如何使用循环展开和代码预加载为C ++程序编写优化的ARM代码? [英] How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

查看:69
本文介绍了我们如何使用循环展开和代码预加载为C ++程序编写优化的ARM代码?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下是相同程序的给定c ++和ARM代码.您能告诉我这个ARM代码是否经过优化以及循环需要多少个代码(数组n的大小很大,并且是64个元素的倍数,并且具有8位掩码和按位异或运算,并且产生输出数组outArr.)?我该如何使用循环展开(一次处理4个元素)来优化代码?

Below are the given c++ and ARM code for same program. Can you tell me this ARM code is optimized or not and how many does the loop requires(The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.)? What should I do to optimize the code using loop unrolling (process 4 elements at a time)?

c ++代码:

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
for(int i=0; i<n; i++)
 *outArr++ = *inArr++ ^ k; // ^ is bitwise xor
}

ARM代码:

invert:
        cmp     r3, #0
        bxle    lr
        add     ip, r0, r3
.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3
        bx      lr

推荐答案

我不知道代码预加载"是什么意思.使用 pld 指令进行数据预加载.在示例代码的上下文中,这将是有意义的.

I have no idea what 'code preload' means. There is data preloading with the pld instruction. It would make sense in the context of the sample code.

在给出假设的情况下,这是基本的"C"版本,

Here is the basic 'C' version given the assumptions,

数组n的大小很大,它是64个元素的倍数,并使用8位掩码进行异或运算,并产生输出数组outArr.

The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.

该代码可能并不完美,但只是为了说明.

The code is probably not perfect, but meant to illustrate.

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
  unsigned int *out = (void*)outArr;
  unsigned int *in  = (void*)inArr;
  unsigned int mask = k<<24|k<<16|k<<8|k;

  /* Check arguments */
  if( n % 64 != 0) return;
  if((int)outArr & 3) return;
  if((int)inArr & 3) return;
  assert(sizeof(int)==4);

  for(int i=0; i<n/sizeof(int); i+=64/(sizeof(int)) {
   /* 16 transfers per loop 64/4 */
   *out++ = *in++ ^ k; // 1
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 5
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 9
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 13
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
  }
}

您可以查看 Godbolt上的输出.

ldm stm 指令可用于将连续的内存地址加载到寄存器.我们不能全部使用16个ARM寄存器,所以汇编器中循环的核心看起来像这样,

The ldm and stm instructions can be used to load consecutive memory addresses to registers. We can not use all 16 ARM registers, so the core of the loop in assembler would look like this,

  ldmia  [r1], {r4,r5,r6,r7,r8,r9,r10,r11}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  eor    r8,r8,r2
  eor    r9,r9,r2
  eor    r10,r10,r2
  eor    r11,r11,r2
  stmia  [r0], {r4,r5,r6,r7,r8,r9,r10,r11}  # r0 is outArr

此操作重复两次,并且可以对照 R3 中存储的数组限制来检查 R0 R1 .如果要符合EABI,则需要保存所有被调用方保存的寄存器.通常可以使用寄存器集 r4 - r11 ,但这取决于系统.如果保存它们,并且它们也不是异常安全的,则还可以使用 lr fp 等.

This is repeated twice and the R0 or R1 can be checked against the array limits stored in R3. You need to save all of the callee saved registers if you want to be EABI compliant. The register set r4-r11 can generally be used, but it will depend on the system. You can also use lr, fp, etc if you save them and are not exception safe.

从评论中

我试图发现该子程序每个执行多少个周期优化和不优化时的数组元素.

I am trying to find that how many cycles does this subroutine take per array element when it is optimized and when it isn't.

在现代CPU上,周期计数非常困难.但是,您可以通过一个简单的循环在核心中获得五个指令,

Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,

.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3

要执行32个字节,这是32 * 5(160)指令.具有32 * 2的内存访问.

To do 32 bytes, this is 32 * 5 (160) instructions. With 32 * 2 memory accesses.

扩展的选项仅仅是一个32字节的内存读写.这些将完成,首先提供最低的值.然后只需一条 EOR 指令.因此,它只有10条指令,而只有160条指令.在现代处理器上,内存将成为限制因素.由于内存停滞,最好一次只处理四个字,例如,

The expanded options is just one 32byte memory read and write. These will complete, with the lowest value available first. Then just a single EOR instruction. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it maybe better to only process four words at a time such as,

  ldmia  [r1], {r4,r5,r6,r7}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  ldmia  [r1], {r8,r9,r10,r11}  # r1 is inArr
  stmia  [r0], {r4,r5,r6,r7}    # r0 is outArr
  ...

此操作(或某种排列)将使加载/存储单元和"eor"不会相互阻塞,但这取决于特定的CPU类型.这个主题称为指令调度.它比 pld 或数据预加载功能更强大.同样,您可以使用NEON或ARM64指令,以便循环主体可以在加载/存储之前执行更多的 eor 操作.

This (or some permutation) will allow the load/store unit and the 'eor' to not block each other, but this will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor operations before a load/store.

这篇关于我们如何使用循环展开和代码预加载为C ++程序编写优化的ARM代码?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆