我们如何使用循环展开和代码预加载为 C++ 程序编写优化的 ARM 代码? [英] How can we write optimized ARM code for C++ program using loop unrolling and code pre-loading?

查看:27
本文介绍了我们如何使用循环展开和代码预加载为 C++ 程序编写优化的 ARM 代码?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

以下是相同程序的给定 c++ 和 ARM 代码.你能告诉我这段 ARM 代码是否优化过,循环需要多少个(数组 n 的大小很大,是 64 个元素的倍数,并使用 8 位掩码和产生一个输出数组 outArr.)?我应该如何使用循环展开来优化代码(一次处理 4 个元素)?

Below are the given c++ and ARM code for same program. Can you tell me this ARM code is optimized or not and how many does the loop requires(The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.)? What should I do to optimize the code using loop unrolling (process 4 elements at a time)?

c++ 代码:

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
for(int i=0; i<n; i++)
 *outArr++ = *inArr++ ^ k; // ^ is bitwise xor
}

手臂代码:

invert:
        cmp     r3, #0
        bxle    lr
        add     ip, r0, r3
.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3
        bx      lr

推荐答案

我不知道代码预加载"是什么意思.pld 指令进行数据预加载.这在示例代码的上下文中是有意义的.

I have no idea what 'code preload' means. There is data preloading with the pld instruction. It would make sense in the context of the sample code.

给出假设,这是基本的C"版本,

Here is the basic 'C' version given the assumptions,

数组 n 的大小很大,是 64 个元素的倍数,并与 8 位掩码进行异或按位运算,并产生输出数组 outArr.

The size of the array n is large, and is a multiple of 64 elements and exclusive-OR bit-wise operation with the 8-bit mask and produces an output array outArr.

代码可能并不完美,但只是为了说明.

The code is probably not perfect, but meant to illustrate.

// Gray scale image pixel inversion
void invert(unsigned char *outArr, unsigned char *inArr, 
 unsigned char k, int n)
{
  unsigned int *out = (void*)outArr;
  unsigned int *in  = (void*)inArr;
  unsigned int mask = k<<24|k<<16|k<<8|k;

  /* Check arguments */
  if( n % 64 != 0) return;
  if((int)outArr & 3) return;
  if((int)inArr & 3) return;
  assert(sizeof(int)==4);

  for(int i=0; i<n/sizeof(int); i+=64/(sizeof(int)) {
   /* 16 transfers per loop 64/4 */
   *out++ = *in++ ^ k; // 1
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 5
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 9
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k; // 13
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
   *out++ = *in++ ^ k;
  }
}

您可以在 Godbolt 上查看 输出.

You can view the output on godbolt.

ldmstm 指令可用于将连续的内存地址加载到寄存器中.我们不能使用所有的 16 个 ARM 寄存器,所以汇编程序中循环的核心是这样的,

The ldm and stm instructions can be used to load consecutive memory addresses to registers. We can not use all 16 ARM registers, so the core of the loop in assembler would look like this,

  ldmia  [r1], {r4,r5,r6,r7,r8,r9,r10,r11}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  eor    r8,r8,r2
  eor    r9,r9,r2
  eor    r10,r10,r2
  eor    r11,r11,r2
  stmia  [r0], {r4,r5,r6,r7,r8,r9,r10,r11}  # r0 is outArr

这被重复两次,R0R1 可以根据存储在 R3 中的数组限制进行检查.如果您想与 EABI 兼容,您需要保存所有被调用者保存的寄存器.寄存器组r4-r11 一般都可以使用,但要视系统而定.如果您保存它们并且不是异常安全的,您也可以使用 lrfp 等.

This is repeated twice and the R0 or R1 can be checked against the array limits stored in R3. You need to save all of the callee saved registers if you want to be EABI compliant. The register set r4-r11 can generally be used, but it will depend on the system. You can also use lr, fp, etc if you save them and are not exception safe.

来自评论,

我试图找出这个子程序每次需要多少个周期优化和未优化时的数组元素.

I am trying to find that how many cycles does this subroutine take per array element when it is optimized and when it isn't.

在现代 CPU 上循环计数极其困难.然而,核心中有五个指令,带有一个简单的循环,

Cycle counts are extremely difficult on modern CPUs. However you have five instructions in the core with a simple loop,

.L3:
        ldrb    r3, [r1], #1    @ zero_extendqisi2
        eor     r3, r3, r2
        strb    r3, [r0], #1
        cmp     ip, r0
        bne     .L3

要做32个字节,这是32*5(160)条指令.具有 32 * 2 次内存访问.

To do 32 bytes, this is 32 * 5 (160) instructions. With 32 * 2 memory accesses.

扩展选项只是一个 32 字节的内存读写.这些将完成,最低值首先可用.然后只需一个 EOR 指令.所以它只有 10 条指令而不是 160 条指令.在现代处理器上,内存将是限制因素.由于内存停滞,一次只处理四个字可能更好,例如,

The expanded options is just one 32byte memory read and write. These will complete, with the lowest value available first. Then just a single EOR instruction. So it is just 10 instructions versus 160. On modern processors the memory will be the limiting factor. Because of memory stalls, it maybe better to only process four words at a time such as,

  ldmia  [r1], {r4,r5,r6,r7}  # r1 is inArr
  eor    r4,r4,r2   # r2 is expanded k
  eor    r5,r5,r2
  eor    r6,r6,r2
  eor    r7,r7,r2
  ldmia  [r1], {r8,r9,r10,r11}  # r1 is inArr
  stmia  [r0], {r4,r5,r6,r7}    # r0 is outArr
  ...

这(或一些排列)将允许加载/存储单元和eor"不会相互阻塞,但这将取决于特定的 CPU 类型.这个话题叫做指令调度;它比 pld 或数据预加载更强大.同样,您可以使用 NEON 或 ARM64 指令,以便循环体可以在加载/存储之前执行更多 eor 操作.

This (or some permutation) will allow the load/store unit and the 'eor' to not block each other, but this will depend on the particular CPU type. This topic is called instruction scheduling; it is more powerful than pld or data preloading. As well, you can use NEON or ARM64 instructions so that the body of the loop can do more eor operations before a load/store.

这篇关于我们如何使用循环展开和代码预加载为 C++ 程序编写优化的 ARM 代码?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆