How to stop GCC from breaking my NEON intrinsics?

Question

I need to write optimized NEON code for a project and I'm perfectly happy to write assembly language, but for portability/maintainability I'm using NEON intrinsics. This code needs to be as fast as possible, so I'm using my experience in ARM optimization to properly interleave instructions and avoid pipe stalls. No matter what I do, GCC works against me and creates slower code full of stalls.

Does anyone know how to have GCC get out of the way and just translate my intrinsics into code?

Here's an example: I have a simple loop which negates and copies floating point values. It works with 4 sets of 4 at a time to allow some time for the memory to load and instructions to execute. There are plenty of registers left over, so it's got no reason to mangle things so badly.

float32x4_t f32_0, f32_1, f32_2, f32_3;
int x;
for (x=0; x<n-15; x+=16)
{
   f32_0 = vld1q_f32(&s[x]);
   f32_1 = vld1q_f32(&s[x+4]);
   f32_2 = vld1q_f32(&s[x+8]);
   f32_3 = vld1q_f32(&s[x+12]);
   __builtin_prefetch(&s[x+64]);
   f32_0 = vnegq_f32(f32_0);
   f32_1 = vnegq_f32(f32_1);
   f32_2 = vnegq_f32(f32_2);
   f32_3 = vnegq_f32(f32_3);
   vst1q_f32(&d[x], f32_0);
   vst1q_f32(&d[x+4], f32_1);
   vst1q_f32(&d[x+8], f32_2);
   vst1q_f32(&d[x+12], f32_3);
} 

This is the code it generates:

vld1.32 {d18-d19}, [r5]
vneg.f32  q9,q9        <-- GCC intentionally causes stalls
add r7,r7,#16
vld1.32 {d22-d23}, [r8]
add r5,r1,r4
vneg.f32 q11,q11   <-- all of my interleaving is undone (why?!!?)
add r8,r3,#256
vld1.32 {d20-d21}, [r10]
add r4,r1,r3
vneg.f32 q10,q10
add lr,r1,lr
vld1.32 {d16-d17}, [r9]
add ip,r1,ip
vneg.f32 q8,q8

More info:


  • GCC 4.9.2 for Raspbian
  • compiler flags: -c -fPIE -march=armv7-a -Wall -O3 -mfloat-abi=hard -mfpu=neon

When I write the loop in ASM code patterned exactly as my intrinsics (without even making use of extra src/dest registers to gain some free ARM cycles), it's still faster than GCC's code.

Update: I appreciate James' answer, but in the scheme of things, it doesn't really help with the problem. The simplest of my functions perform a little better with the cortex-a7 option, but the majority saw no change. The sad truth is that GCC's optimization of intrinsics is not great. When I worked with the Microsoft ARM compiler a few years ago, it consistently created well crafted output for NEON intrinsics while GCC consistently stumbled. With GCC 4.9.x, nothing has changed. I certainly appreciate the FOSS nature of GCC and the greater GNU effort, but there is no denying that it doesn't do as good a job as Intel, Microsoft or even ARM's compilers.

Recommended answer

Broadly, the class of optimisation you are seeing here is known as "instruction scheduling". GCC uses instruction scheduling to try to build a better schedule for the instructions in each basic block of your program. Here, a "schedule" refers to any correct ordering of the instructions in a block, and a "better" schedule might be one which avoids stalls and other pipeline hazards, or one which reduces the live range of variables (resulting in better register allocation), or some other ordering goal on the instructions.

To avoid stalls due to hazards, GCC uses a model of the pipeline of the processor you are targeting (see here for details of the specification language used for these, and here for an example pipeline model). This model gives some indication to the GCC scheduling algorithms of the functional units of a processor, and the execution characteristics of instructions on those functional units. GCC can then schedule instructions to minimise structural hazards due to multiple instructions requiring the same processor resources.

Without a -mcpu or -mtune option (to the compiler), or a --with-cpu or --with-tune option (to the configuration of the compiler), GCC for ARM or AArch64 will try to use a representative model for the architecture revision you are targeting. In this case, -march=armv7-a causes the compiler to try to schedule instructions as if -mtune=cortex-a8 were passed on the command line.

So what you are seeing in your output is GCC's attempt at transforming your input into a schedule it expects to execute well when running on a Cortex-A8, and to run reasonably well on other processors which implement the ARMv7-A architecture.

To improve on this you can try:


  • Setting explicitly the processor you are targeting (-mcpu=cortex-a7)
  • Disabling instruction scheduling entirely (-fno-schedule-insns -fno-schedule-insns2)

Note that disabling instruction scheduling entirely may well cause you problems elsewhere, as GCC will no longer be trying to reduce pipeline hazards across your code.
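
If turning scheduling off for the whole build hurts other code, it may be worth trying to limit it to the hot function. The following is a minimal sketch, not something from the question or the answer above. It assumes GCC's optimize function attribute; the NO_SCHED macro, the negate_copy name and its signature are hypothetical. GCC's manual describes this attribute as a debugging aid rather than something for production code, and how it interacts with the NEON intrinsics may depend on the GCC version, so compare the generated assembly (-S) before relying on it.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper macro (an assumption, not from the original post):
 * on GCC, request -fno-schedule-insns and -fno-schedule-insns2 for one
 * function only. Other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_SCHED __attribute__((optimize("no-schedule-insns", "no-schedule-insns2")))
#else
#define NO_SCHED
#endif

/* Negate and copy n floats from s to d (assumed signature). */
NO_SCHED
void negate_copy(float *d, const float *s, int n)
{
    int x = 0;
#if defined(__ARM_NEON)
    /* Same 4 x 4 interleaving as the loop in the question. */
    for (; x + 16 <= n; x += 16) {
        float32x4_t f32_0 = vld1q_f32(&s[x]);
        float32x4_t f32_1 = vld1q_f32(&s[x + 4]);
        float32x4_t f32_2 = vld1q_f32(&s[x + 8]);
        float32x4_t f32_3 = vld1q_f32(&s[x + 12]);
        __builtin_prefetch(&s[x + 64]);
        f32_0 = vnegq_f32(f32_0);
        f32_1 = vnegq_f32(f32_1);
        f32_2 = vnegq_f32(f32_2);
        f32_3 = vnegq_f32(f32_3);
        vst1q_f32(&d[x], f32_0);
        vst1q_f32(&d[x + 4], f32_1);
        vst1q_f32(&d[x + 8], f32_2);
        vst1q_f32(&d[x + 12], f32_3);
    }
#endif
    /* Scalar loop: handles the remaining elements, or all of them
     * when NEON is not available. */
    for (; x < n; x++)
        d[x] = -s[x];
}
```

The rest of the translation unit keeps the scheduling behaviour selected on the command line, so only this function loses GCC's attempt at avoiding pipeline hazards.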

Edit: With regard to your edit, performance bugs in GCC can be reported in the GCC Bugzilla (see https://gcc.gnu.org/bugs/ ) just as correctness bugs can be. Naturally, all optimisations involve some degree of heuristic, and a compiler may not be able to beat a seasoned assembly programmer, but if the compiler is doing something especially egregious it can be worth highlighting.
