How to affect Delphi XEx code generation for Android/ARM targets?

Question

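Update 2017-05-17. I no longer work for the company where this question originated, and I do not have access to Delphi XEx. While I was there, the problem was solved by migrating to mixed FPC+GCC (Pascal+C), with NEON intrinsics for some routines where it made a difference. (FPC+GCC is also highly recommended because it allows the use of standard tools, particularly Valgrind.) If someone can demonstrate, with credible examples, how they actually produce optimized ARM code from Delphi XEx, I'm happy to accept the answer.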

Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications, and I would like to know how to make Delphi generate more efficient code. Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually a compiler has many options that affect code compilation and optimization, but Delphi's ARM targets seem to offer just "optimization on/off" and that's it.

LLVM is supposed to be capable of producing reasonably tight and sensible code, but it seems that Delphi is using its facilities in a weird way. Delphi wants to use the stack very heavily, and it generally uses only the processor's registers r0-r3 as temporary variables. Perhaps craziest of all, it seems to load normal 32-bit integers with four 1-byte load operations. How can I make Delphi produce better ARM code, without the byte-by-byte loading it generates for Android?

At first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that was not the case; it is really just loading a 32-bit number with four single-byte loads. It might be doing this to load the full 32 bits without performing an unaligned word-sized memory load. (Whether it should avoid that is another matter, which would hint that the whole thing is a compiler bug.)
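For illustration, the byte-by-byte pattern described above can be written out in C. This is only a sketch under stated assumptions, not what Delphi emits: `load_u32_bytewise` and `load_u32_memcpy` are hypothetical helper names, and the byte order assumed is little-endian, as on Android/ARM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a 32-bit value from four single-byte loads,
   assuming little-endian byte order. It never performs a word-sized
   access, so it works for any address, aligned or not. */
uint32_t load_u32_bytewise(const unsigned char *p)
{
    assert(p != NULL);
    uint32_t lo = (uint32_t)p[0] | ((uint32_t)p[1] << 8);  /* bytes 0..1 */
    uint32_t hi = (uint32_t)p[2] | ((uint32_t)p[3] << 8);  /* bytes 2..3 */
    return lo | (hi << 16);
}

/* Hypothetical helper: copy four bytes into a properly aligned local.
   The result uses the host byte order, and the compiler is free to
   choose how the copy is implemented. */
uint32_t load_u32_memcpy(const unsigned char *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```

Both helpers can read from an odd address such as `buf + 1`, which a plain `*(uint32_t *)(buf + 1)` dereference does not guarantee in C.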

Let's look at this simple function:

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

Even with optimizations switched on, Delphi XE7 with update pack 1, as well as XE6, produces the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:

00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   78c1        ldrb    r1, [r0, #3]
   a:   7882        ldrb    r2, [r0, #2]
   c:   ea42 2101   orr.w   r1, r2, r1, lsl #8
  10:   7842        ldrb    r2, [r0, #1]
  12:   7803        ldrb    r3, [r0, #0]
  14:   ea43 2202   orr.w   r2, r3, r2, lsl #8
  18:   ea42 4101   orr.w   r1, r2, r1, lsl #16
  1c:   9101        str r1, [sp, #4]
  1e:   9000        str r0, [sp, #0]
  20:   4608        mov r0, r1
  22:   b003        add sp, #12
  24:   bd80        pop {r7, pc}

Just count the number of instructions and memory accesses Delphi needs for that. And it constructs a 32-bit integer from four single-byte loads... If I change the function a little and use a var parameter instead of a pointer, the result is slightly less convoluted:

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:

00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   6801        ldr r1, [r0, #0]
   a:   9101        str r1, [sp, #4]
   c:   9000        str r0, [sp, #0]
   e:   4608        mov r0, r1
  10:   b003        add sp, #12
  12:   bd80        pop {r7, pc}

I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and that code is almost but not exactly the same as the Android var parameter version. To clarify, the byte-by-byte loading occurs only on Android, and only on Android do the pointer and var parameter versions differ from each other. On iOS, both versions generate exactly the same code.

For comparison, here is what FPC 2.7.1 (SVN trunk version from March 2014) produces for the function at optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:

   0:   6800        ldr r0, [r0, #0]
   2:   46f7        mov pc, lr

I also tested an equivalent C function with the C compiler that comes with the Android NDK.

int ReadInteger(int *APInteger)
{
    return *APInteger;
}

This compiles into essentially the same code that FPC generated:

Disassembly of section .text._Z11ReadIntegerPi:

00000000 <_Z11ReadIntegerPi>:
   0:   6800        ldr r0, [r0, #0]
   2:   4770        bx  lr

Recommended answer

We are investigating the issue. In short, it depends on the potential misalignment (to a 32-bit boundary) of the Integer referenced by a pointer. We need a little more time to have all of the answers... and a plan to address this.

Marco Cantù, moderator on Delphi Developers
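To make the misalignment concern concrete, here is a small C sketch of an alignment check. It is an illustration only: `is_aligned` is a hypothetical helper, not part of Delphi, LLVM, or the NDK, and it assumes the alignment is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when p is a multiple of `alignment`
   bytes, 0 otherwise. `alignment` must be a nonzero power of two. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

A pointer to an ordinary `Integer` variable normally passes this check for 4 bytes. A pointer into the middle of a byte buffer or a packed record may not, and a compiler that cannot rule that case out may fall back to byte-wise loads.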

See also "Why are the Delphi zlib and zip libraries so slow under 64 bit?", since the Win64 libraries are shipped built without optimizations.

In the QP report RSP-9922, "Bad ARM code produced by the compiler, $O directive ignored?", Marco added the following explanation:

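There are multiple issues here:

  • As indicated above, optimization settings apply only to entire unit files, not to individual functions. Simply put, turning optimization on and off within the same file will have no effect.
  • Furthermore, simply having "Debug information" enabled turns off optimization. Thus, when debugging, explicitly turning on optimizations will have no effect. Consequently, the CPU view in the IDE cannot display a disassembled view of optimized code.
  • Third, loading non-aligned 64-bit data is not safe and does result in errors, hence the four separate one-byte operations that are needed in the given scenarios.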
