How to affect Delphi XEx code generation for Android/ARM targets?


Question


Embarcadero's Delphi compilers use an LLVM backend to produce native ARM code for Android devices. I have large amounts of Pascal code that I need to compile into Android applications and I would like to know how to make Delphi generate more efficient code. Right now, I'm not even talking about advanced features like automatic SIMD optimizations, just about producing reasonable code. Surely there must be a way to pass parameters to the LLVM side, or somehow affect the result? Usually, any compiler will have many options to affect code compilation and optimization, but Delphi's ARM targets seem to be just "optimization on/off" and that's it.

LLVM is supposed to be capable of producing reasonably tight and sensible code, but it seems that Delphi is using its facilities in a weird way. Delphi wants to use the stack very heavily, and it generally only uses the processor's registers r0-r3 as temporary variables. Perhaps craziest of all, it seems to be loading normal 32-bit integers with four 1-byte load operations. How can I make Delphi produce better ARM code, without the byte-by-byte hassle it creates for Android?

At first I thought the byte-by-byte loading was for swapping byte order from big-endian, but that was not the case; it really is just loading a 32-bit number with four single-byte loads. It might be doing so to load the full 32 bits without performing an unaligned word-sized memory load. (Whether it should avoid that is another matter, which would hint at the whole thing being a compiler bug.)

Let's look at this simple function:

function ReadInteger(APInteger : PInteger) : Integer;
begin
  Result := APInteger^;
end;

Even with optimizations switched on, Delphi XE7 with update pack 1, as well as XE6, produces the following ARM assembly code for that function:

Disassembly of section .text._ZN16Uarmcodetestform11ReadIntegerEPi:

00000000 <_ZN16Uarmcodetestform11ReadIntegerEPi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   78c1        ldrb    r1, [r0, #3]
   a:   7882        ldrb    r2, [r0, #2]
   c:   ea42 2101   orr.w   r1, r2, r1, lsl #8
  10:   7842        ldrb    r2, [r0, #1]
  12:   7803        ldrb    r3, [r0, #0]
  14:   ea43 2202   orr.w   r2, r3, r2, lsl #8
  18:   ea42 4101   orr.w   r1, r2, r1, lsl #16
  1c:   9101        str r1, [sp, #4]
  1e:   9000        str r0, [sp, #0]
  20:   4608        mov r0, r1
  22:   b003        add sp, #12
  24:   bd80        pop {r7, pc}
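For readers less used to Thumb-2 assembly, the ldrb/orr sequence in the listing can be sketched in C roughly as follows. This is only an illustration of what the instructions compute, not what the Delphi compiler works from internally; the function name load_le32_bytewise is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: assemble a little-endian 32-bit value from four
   single-byte loads, combined in the same order as the listing. */
uint32_t load_le32_bytewise(const unsigned char *p)
{
    /* ldrb r1,[r0,#3]; ldrb r2,[r0,#2]; orr.w r1, r2, r1, lsl #8 */
    uint32_t hi = (uint32_t)p[2] | ((uint32_t)p[3] << 8);
    /* ldrb r2,[r0,#1]; ldrb r3,[r0,#0]; orr.w r2, r3, r2, lsl #8 */
    uint32_t lo = (uint32_t)p[0] | ((uint32_t)p[1] << 8);
    /* orr.w r1, r2, r1, lsl #16 */
    return lo | (hi << 16);
}
```

Because every access is a single byte, the result does not depend on whether p is 4-byte aligned, which is presumably the point of the sequence.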

Just count the number of instructions and memory accesses Delphi needs for that. And constructing a 32 bit integer from 4 single-byte loads... If I change the function a little bit and use a var parameter instead of a pointer, it is slightly less convoluted:

Disassembly of section .text._ZN16Uarmcodetestform14ReadIntegerVarERi:

00000000 <_ZN16Uarmcodetestform14ReadIntegerVarERi>:
   0:   b580        push    {r7, lr}
   2:   466f        mov r7, sp
   4:   b083        sub sp, #12
   6:   9002        str r0, [sp, #8]
   8:   6801        ldr r1, [r0, #0]
   a:   9101        str r1, [sp, #4]
   c:   9000        str r0, [sp, #0]
   e:   4608        mov r0, r1
  10:   b003        add sp, #12
  12:   bd80        pop {r7, pc}

I won't include the disassembly here, but for iOS, Delphi produces identical code for the pointer and var parameter versions, and they are almost but not exactly the same as the Android var parameter version. Edit: to clarify, the byte-by-byte loading is only on Android. And only on Android, the pointer and var parameter versions differ from each other. On iOS both versions generate exactly the same code.

For comparison, here's what FPC 2.7.1 (SVN trunk version from March 2014) thinks of the function with optimization level -O2. The pointer and var parameter versions are exactly the same.

Disassembly of section .text.n_p$armcodetest_$$_readinteger$pinteger$$longint:

00000000 <P$ARMCODETEST_$$_READINTEGER$PINTEGER$$LONGINT>:

   0:   6800        ldr r0, [r0, #0]
   2:   46f7        mov pc, lr

I also tested an equivalent C function with the C compiler that comes with the Android NDK.

int ReadInteger(int *APInteger)
{
    return *APInteger;
}

And this compiles into essentially the same thing FPC made:

Disassembly of section .text._Z11ReadIntegerPi:

00000000 <_Z11ReadIntegerPi>:
   0:   6800        ldr r0, [r0, #0]
   2:   4770        bx  lr

Solution

We are investigating the issue. In short, it depends on the potential mis-alignment (to 32 boundary) of the Integer referenced by a pointer. Need a little more time to have all of the answers... and a plan to address this.

Marco Cantù, moderator on Delphi Developers

See also "Why are the Delphi zlib and zip libraries so slow under 64 bit?", as the Win64 libraries are shipped built without optimizations.


In the QP report RSP-9922, "Bad ARM code produced by the compiler, $O directive ignored?", Marco added the following explanation:

There are multiple issues here:

  • As indicated, optimization settings apply only to entire unit files and not to individual functions. Simply put, turning optimization on and off in the same file will have no effect.
  • Furthermore, simply having "Debug information" enabled turns off optimization. Thus, when one is debugging, explicitly turning on optimizations will have no effect. Consequently, the CPU view in the IDE will not be able to display a disassembled view of optimized code.
  • Third, loading non-aligned 64bit data is not safe and does result in errors, hence the separate 4 one byte operations that are needed in given scenarios.
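The alignment concern in the last point can be illustrated with a small C sketch. It assumes nothing about Delphi's internals: is_aligned and read_int32_unaligned are hypothetical names, and memcpy is simply the usual portable C idiom for reading from an address whose alignment is unknown. A C compiler is then free to emit a single word load or several byte loads, depending on what the target permits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment`
   (which must be a power of two), otherwise 0. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hypothetical helper: read a 32-bit integer from an address of any
   alignment. Copying through memcpy avoids dereferencing a possibly
   misaligned int32_t pointer. */
int32_t read_int32_unaligned(const void *p)
{
    int32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

A plain pointer dereference, as in the ReadInteger examples in the question, tells a C compiler that the address is suitably aligned, which is why the NDK compiler can emit a single ldr there.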
