Cortex A9 NEON vs VFP usage confusion


Problem description


    I'm trying to build a library for a Cortex A9 ARM processor (an OMAP4, to be more specific) and I'm a little confused about which of NEON and VFP to use, and when, in the context of floating-point operations and SIMD. Note that I know the difference between the two hardware coprocessor units (as also outlined here on SO); I just have some misunderstanding regarding their proper usage.

    Related to this I'm using the following compilation flags:

    GCC
    -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
    -O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
    ARMCC
    --cpu=Cortex-A9 --apcs=/softfp
    --cpu=Cortex-A9 --fpu=VFPv3 --apcs=/softfp
    

    I've read through the ARM documentation and a lot of wiki pages (like this one), forum posts and blog posts, and everybody seems to agree that using NEON is better than using VFP, or at least that mixing NEON (e.g. using the intrinsics to implement some algorithms in SIMD) and VFP is not such a good idea. I'm not 100% sure yet whether this applies to the entire application or library, or just to specific places (functions) in the code.

    So I'm using NEON as the FPU for my application, as I also want to use the intrinsics. As a result I'm in a bit of trouble, and my confusion about how best to use these features (NEON vs VFP) on the Cortex A9 just deepens instead of clearing up. I have some code that does benchmarking for my app and uses some custom-made timer classes in which calculations are based on double-precision floating point. Using NEON as the FPU gives completely wrong results (trying to print those values results in printing mostly inf and NaN; the same code works without a hitch when built for x86). So I changed my calculations to use single-precision floating point, as it is documented that NEON does not handle double-precision floating point. My benchmarks still don't give the proper results (and what's worse, the code now no longer works on x86 either; I think it's because of the loss of precision, but I'm not sure). So I'm almost completely lost: on one hand I want to use NEON for its SIMD capabilities, but using it as the FPU does not give the proper results; on the other hand, mixing it with the VFP does not seem a very good idea. Any advice in this area will be greatly appreciated!

    I found in the article in the above mentioned wiki a summary of what should be done for floating point optimization in the context of NEON:

    "

    • Only use single precision floating point
    • Use NEON intrinsics / ASM whenever you find a bottlenecking FP function. You can do better than the compiler.
    • Minimize Conditional Branches
    • Enable RunFast mode

    For softfp:

    • Inline floating point code (unless it's very large)
    • Pass FP arguments via pointers instead of by value and do integer work in between function calls.

    "

    I cannot use hard for the float ABI, as I cannot link with the libraries I have available. Most of the recommendations make sense to me (except the "RunFast mode", where I don't understand exactly what it is supposed to do, and the claim that I could currently do better than the compiler), but I keep getting inconsistent results and I'm not sure of anything right now.

    Could anyone shed some light on how to properly use floating point and NEON on the Cortex A9/A8, and which compilation flags I should use?

    Solution

    ... forum and blog posts and everybody seems to agree that using NEON is better than using VFP or at least mixing NEON (e.g. using the intrinsics to implement some algos in SIMD) and VFP is not such a good idea

    I'm not sure this is correct. According to ARM at Introducing NEON Development Article | NEON registers:

    The NEON register bank consists of 32 64-bit registers. If both Advanced SIMD and VFPv3 are implemented, they share this register bank. In this case, VFPv3 is implemented in the VFPv3-D32 form that supports 32 double-precision floating-point registers. This integration simplifies implementing context switching support, because the same routines that save and restore VFP context also save and restore NEON context.

    The NEON unit can view the same register bank as:

    • sixteen 128-bit quadword registers, Q0-Q15
    • thirty-two 64-bit doubleword registers, D0-D31.

    The NEON D0-D31 registers are the same as the VFPv3 D0-D31 registers and each of the Q0-Q15 registers map onto a pair of D registers. Figure 1.3 shows the different views of the shared NEON and VFP register bank. All of these views are accessible at any time. Software does not have to explicitly switch between them, because the instruction used determines the appropriate view.

    The registers don't compete; rather, they co-exist as views of the same register bank. There's no way to separate the NEON and FPU gear.
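To illustrate the shared register bank, here is a minimal sketch in which single-precision SIMD work and double-precision scalar work sit side by side in one translation unit. The function names `add_f32` and `mean_f64` are made up for this example. It assumes a build with `-mfpu=neon` (or any target where the compiler defines `__ARM_NEON`); elsewhere it falls back to plain scalar code, so it is not a statement about what the original poster's code did.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0   /* not built for NEON: scalar fallback only */
#endif

/* Hypothetical helper: element-wise sum of two float arrays.
 * With NEON enabled, four floats are processed per iteration in a Q register. */
static void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop when NEON is unavailable). */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Hypothetical helper: mean accumulated in double precision.
 * On ARMv7 the compiler emits scalar VFP instructions for this, since
 * ARMv7 NEON has no double-precision arithmetic. */
static double mean_f64(const float *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += (double)x[i];
    return n ? sum / (double)n : 0.0;
}
```

Calling `add_f32` and then `mean_f64` on its output needs no mode switch in between; the instructions the compiler selects determine which view of the register bank is used.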


    Related to this I'm using the following compilation flags:

    -O3 -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=softfp
    -O3 -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=softfp
    

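    Here's what I do; your mileage may vary. It's derived from a mashup of information gathered from the platform and the compiler.

    gnueabihf in the compiler's target triple tells me the platform uses the hard-float ABI, which can speed up procedure calls because floating-point arguments are passed in FPU registers. If in doubt, use softfp: it still emits hardware floating-point instructions but keeps the soft-float calling convention, so it links against soft-float libraries. Note that softfp objects and hard objects are not link-compatible with each other.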
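If you want to confirm which float ABI a given set of flags actually selects, the compiler's predefined macros can be inspected. The sketch below is an assumption-laden example rather than a definitive check: `float_abi_name` and `neon_enabled` are names invented here, and it relies on `__ARM_PCS_VFP`, `__ARM_FP` and `__ARM_NEON` being defined the way recent GCC and Clang define them (older toolchains may not define `__ARM_FP`).

```c
/* Hypothetical helper: describe the float ABI this file was compiled for. */
static const char *float_abi_name(void)
{
#if defined(__aarch64__)
    return "aarch64";   /* 64-bit ARM: FP arguments always go in FP/SIMD registers */
#elif defined(__arm__) && defined(__ARM_PCS_VFP)
    return "hard";      /* -mfloat-abi=hard */
#elif defined(__arm__) && defined(__ARM_FP)
    return "softfp";    /* FP instructions allowed, soft-float calling convention */
#elif defined(__arm__)
    return "soft";      /* no FP instructions emitted */
#else
    return "not-arm";
#endif
}

/* Hypothetical helper: 1 if NEON code generation is enabled, else 0. */
static int neon_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

Printing both values from a small test program built with each candidate flag set is a quick way to see whether `-mfpu` and `-mfloat-abi` took effect as intended.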

    BeagleBone Black:

    $ gcc -v 2>&1 | grep Target          
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    model name  : ARMv7 Processor rev 2 (v7l)
    Features    : half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32 
    ...
    

    So the BeagleBone uses:

    -march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=hard
    

    CubieTruck v5:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 5 (v7l)
    Features    : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpv4 
    

    So the CubieTruck uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
    

    Banana Pi Pro:

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo
    Processor   : ARMv7 Processor rev 4 (v7l)
    Features    : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
    

    So the Banana Pi uses:

    -march=armv7-a -mtune=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
    

    Raspberry Pi 3:

    The RPI3 is unique in that it's ARMv8 but runs a 32-bit OS. That means it's effectively 32-bit ARM, or AArch32. There's a little more to 32-bit ARM vs AArch32, but this will show you the AArch32 flags.

    Also, the RPI3 uses a Broadcom A53 SoC, and it has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions.

    $ gcc -v 2>&1 | grep Target 
    Target: arm-linux-gnueabihf
    
    $ cat /proc/cpuinfo 
    model name  : ARMv7 Processor rev 4 (v7l)
    Features    : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
    ...
    

    So the Raspberry Pi can use:

    -march=armv8-a+crc -mtune=cortex-a53 -mfpu=neon-fp-armv8 -mfloat-abi=hard
    

    Or it can use (I don't know what to use for -mtune):

    -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard 
    

    ODROID C2:

    The ODROID C2 uses an Amlogic A53 SoC and runs a 64-bit OS. It has NEON and the optional CRC32 instructions, but lacks the optional Crypto extensions (a similar configuration to the RPI3).

    $ gcc -v 2>&1 | grep Target 
    Target: aarch64-linux-gnu
    
    $ cat /proc/cpuinfo 
    Features    : fp asimd evtstrm crc32
    

    So the ODROID uses:

    -march=armv8-a+crc -mtune=cortex-a53
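The recipes above follow a pattern that can be sketched as a small lookup on the `Features` line of `/proc/cpuinfo`. This is only a heuristic that mirrors the five boards listed here; `has_token` and `suggest_mfpu` are names invented for the example, and the mapping is an assumption, not an official rule.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `token` appears as a whole whitespace-separated word in `list`. */
static int has_token(const char *list, const char *token)
{
    size_t len = strlen(token);
    const char *p = list;
    while ((p = strstr(p, token)) != NULL) {
        int starts = (p == list) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[len] == '\0' || p[len] == ' ' ||
                   p[len] == '\t' || p[len] == '\n';
        if (starts && ends)
            return 1;
        p += 1;   /* e.g. "vfp" inside "vfpv3": keep searching */
    }
    return 0;
}

/* Hypothetical heuristic: map a Features line to a GCC -mfpu value.
 * Returns "" when no -mfpu is needed (AArch64) and NULL when nothing matches. */
static const char *suggest_mfpu(const char *features)
{
    if (has_token(features, "asimd"))
        return "";                       /* 64-bit userland, as on the ODROID C2 */
    if (has_token(features, "neon")) {
        if (has_token(features, "crc32"))
            return "neon-fp-armv8";      /* ARMv8 core running 32-bit, as on the RPI3 */
        if (has_token(features, "vfpv4"))
            return "neon-vfpv4";         /* as on the CubieTruck and Banana Pi */
        return "neon";                   /* as on the BeagleBone Black */
    }
    if (has_token(features, "vfpv3"))
        return has_token(features, "vfpd32") ? "vfpv3" : "vfpv3-d16";
    return NULL;
}
```

Matching whole tokens matters here because `vfp`, `vfpv3` and `vfpv4` share a prefix; a plain substring search would misclassify them.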
    


    For the recipes above, I identified the ARM core (such as Cortex-A9 or Cortex-A53) by inspecting data sheets. According to this answer on Unix and Linux Stack Exchange, which deciphers the output of /proc/cpuinfo:

    CPU part: Part number. 0xd03 indicates Cortex-A53 processor.

    So we may be able to look up the value from a database. I don't know if one exists or where it's located.
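In the absence of a known database, a tiny hand-written table can serve for the cores mentioned in this answer. The sketch below is an illustration only: `cpu_part_name` and the table are invented for this example, the table is far from complete, and the part numbers apply only when the `CPU implementer` field is `0x41` (ARM).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct cpu_part {
    unsigned long id;
    const char *name;
};

/* Small, incomplete table of ARM Ltd. (implementer 0x41) part numbers. */
static const struct cpu_part arm_parts[] = {
    {0xc07, "Cortex-A7"},
    {0xc08, "Cortex-A8"},
    {0xc09, "Cortex-A9"},
    {0xc0f, "Cortex-A15"},
    {0xd03, "Cortex-A53"},
};

/* Hypothetical helper: find the first "CPU part" line in the given
 * /proc/cpuinfo text and return the matching core name, or NULL. */
static const char *cpu_part_name(const char *cpuinfo)
{
    const char *line = strstr(cpuinfo, "CPU part");
    if (line == NULL)
        return NULL;
    const char *colon = strchr(line, ':');
    if (colon == NULL)
        return NULL;
    /* Base 16 accepts leading whitespace and an optional 0x prefix. */
    unsigned long id = strtoul(colon + 1, NULL, 16);
    for (size_t i = 0; i < sizeof arm_parts / sizeof arm_parts[0]; ++i)
        if (arm_parts[i].id == id)
            return arm_parts[i].name;
    return NULL;
}
```

The returned name can then be lowercased into a `-mtune` value such as `cortex-a53`, which is the piece the recipes above took from data sheets.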
