Faster 64-bit/32-bit division algorithm for ARM / NEON?


Question

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken as 32 bits. Together these two places account for more than 20% of my total run time, so I think removing the 64-bit division would speed the code up considerably. NEON has some 64-bit instructions. Can anyone suggest a routine with a faster implementation to resolve this bottleneck?

Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.

If anyone has an idea, could you please help me out?

Recommended answer

I have done a lot of fixed-point arithmetic in the past and spent a lot of time looking for fast 64/32-bit divisions myself. If you google for 'ARM division' you will find tons of great links and discussions about this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting the code to your toolchain. It is the fastest division code for your special case I've seen so far, and trust me: I've benchmarked them all :-)

Last time I did that (about 5 years ago, for Cortex-A8) this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting, though I'm not sure it would improve performance much.

I found the code ported to GAS (the GNU toolchain assembler). This code is working and tested:

Divide.S

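.section ".text"

.global udiv64

@ uint32_t udiv64(lo, hi, divisor)
@ in:  r0 = low word, r1 = high word of the dividend, r2 = divisor
@ out: r0 = 32-bit quotient
udiv64:
    adds      r0,r0,r0        @ shift the 64-bit dividend r1:r0 left by one bit
    adc       r1,r1,r1

    .rept 31
        cmp     r1,r2         @ carry set if r1 >= r2
        subcs   r1,r1,r2      @ subtract the divisor when it fits
        adcs    r0,r0,r0      @ shift the quotient bit into r0, next dividend bit out
        adc     r1,r1,r1      @ shift that dividend bit into r1
    .endr

    cmp     r1,r2             @ last quotient bit
    subcs   r1,r1,r2
    adcs    r0,r0,r0

    bx      lr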
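
The routine above reads as a fully unrolled restoring (shift-and-subtract) division. As a rough illustration only, the same steps can be sketched in portable C. The name udiv64_model is made up for this sketch, and the preconditions (hi < d and d no larger than 2^31) are assumptions read off the assembly, not something the answer states.

```c
#include <stdint.h>

/* Hypothetical portable model of the shift-and-subtract loop.
 * Divides the 64-bit value hi:lo by d and returns the 32-bit quotient.
 * Assumes hi < d (so the quotient fits in 32 bits) and d <= 2^31
 * (so the partial remainder never loses a bit when shifted). */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    uint32_t q = 0;
    for (int i = 0; i < 32; i++) {
        /* Shift the 64-bit dividend left by one bit. */
        hi = (hi << 1) | (lo >> 31);
        lo <<= 1;
        /* Make room for the next quotient bit. */
        q <<= 1;
        /* If the divisor fits into the partial remainder, subtract it. */
        if (hi >= d) {
            hi -= d;
            q |= 1u;
        }
    }
    return q;
}
```

The assembly version saves a register by shifting the quotient bits into r0 while the dividend bits are shifted out of it; the model keeps a separate q to make the steps easier to follow.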

C code:

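#include <stdint.h>

#ifdef __cplusplus
extern "C"
#endif
uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with a 64-bit intermediate result;
   b must be non-zero and the quotient must fit in 32 bits */
{
  int32_t q;
  int sign = (a^b) < 0; /* different signs */
  uint32_t l,h,ua,ub;
  /* take magnitudes in unsigned arithmetic so INT32_MIN does not overflow */
  ua = a<0 ? 0u-(uint32_t)a : (uint32_t)a;
  ub = b<0 ? 0u-(uint32_t)b : (uint32_t)b;
  l = (ua << 24); /* low 32 bits of ua<<24 */
  h = (ua >> 8);  /* high 32 bits of ua<<24 */
  q = (int32_t)udiv64 (l,h,ub);
  if (sign) q = -q;
  return q;
}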
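
For checking fixdiv24 against a known-good result, a plain C baseline can be handy. The sketch below is an assumption about what such a baseline might look like, not code from the answer, and fixdiv24_ref is a made-up name. On 32-bit ARM targets a compiler typically lowers this 64-bit division to a generic runtime library call (for example __aeabi_ldivmod under the ARM EABI), which is presumably the kind of compiler-generated code the answer benchmarked against.

```c
#include <stdint.h>

/* Hypothetical baseline: (a << 24) / b with a 64-bit intermediate.
 * Assumes b != 0 and that the quotient fits in 32 bits.
 * Like fixdiv24, the result is truncated toward zero. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    /* Multiply instead of shifting so a negative a stays well defined in C. */
    int64_t wide = (int64_t)a * (INT64_C(1) << 24);
    return (int32_t)(wide / b);
}
```

With 24 fractional bits, 1.0 is 1 << 24, so fixdiv24_ref(1 << 24, 2 << 24) yields 1 << 23, that is 0.5.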
