64bit/32bit division faster algorithm for ARM / NEON?


Question

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken as 32 bits. Together, these two places account for more than 20% of my total run time. If I could get rid of the 64-bit division, I could optimize the code considerably. NEON provides some 64-bit instructions. Can anyone suggest a routine that resolves this bottleneck with a faster implementation?

Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.

If anyone has an idea, could you please help me out?

Recommended answer

I did a lot of fixed-point arithmetic in the past and researched fast 64/32-bit division extensively myself. If you search for 'ARM division' you will find plenty of good links and discussion on this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting the code to your toolchain. It is the fastest division code for your special case that I have seen so far, and trust me: I've benchmarked them all :-)

The last time I did that (about 5 years ago, for the Cortex-A8), this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting, though I'm not sure it would improve performance much.

I found the code ported to GAS (GNU toolchain) assembler syntax. This code is working and tested:

Divide.S

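.section ".text"

.global udiv64

@ uint32_t udiv64(lo = r0, hi = r1, divisor = r2)
@ Returns the 32-bit quotient of (hi:lo) / divisor in r0.
@ Assumes hi < divisor, otherwise the quotient does not fit in 32 bits.
udiv64:
    adds      r0,r0,r0        @ shift the 64-bit dividend r1:r0 left by one bit
    adc       r1,r1,r1

    @ 31 unrolled shift-and-subtract steps
    .rept 31
        cmp     r1,r2         @ carry = (r1 >= r2), the next quotient bit
        subcs   r1,r1,r2      @ subtract the divisor if it fits
        adcs    r0,r0,r0      @ shift the quotient bit into r0, top bit of r0 out
        adc     r1,r1,r1      @ shift that bit into the partial remainder
    .endr

    cmp     r1,r2             @ final step: last quotient bit, no remainder shift
    subcs   r1,r1,r2
    adcs    r0,r0,r0

    bx      lr                @ quotient is returned in r0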
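
For readers who want to check a port of the routine on a desktop machine, the shift-and-subtract loop can be modelled in portable C. This is only a sketch: the name udiv64_model is made up for illustration, the argument order (low word, high word, divisor) is inferred from how fixdiv24 calls udiv64, and the two assertions state the preconditions the assembly appears to rely on. It is not expected to be fast; it is meant as a bit-for-bit reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of udiv64: divides the 64-bit value (hi:lo) by d
   using a restoring shift-and-subtract loop, one quotient bit per step. */
uint32_t udiv64_model(uint32_t lo, uint32_t hi, uint32_t d)
{
    /* Assumed preconditions: the quotient fits in 32 bits, and the
       partial remainder can be doubled without overflowing 32 bits. */
    assert(hi < d);
    assert(d <= 0x80000000u);

    uint32_t rem = hi;
    uint32_t quo = 0;
    for (int i = 31; i >= 0; --i) {
        /* bring down the next dividend bit, most significant first */
        rem = (rem << 1) | ((lo >> i) & 1u);
        quo <<= 1;
        if (rem >= d) {      /* divisor fits: subtract and record a 1 bit */
            rem -= d;
            quo |= 1u;
        }
    }
    return quo;
}
```

Comparing udiv64_model against the compiler's own 64-bit division over a range of inputs is a cheap way to build confidence before timing the assembly version on the target.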

C code:

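#include <stdint.h>

#ifdef __cplusplus
extern "C"   /* linkage specifier is only valid when compiled as C++ */
#endif
uint32_t udiv64 (uint32_t a, uint32_t b, uint32_t c);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit intermediate result */
{
  int32_t q;
  int sign = (a^b) < 0; /* different signs */
  uint32_t l,h;
  uint32_t ua = a<0 ? 0u-(uint32_t)a : (uint32_t)a; /* |a| without signed overflow */
  uint32_t ub = b<0 ? 0u-(uint32_t)b : (uint32_t)b; /* |b| without signed overflow */
  l = (ua << 24);  /* low 32 bits of |a| << 24 */
  h = (ua >> 8);   /* high 32 bits of |a| << 24 */
  q = (int32_t) udiv64 (l,h,ub);
  if (sign) q = -q;
  return q;
}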
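
To validate fixdiv24, it can help to have a slow but obviously correct reference that uses the compiler's built-in 64-bit division. The sketch below is an assumption-laden illustration, not part of the original answer: fixdiv24_ref is a hypothetical name, and it assumes b is non-zero and that the true quotient fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference: (a << 24) / b computed with plain 64-bit
   division. Slow on cores without hardware divide, so intended for
   testing only. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    assert(b != 0);
    /* multiply instead of shifting so negative a stays well defined */
    int64_t num = (int64_t)a * (1 << 24);
    /* C division truncates toward zero, matching the sign handling
       in fixdiv24; assumes the result fits in int32_t */
    return (int32_t)(num / b);
}
```

With 24 fractional bits, 1.0 is represented as 1 << 24, so fixdiv24_ref(1 << 24, 2 << 24) yields 1 << 23, which is 0.5. Running the same inputs through fixdiv24 on the target and comparing against this reference should expose porting mistakes in the assembly.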
