使用Newton-Raphson方法在x87 FPU上建立多维数据集根 [英] Cube root on x87 FPU using Newton-Raphson method

查看:99
本文介绍了使用Newton-Raphson方法在x87 FPU上建立多维数据集根的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用8086处理器编写一个汇编程序,该程序将找到数字的立方根.显然我正在使用浮点数.

基于 Newton-Raphson方法的算法:

root := 1.0; 
repeat
     oldRoot := root;
     root := (2.0*root + x/(root*root)) / 3.0 
until ( |root – oldRoot| < 0.001;

如何将(2 * root + x)除以(root * root)?

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
inttwo  DWORD   2
itThree DWORD   3
three   REAL4   3.0
x       DOWRD   27


.CODE
main    PROC
        finit           ; initialize FPU
        fld     root    ; root in ST
        fmul    two     ; root*two
        fadd    x       ; root*two+27

        fld     root    ; root in ST
        fimul    root    ; root*root

        mov     eax, 0  ; exit
        ret
main    ENDP 
END     

我想我不理解堆栈中什么位置的内容.产品行

纤维根;根*根

进入ST(1)?编辑不,它进入st(0)就是st(0)中的内容被压入堆栈到st(1)

但是我还没有弄清楚我问题的答案... 如何除法?现在我看到我需要将st(1)除以st(0),但是我不知道如何.我尝试过了.

finit           ; initialize FPU
fld     root    ; root in ST
fmul    two     ; root*two
fadd    xx      ; root*two+27
; the answer to root*two+x is stored in ST(0) when we load root st(0) moves to ST1 and we will use ST0 for the next operation

fld     root    ; root in ST previous content is now in ST1
fimul   root    ; root*root
fidiv   st(1)

我的公式写错了.这就是我想要的.

(2.0*root) + x / (root*root)) / 3.0 That's what I need. 
STEP 1) (2 * root) 
STEP 2) x / (root * root) 
STEP 3) ADD step one and step 2 
STEP 4) divide step 3 by 3.0

root =(2.0 * 1.0)+ 27/(1.0 * 1.0)/3.0; (2)+ 27/(1.0)/3.0 = 11 ==>根= 11

新代码!

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
three   REAL4   3.0
xx      REAL4   27.0


.CODE
main    PROC
        finit           ; initialize FPU
                fld     root    ; root in ST    ; Pass 1 ST(0) has 1.0  
repreatAgain:
        ;fld    st(2)

        fmul    two     ; root*two      ; Pass 1 ST(0) has 2                                                                            Pass 2 ST(0) = 19.333333 st(1) = 3.0 st(2) = 29.0 st(3) = 1.0

        ; the answer to roor*two is stored in ST0 when we load rootSTO moves to ST1 and we will use ST0 for the next operation
        fld     root    ; root in ST(0) previous content is now in ST(1)      Pass 1 ST(0) has 1.0 ST(1) has 2.0                        Pass 2 st(
        fmul    st(0), st(0)    ; root*root                                 ; Pass 1 st(0) has 1.0 st(1) has 2.0
        fld     xx                                                          ; Pass 1 st(0) has 27.0 st(1) has 1.0 st(2) has 2.0
        fdiv    st(0), st(1) ; x / (root*root)  ; Pass 1: 27 / 1              Pass 1 st(0) has 27.0 st(1) has 2.0 st(2) has 2.0
        fadd    st(0), st(2) ; (2.0*root) + x / (root*root))                  Pass 1 st(0) has 29.0 st(1) has 1.0 st(2) has 2.0

        fld     three                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) has 1.0 st(3) has 2.0

        fld     st(1)                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) = 1.0 st(3) = 2.0
        fdiv    st(0), st(1) ; (2.0*root) + x / (root*root)) / 3.0            Pass 1 st(1) has 9.6666666666667



        jmp     repreatAgain
        mov     eax, 0  ; exit
        ret
main    ENDP 
END   

解决方案

英特尔的insn参考手册记录了所有说明,包括此x87 FP比较问答..在标签wiki的问题中查看更多链接. >


re:EDIT2代码转储:

循环中有4条fld指令,但是没有p后缀的操作.您的循环将在第3次迭代中使8寄存器FP堆栈溢出,这时您将获得NaN. (具体来说,是不确定值NaN,printf将其打印为1#IND.

我建议设计循环,以使迭代以st(0)中的root开始,并以st(0) 中的下一个迭代的root值结束.不要在循环内从root加载或存储.使用fld1在循环外加载1.0作为您的初始值,并在循环后使用fstp [root]st(0)弹出到内存中.


您选择了最不方便的方法来制作tmp/3.0

                          ; stack = tmp   (and should otherwise be empty once you fix the rest of your code)
    fld     three         ; stack = 3.0, tmp
    fld     st(1)         ; stack = tmp, 3.0, tmp   ; should have used fxchg to just swap instead of making the stack deeper
    fdiv    st(0), st(1)  ; stack = tmp/3.0, 3.0, tmp

fdivfsub等具有多种寄存器-寄存器形式:一种以st(0)为目的地,另一种以其为源.带有st(0)作为来源的表单也可以与p op一起使用,因此您可以

    fld     three         ; stack = 3.0, tmp
    fdivp                 ; stack = tmp / 3.0  popping the stack back to just one entry
    ; fdivp  st(1), st(0) ; this is what fdivp with no operands means

实际上,比起直接使用内存操作数而不是加载操作数,它甚至更简单.由于您需要st(0) /= 3.0,因此您可以执行fdiv [three] .在这种情况下,FP操作就像整数操作一样,您可以在div dword ptr [integer_from_memory]中使用内存源操作数.

非交换运算(减和除)也具有相反的版本(例如fdivr),即使您需要3.0/tmp而不是3.0/tmp,也可以为您节省fxchg或使用内存操作数tmp/3.0


除以3等于乘以1/3 ,并且fmulfdiv快得多.从代码简单性的角度来看,乘法是可交换的,因此实现st(0) /= 3的另一种方法是:

fld    [one_third]
fmulp                  ; shorthand for  fmulp st(1), st(0)

; or
fmul   [one_third]

请注意,1/3.0在二进制浮点数中没有确切的表示形式,但是+/-大约2 ^ 23之间的所有整数都可以(单精度REAL4的尾数大小).仅当您期望使用三的精确倍数时,才应该关心这一点.


对原始代码的评论:

您可以提前执行2.0 / 3.0x/3.0来进行循环除法.如果您希望循环平均运行一次以上的迭代,这是值得的.


您可以使用fld st(0)复制堆栈的顶部,因此不必继续从内存中加载数据.


fimul [root](整数 mul)是一个错误:您的root采用REAL4(32位浮点)格式,而不是整数. fidiv同样是一个错误,当然不能与x87寄存器作为源操作数一起使用.

由于堆栈顶部有root,我想您可以fmul st(0)既可以将st(0)用作显式操作数,也可以使用隐式操作数,从而生成st(0) = st(0) * st(0),而不会改变深度堆栈.


您还可以将sqrt用作比1.0更好的初始近似值,或者将+/-1 * sqrtf(fabsf(x))用作.我没有看到x87指令将一个浮点数的符号应用于另一浮点数,只是fchs可以无条件地翻转,而fabs可以无条件地清除符号位.有一个fcmov,但是它需要P6或更高版本的CPU.您提到8086,但随后使用了.586,因此IDK定位您的目标.


更好的循环体:

未经调试或测试,但是您的代码充满了来自相同数据的重复加载,这使我发疯.这个优化的版本在这里是因为我很好奇,而不是因为我认为它将直接帮助OP.

此外,希望这是一个如何在棘手的地方用代码注释数据流的好例子. (例如x87或带有改组的矢量化代码).

## x/3.0 in st(1)
## 2.0/3.0 in st(2)

# before each iteration: st(0) = root
#  after each iteration: st(0) = root * 2.0/3.0 + (x/3.0 / (root*root)), with constants undisturbed

loop_body:
    fld     st(0)         ; stack: root, root, 2/3, x/3
    fmul    st(0), st(0)  ; stack: root^2, root, 2/3, x/3
    fdivr   st(0), st(3)  ; stack: x/3 / root^2, root, 2/3, x/3
    fxchg   st(1)         ; stack: root, x/3/root^2, 2/3, x/3
    fmul    st(0), st(2)  ; stack: root*2/3, x/3/root^2, 2/3, x/3
    faddp                 ; stack: root*2/3 + x/3/root^2, 2/3, x/3

; TODO: compare and loop back to loop_body

    fstp    [root]         ; store and pop
    fstp    st(0)          ; pop the two constants off the FP stack to empty it before returning
    fstp    st(0)
    ; finit is very slow, ~80cycles, don't use it if you don't have to.

32位函数调用约定在st(0)中返回FP结果,因此您可以执行此操作,但是调用者可能必须将其存储在某个地方.

I am trying to write an assembly program using the 8086 processor that will find the cube root of a number. Obviously I am using floating points.

Algorithm based upon Newton-Raphson method:

root := 1.0; 
repeat
     oldRoot := root;
     root := (2.0*root + x/(root*root)) / 3.0 
until ( |root – oldRoot| < 0.001;

How do I divide (2*root + x) by (root*root)?

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
inttwo  DWORD   2
itThree DWORD   3
three   REAL4   3.0
x       DOWRD   27


.CODE
main    PROC
        finit           ; initialize FPU
        fld     root    ; root in ST
        fmul    two     ; root*two
        fadd    x       ; root*two+27

        fld     root    ; root in ST
        fimul    root    ; root*root

        mov     eax, 0  ; exit
        ret
main    ENDP 
END     

I guess I don't understand what is in the stack at what location. Does the product for line

fimul root ; root*root

go into ST(1)? EDIT No, it goes into st(0) what was in st(0) got pushed down the stack to st(1)

But I haven't figured out the answer to my question... How do I divide? Now I see I need to divide st(1) by st(0) but I don't know how. I tried this.

finit           ; initialize FPU
fld     root    ; root in ST
fmul    two     ; root*two
fadd    xx      ; root*two+27
; the answer to root*two+x is stored in ST(0) when we load root st(0) moves to ST1 and we will use ST0 for the next operation

fld     root    ; root in ST previous content is now in ST1
fimul   root    ; root*root
fidiv   st(1)

EDIT: I had the formula written wrong. This is what I am looking for.

(2.0*root) + x / (root*root)) / 3.0 That's what I need. 
STEP 1) (2 * root) 
STEP 2) x / (root * root) 
STEP 3) ADD step one and step 2 
STEP 4) divide step 3 by 3.0

root = (2.0*1.0) + 27/(1.0*1.0) / 3.0 ; (2) + 27/(1.0) / 3.0 = 11 ==> root = 11

EDIT2: NEW CODE!!

.586
.MODEL FLAT
.STACK 4096

.DATA
root    REAL4   1.0
oldRoot REAL4   2.0
Two     REAL4   2.0
three   REAL4   3.0
xx      REAL4   27.0


.CODE
main    PROC
        finit           ; initialize FPU
                fld     root    ; root in ST    ; Pass 1 ST(0) has 1.0  
repreatAgain:
        ;fld    st(2)

        fmul    two     ; root*two      ; Pass 1 ST(0) has 2                                                                            Pass 2 ST(0) = 19.333333 st(1) = 3.0 st(2) = 29.0 st(3) = 1.0

        ; the answer to roor*two is stored in ST0 when we load rootSTO moves to ST1 and we will use ST0 for the next operation
        fld     root    ; root in ST(0) previous content is now in ST(1)      Pass 1 ST(0) has 1.0 ST(1) has 2.0                        Pass 2 st(
        fmul    st(0), st(0)    ; root*root                                 ; Pass 1 st(0) has 1.0 st(1) has 2.0
        fld     xx                                                          ; Pass 1 st(0) has 27.0 st(1) has 1.0 st(2) has 2.0
        fdiv    st(0), st(1) ; x / (root*root)  ; Pass 1: 27 / 1              Pass 1 st(0) has 27.0 st(1) has 2.0 st(2) has 2.0
        fadd    st(0), st(2) ; (2.0*root) + x / (root*root))                  Pass 1 st(0) has 29.0 st(1) has 1.0 st(2) has 2.0

        fld     three                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) has 1.0 st(3) has 2.0

        fld     st(1)                                                       ; Pass 1 st(0) has 3.0 st(1) has 29.0 st(2) = 1.0 st(3) = 2.0
        fdiv    st(0), st(1) ; (2.0*root) + x / (root*root)) / 3.0            Pass 1 st(1) has 9.6666666666667



        jmp     repreatAgain
        mov     eax, 0  ; exit
        ret
main    ENDP 
END   

解决方案

Intel's insn reference manual documents all the instructions, including fdiv and fdivr (x/y instead of y/x). If you really need to learn mostly-obsolete x87 (fdiv) instead of SSE2 (divss), then this x87 tutorial is essential reading, esp. the early chapter that explains the register stack. Also see this x87 FP comparison Q&A. See more links in the tag wiki.


re: EDIT2 code dump:

You have 4 fld instructions inside the loop, but no p-suffixed operations. Your loop will overflow the 8-register FP stack on the 3rd iteration, at which point you'll get a NaN. (specifically, the indefinite-value NaN, which printf prints as 1#IND.

I'd suggest designing your loop so an iteration starts with root in st(0), and ends with the next iteration's root value in st(0). Don't load or store to/from root inside the loop. Use fld1 to load 1.0 as your initial value outside the loop, and fstp [root] after the loop to pop st(0) into memory.


You picked the most inconvenient way to do tmp / 3.0

                          ; stack = tmp   (and should otherwise be empty once you fix the rest of your code)
    fld     three         ; stack = 3.0, tmp
    fld     st(1)         ; stack = tmp, 3.0, tmp   ; should have used fxchg to just swap instead of making the stack deeper
    fdiv    st(0), st(1)  ; stack = tmp/3.0, 3.0, tmp

fdiv, fsub, etc. have multiple register-register forms: one where st(0) is the destination, and one where it's the source. The form with st(0) as the source is also available with a pop, so you could

    fld     three         ; stack = 3.0, tmp
    fdivp                 ; stack = tmp / 3.0  popping the stack back to just one entry
    ; fdivp  st(1), st(0) ; this is what fdivp with no operands means

It's actually even simpler than that if you use a memory operand directly instead of loading it. Since you want st(0) /= 3.0, you can do fdiv [three]. In that case, FP ops are just like integer ops, where you can do div dword ptr [integer_from_memory] to use a memory source operand.

The non-commutative operations (subtract and divide) also have reverse versions (e.g. fdivr), which can save you an fxchg or let you use a memory operand even if you'd needed 3.0/tmp instead of tmp/3.0


Dividing by 3 is the same as multiplying by 1/3, and fmul is much faster than fdiv. From a code-simplicity point of view, multiply is commutative, so another way to implement st(0) /= 3 is:

fld    [one_third]
fmulp                  ; shorthand for  fmulp st(1), st(0)

; or
fmul   [one_third]

Note that 1/3.0 has no exact representation in binary floating point, but all integers between +/- about 2^23 do (size of mantissa of single-precision REAL4). You should only care about this if you were expecting to work with exact multiples of three.


Comments on the original code:

You can hoist a division out of the loop by doing 2.0 / 3.0 and x/3.0 ahead of time. This is worth it if you expect the loop to run more than one iteration on average.


You can duplicate the top of the stack with fld st(0), so you don't have to keep loading from memory.


fimul [root] (integer mul) is a bug: Your root is in REAL4 (32bit float) format, not integer. fidiv is similarly a bug, and of course doesn't work with an x87 register as a source operand.

Since you have root at the top of the stack, I think you can just fmul st(0) to use st(0) as both the explicit and implicit operand, resulting in st(0) = st(0) * st(0), with no change in the depth of the stack.


You could also use sqrt as a better initial approximation than 1.0, or maybe +/-1 * sqrtf(fabsf(x)). I don't see an x87 instruction for applying the sign of one float to another, just fchs to unconditionally flip, and fabs to unconditionally clear the sign bit. There is an fcmov, but it requires a P6 or later CPU. You mentioned 8086, but then used .586, so IDK what you're targeting.


Better loop body:

Not debugged or tested, but your code full of repeated loads from the same data was making me crazy. This optimized version is here because I was curious, not because I think it's going to help the OP directly.

Also, hopefully this is a good example of how to comment the data flow in code where it's tricky. (e.g. x87, or vectorized code with shuffles).

## x/3.0 in st(1)
## 2.0/3.0 in st(2)

# before each iteration: st(0) = root
#  after each iteration: st(0) = root * 2.0/3.0 + (x/3.0 / (root*root)), with constants undisturbed

loop_body:
    fld     st(0)         ; stack: root, root, 2/3, x/3
    fmul    st(0), st(0)  ; stack: root^2, root, 2/3, x/3
    fdivr   st(0), st(3)  ; stack: x/3 / root^2, root, 2/3, x/3
    fxchg   st(1)         ; stack: root, x/3/root^2, 2/3, x/3
    fmul    st(0), st(2)  ; stack: root*2/3, x/3/root^2, 2/3, x/3
    faddp                 ; stack: root*2/3 + x/3/root^2, 2/3, x/3

; TODO: compare and loop back to loop_body

    fstp    [root]         ; store and pop
    fstp    st(0)          ; pop the two constants off the FP stack to empty it before returning
    fstp    st(0)
    ; finit is very slow, ~80cycles, don't use it if you don't have to.

32bit function calling-conventions return FP results in st(0), so you could do that, but then the caller probably have to store somewhere.

这篇关于使用Newton-Raphson方法在x87 FPU上建立多维数据集根的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆