"invalid instruction operands" on mov ah, word_variable, and using imul on 16-bit numbers

Question

Here is what I am trying to achieve: a_x*b_x + a_y*b_y + a_z*b_z

I am trying to make a MACRO in assembly that does the above computation.

I am using WORDs for all of my numbers. Here is my code:

dotProduct   MACRO  A_X,A_Y,A_Z,B_X,B_Y,B_Z ;a.b (a dot b) = a_x*b_x + a_y*b_y + a_z*b_z
    mov ah, A_X
    mov al, B_X
    imul ax
    mov answer, ax
    mov ah, A_Y
    mov al, B_Y
    imul ax
    add answer, ax
    mov ah, A_Z
    mov al, B_Z
    imul ax
    mov answer, ax

    output answer

ENDM

answer BYTE 40 DUP (0)

But I am getting the following errors:

Assembling: plane_line.asm
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(1): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(2): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(4): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(5): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(6): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(8): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(9): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(10): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(12): Macro Called From
  plane_line.asm(101): Main Line Code

I believe it has to do with the way I am handling the registers.

How should I go about doing this?

Answer

Both operands of MOV have to be the same size. AL and AH are byte registers.

MASM-style assemblers infer the size of memory locations from the DW you used after the symbol name. This is why it complains about an operand-size mismatch (with a generic unhelpful error message that also applies to a lot of other problems).

If you actually wanted to load the first byte of A_X into AL, you'd use an override: mov al, BYTE PTR A_X.

But that's not what you want, since you do actually want to load 16-bit numbers. The product of two 16-bit numbers can be up to 32 bits (e.g. 0xffff^2 is 0xfffe0001). So it's probably a good idea to just do 32-bit math.
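
As a rough illustration of why 16 bits are not enough for the product, here is a small C sketch. The function names are made up for this example; it only models the arithmetic and is not part of the original macro.

```c
#include <stdint.h>

/* Full 32-bit product of two unsigned 16-bit values
   (the value a 16-bit MUL would leave in DX:AX). */
uint32_t full_product_u16(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Only the low 16 bits of the same product, i.e. what survives
   if the result is kept in a single 16-bit register. */
uint16_t low_half_u16(uint16_t a, uint16_t b) {
    return (uint16_t)((uint32_t)a * (uint32_t)b);
}
```

With both inputs set to 0xffff, full_product_u16 gives 0xfffe0001 while low_half_u16 keeps only 0x0001.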

You're also using imul incorrectly: imul ax sets DX:AX = AX * AX (producing a 32-bit result in a pair of registers). To multiply AH * AL and get the result in AX, you should have used imul ah. See the instruction reference manual entry for IMUL, and the other links to docs and guides in the x86 tag wiki.
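
The two one-operand forms can be modelled in C as follows. This is a sketch for illustration only: the type dx_ax_t and both function names are hypothetical, and the code describes what the instructions compute rather than how to write them in MASM.

```c
#include <stdint.h>

/* Register pair DX:AX as two 16-bit halves. */
typedef struct { uint16_t dx; uint16_t ax; } dx_ax_t;

/* Model of `imul r/m16` (for example `imul ax`):
   DX:AX = AX * operand, as a signed 32-bit product. */
dx_ax_t imul16_model(int16_t ax, int16_t operand) {
    int32_t product = (int32_t)ax * (int32_t)operand;
    dx_ax_t r;
    r.ax = (uint16_t)((uint32_t)product & 0xffffu);  /* low half  */
    r.dx = (uint16_t)((uint32_t)product >> 16);      /* high half */
    return r;
}

/* Model of `imul r/m8` (for example `imul ah`):
   AX = AL * operand, as a signed 16-bit product. */
int16_t imul8_model(int8_t al, int8_t operand) {
    return (int16_t)((int16_t)al * (int16_t)operand);
}
```

With AH = 3 and AL = 5, AX holds 0x0305, so the imul ax model squares 0x0305 and returns DX = 0x0009, AX = 0x1E19, whereas the imul ah model returns 15.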

The two-operand form of IMUL is easier to use. It works exactly like ADD, with a destination and a source, producing one result. (It doesn't store the high half of the full-multiply result anywhere, but that's fine for this use-case).

To set up for a 32-bit IMUL, use MOVSX to sign-extend from DW 16-bit memory locations into 32-bit registers.

Anyway, this is what you should do:

movsx   eax, A_X       ; sign-extend A_X into a 32-bit register
movsx   ecx, B_X       ; sign-extend B_X into a different register
imul    eax, ecx       ; eax = A_X * B_X  (as a 32-bit signed integer)

movsx   edx, A_Y
movsx   ecx, B_Y
imul    edx, ecx       ; edx = A_Y * B_Y  (signed int)
add     eax, edx       ; add to the previous result in eax.

movsx   edx, A_Z
movsx   ecx, B_Z
imul    edx, ecx       ; edx = A_Z * B_Z  (signed int)
add     eax, edx       ; add to the previous result in eax
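
For comparison, the same computation can be sketched in C. The name dot3_i16 is invented for this example; the casts play the role of MOVSX, and the sum is accumulated in unsigned arithmetic so that it wraps the way the 32-bit ADD does instead of relying on signed overflow.

```c
#include <stdint.h>

/* Dot product of two 3-element vectors of signed 16-bit values,
   computed with 32-bit multiplies and 32-bit adds. */
int32_t dot3_i16(int16_t ax, int16_t ay, int16_t az,
                 int16_t bx, int16_t by, int16_t bz) {
    /* Each product of two int16_t values fits in an int32_t. */
    uint32_t sum = (uint32_t)((int32_t)ax * (int32_t)bx);
    sum += (uint32_t)((int32_t)ay * (int32_t)by);
    sum += (uint32_t)((int32_t)az * (int32_t)bz);
    return (int32_t)sum;  /* wraps like the register would */
}
```

For example, dot3_i16(1, 2, 3, 4, 5, 6) evaluates to 32.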

I'm not sure how your "output" function / macro is supposed to work, but storing the integer into an array of bytes BYTE 40 DUP (0) seems unlikely. You could do it with mov dword ptr [answer], eax, but maybe you should just output eax. Or if output answer converts eax to a string stored in answer, then you don't need the mov first.

I'm assuming your numbers are signed 16-bit to start with. This means that your dot-product can overflow if all the inputs are INT16_MIN (i.e. -32768 = 0x8000): 0x8000^2 = 0x40000000, which is more than half of INT32_MAX, so the sum of three such products does not fit in a signed 32-bit integer. So 32-bit ADDs aren't quite safe, but I assume you're ok with that and don't want to add-with-carry.
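
To make the worst case concrete, here is a small C sketch that computes the exact sum in 64 bits and checks whether it fits in 32. Both function names are hypothetical and exist only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Exact dot product, wide enough that it can never overflow. */
int64_t dot3_exact(int16_t ax, int16_t ay, int16_t az,
                   int16_t bx, int16_t by, int16_t bz) {
    return (int64_t)ax * bx + (int64_t)ay * by + (int64_t)az * bz;
}

/* True if the exact result is representable as a signed 32-bit integer. */
bool dot3_fits_int32(int16_t ax, int16_t ay, int16_t az,
                     int16_t bx, int16_t by, int16_t bz) {
    int64_t exact = dot3_exact(ax, ay, az, bx, by, bz);
    return exact >= INT32_MIN && exact <= INT32_MAX;
}
```

With all six inputs equal to INT16_MIN the exact result is 3 * 0x40000000 = 3221225472, which is larger than INT32_MAX (2147483647).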

Another way: We could use 16-bit IMUL instructions, so we can use it with a memory operand instead of having to separately load with sign-extension. This is a lot less convenient if you do want the full 32-bit result, though, so I'll just illustrate using the low half only.

mov    ax, A_X
imul   B_X         ; DX:AX  = ax * B_X
mov    cx, ax      ; save the low half of the result somewhere else so we can do another imul B_Y  and  add cx, ax

;or
mov    cx, A_X
imul   cx, B_X     ; result in cx
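
What "using the low half only" means for the final answer can be sketched in C. The name dot3_low16 is made up for this example; it assumes every product and every add is truncated to 16 bits, as happens when the whole computation stays in 16-bit registers.

```c
#include <stdint.h>

/* Dot product where each product and each add keeps only the low 16 bits. */
int16_t dot3_low16(int16_t ax, int16_t ay, int16_t az,
                   int16_t bx, int16_t by, int16_t bz) {
    uint16_t sum = (uint16_t)((int32_t)ax * (int32_t)bx);
    sum = (uint16_t)(sum + (uint16_t)((int32_t)ay * (int32_t)by));
    sum = (uint16_t)(sum + (uint16_t)((int32_t)az * (int32_t)bz));
    return (int16_t)sum;  /* reinterpret the 16-bit pattern as signed */
}
```

Small inputs give the expected answer, for example dot3_low16(1, 2, 3, 4, 5, 6) is 32, but 300 * 300 = 90000 is truncated to 24464.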


Stop reading here, the rest of this is not useful for beginners.

The fun way: SSE4.1 has a SIMD horizontal dot-product instruction.

; Assuming A_X, A_Y, and A_Z are stored contiguously, and same for B_XYZ
pmovsxwd   xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pmovsxwd   xmm1, qword ptr [B_X]  ; sign-extend from 16-bit elements to 32
cvtdq2ps   xmm0, xmm0             ; convert in-place from signed int32 to float
cvtdq2ps   xmm1, xmm1

dpps       xmm0, xmm1,  0b01110001  ; top 4 bits: sum the first 3 elements, ignore the top one.  Low 4 bits: put the result only in the low element

cvtss2si   eax, xmm0              ; convert back to signed 32-bit integer
; eax = dot product = a_x*b_x + a_y*b_y + a_z*b_z.
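
The role of the DPPS immediate can be modelled with a scalar C sketch. The function name dpps_model is hypothetical, and the model ignores the exact order in which the hardware adds the partial products. Note also that a float has a 24-bit significand, so products of large 16-bit inputs may be rounded in the real instruction as well.

```c
#include <stdint.h>

/* Scalar model of `dpps dst, src, imm8`.
   Bits 7..4 of imm8 select which element products are summed;
   bits 3..0 select which elements of dst receive the sum (others become 0). */
void dpps_model(float dst[4], const float src[4], uint8_t imm8) {
    float sum = 0.0f;
    for (int i = 0; i < 4; i++) {
        if (imm8 & (0x10u << i)) {
            sum += dst[i] * src[i];
        }
    }
    for (int i = 0; i < 4; i++) {
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
}
```

With imm8 = 0x71 (binary 01110001), the products of elements 0 to 2 are summed, element 3 is ignored, and only element 0 of the destination receives the result.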

This may actually be slower than the scalar imul code, especially on CPUs that can do two loads per clock and have fast integer multiply (e.g. Intel SnB-family has imul r32, r32 latency of 3 cycles, with 1 per cycle throughput). The scalar version has lots of instruction-level parallelism: the loads and multiplies are independent, only the adds to combine the results are dependent on each other.

DPPS is slow (4 uops and 13c latency on Skylake, but still one per 1.5c throughput).

Integer SIMD dot product (requires only SSE2):

;; SSE2
movq       xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pslldq     xmm0, 2                ; shift the unwanted garbage out into the next element.  [ 0 x y z   garbage 0 0 0 ]
movq       xmm1, qword ptr [B_X]  ; [ x y z garbage  0 0 0 0 ]
pslldq     xmm1, 2
;; The low 64 bits of xmm0 and xmm1 hold the xyz vectors, with a zero element

pmaddwd    xmm0, xmm1               ; vertical 16b*16b => 32b multiply,  and horizontal add of pairs.  [ 0*0+ax*bx   ay*by+az*bz   garbage  garbage ]

pshufd     xmm1, xmm0, 0b00010001   ; swap the low two 32-bit elements, so ay*by+az*bz is at the bottom of xmm1
paddd      xmm0, xmm1

movd       eax, xmm0
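
A scalar C model may help in following the data movement. The function names are invented for this sketch, the garbage element is set to zero for simplicity, and the adds are done in unsigned arithmetic so they wrap as the SIMD instructions do.

```c
#include <stdint.h>

/* Model of PMADDWD: multiply eight pairs of signed 16-bit elements and
   add adjacent products, giving four 32-bit results. */
void pmaddwd_model(int32_t out[4], const int16_t a[8], const int16_t b[8]) {
    for (int i = 0; i < 4; i++) {
        uint32_t lo = (uint32_t)((int32_t)a[2 * i] * (int32_t)b[2 * i]);
        uint32_t hi = (uint32_t)((int32_t)a[2 * i + 1] * (int32_t)b[2 * i + 1]);
        out[i] = (int32_t)(lo + hi);
    }
}

/* Model of the whole SSE2 sequence for 3-element vectors. */
int32_t dot3_pmaddwd_model(const int16_t a[3], const int16_t b[3]) {
    /* Layout after MOVQ + PSLLDQ 2: [0, x, y, z, garbage, 0, 0, 0],
       with the garbage element taken as 0 in this model. */
    int16_t va[8] = {0, a[0], a[1], a[2], 0, 0, 0, 0};
    int16_t vb[8] = {0, b[0], b[1], b[2], 0, 0, 0, 0};
    int32_t prod[4];
    pmaddwd_model(prod, va, vb);  /* [ax*bx, ay*by + az*bz, ...] */
    /* PSHUFD + PADDD + MOVD: add the two low elements and take the result. */
    return (int32_t)((uint32_t)prod[0] + (uint32_t)prod[1]);
}
```

For a = (1, 2, 3) and b = (4, 5, 6), the two low elements after the multiply-add are 4 and 28, and their sum is 32.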

If you could guarantee that the 2 bytes after A_Z and after B_Z were zero, you could leave out the PSLLDQ byte-shift instructions.

If you don't have to shift a word of garbage out of the low 64, you could usefully do it in an MMX register instead of needing a MOVQ load to get 64 bits zero-extended into a 128-bit register. Then you could PMADDWD with a memory operand. But then you need EMMS. Also, MMX is obsolete, and Skylake has lower throughput for pmaddwd mm, mm than for pmaddwd xmm,xmm (or 256b ymm).

Everything here is one-cycle latency on recent Intel, except 5 cycles for PMADDWD. (MOVD is 2 cycles, but you could store directly to memory. The loads obviously have latency too, but they're from fixed addresses so there's no input dependency.)
