"invalid instruction operands" on mov ah, word_variable, and using imul on 16-bit numbers
Question
Here is what I am trying to achieve:
a_x*b_x + a_y*b_y + a_z*b_z
I am trying to make a MACRO in assembly that does the above computation.
I am using WORDs for all of my numbers. Here is my code:
dotProduct MACRO A_X,A_Y,A_Z,B_X,B_Y,B_Z ;a.b (a dot b) = a_x*b_x + a_y*b_y + a_z*b_z
mov ah, A_X
mov al, B_X
imul ax
mov answer, ax
mov ah, A_Y
mov al, B_Y
imul ax
add answer, ax
mov ah, A_Z
mov al, B_Z
imul ax
mov answer, ax
output answer
ENDM
answer BYTE 40 DUP (0)
But I am getting the following errors:
Assembling: plane_line.asm
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(1): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(2): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(4): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(5): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(6): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(8): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(9): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(10): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(12): Macro Called From
plane_line.asm(101): Main Line Code
I believe it has to do with the way I am handling the registers.
How should I go about this?
Recommended answer
Both operands of MOV have to be the same size. AL and AH are byte registers.
MASM-style assemblers infer the size of memory locations from the DW you used after the symbol name. This is why it complains about an operand-size mismatch (with a generic unhelpful error message that also applies to a lot of other problems).
If you actually wanted to load the first byte of A_X into AL, you'd use an override: mov al, BYTE PTR A_X.
But that's not what you want, since you do actually want to load 16-bit numbers. The product of two 16-bit numbers can be up to 32 bits (e.g. 0xffff^2 is 0xfffe0001). So it's probably a good idea to just do 32-bit math.
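As a rough illustration of why the 16-bit result is not enough, here is a small C sketch of the same arithmetic. The function names mul_u16_full and mul_u16_low are made up for this example and are not part of the question's code.

```c
#include <stdint.h>

/* Full product of two unsigned 16-bit values; this always fits in 32 bits. */
uint32_t mul_u16_full(uint16_t a, uint16_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* The same product truncated to its low 16 bits,
   i.e. what is left if the high half is thrown away. */
uint16_t mul_u16_low(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * (uint32_t)b);
}
```

With both inputs equal to 0xffff, mul_u16_full returns 0xfffe0001 while mul_u16_low keeps only 0x0001.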
You're also using imul incorrectly: imul ax sets DX:AX = AX * AX (producing a 32-bit result in a pair of registers). To multiply AH * AL and get the result in AX, you should have used imul ah. See the instruction-set reference manual entry for IMUL. Also see other links to docs and guides in the x86 tag wiki.
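To make the two one-operand forms concrete, here is a hedged C model of what they compute. The names dx_ax, imul16_model and imul8_model are invented for this sketch; it models only the result registers, not the flags.

```c
#include <stdint.h>

/* Register pair written by the 16-bit one-operand form. */
typedef struct { uint16_t dx, ax; } dx_ax;

/* Model of one-operand `imul r/m16`:
   signed AX * src, with the 32-bit result split across DX (high) and AX (low). */
dx_ax imul16_model(int16_t ax, int16_t src)
{
    int32_t full = (int32_t)ax * (int32_t)src;
    dx_ax r;
    r.ax = (uint16_t)((uint32_t)full & 0xffffu);
    r.dx = (uint16_t)((uint32_t)full >> 16);
    return r;
}

/* Model of one-operand `imul r/m8`:
   signed AL * src, with the 16-bit result in AX. */
int16_t imul8_model(int8_t al, int8_t src)
{
    return (int16_t)((int16_t)al * (int16_t)src);
}
```

So imul ax corresponds to imul16_model(ax, ax), which squares AX and overwrites DX as well, while imul ah corresponds to imul8_model(al, ah).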
The two-operand form of IMUL is easier to use. It works exactly like ADD, with a destination and a source, producing one result. (It doesn't store the high half of the full-multiply result anywhere, but that's fine for this use-case).
To set up for a 32-bit IMUL, use MOVSX to sign-extend from DW 16-bit memory locations into 32-bit registers.
Anyway, here's what you should do:
movsx eax, A_X ; sign-extend A_X into a 32-bit register
movsx ecx, B_X ; sign-extend B_X into a different register
imul eax, ecx ; eax = A_X * B_X (as a 32-bit signed integer)
movsx edx, A_Y
movsx ecx, B_Y
imul edx, ecx ; edx = A_Y * B_Y (signed int)
add eax, edx ; add to the previous result in eax.
movsx edx, A_Z
movsx ecx, B_Z
imul edx, ecx ; edx = A_Z * B_Z (signed int)
add eax, edx ; add to the previous result in eax
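For comparison, the sequence above corresponds roughly to the following C function. This is only a sketch: the name dot3_i16 and its parameter names are assumptions for illustration. The additions are done in unsigned arithmetic so that they wrap like the ADD instruction instead of hitting C's signed-overflow undefined behaviour.

```c
#include <stdint.h>

/* 3-element dot product of signed 16-bit inputs with a 32-bit result. */
int32_t dot3_i16(int16_t ax, int16_t ay, int16_t az,
                 int16_t bx, int16_t by, int16_t bz)
{
    /* Each cast plays the role of MOVSX: widen to 32 bits before multiplying. */
    int32_t x = (int32_t)ax * (int32_t)bx;
    int32_t y = (int32_t)ay * (int32_t)by;
    int32_t z = (int32_t)az * (int32_t)bz;

    /* Add with 32-bit wraparound, like the two ADD instructions.
       Converting back to int32_t is implementation-defined in C when the
       value is out of range, but wraps on common compilers. */
    uint32_t sum = (uint32_t)x + (uint32_t)y + (uint32_t)z;
    return (int32_t)sum;
}
```

For example, dot3_i16(1, 2, 3, 4, 5, 6) returns 32.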
I'm not sure how your "output" function / macro is supposed to work, but storing the integer into an array of bytes BYTE 40 DUP (0) seems unlikely. You could do it with mov dword ptr [answer], eax, but maybe you should just output eax. Or if output answer converts eax to a string stored in answer, then you don't need the mov first.
I'm assuming your numbers are signed 16-bit to start with. This means that your dot-product can overflow if all the inputs are INT16_MIN (i.e. -32768 = 0x8000). 0x8000^2 = 0x40000000, which is more than half of INT32_MAX, so the sum of three such products (0xC0000000) doesn't fit in a signed 32-bit integer. So 32-bit ADDs aren't quite safe, but I assume you're ok with that and don't want to add-with-carry.
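The worst case can be checked with a wider accumulator. The following C sketch uses hypothetical helper names (dot3_i16_wide, dot3_fits_i32) and 64-bit arithmetic purely to show when the true result leaves the signed 32-bit range.

```c
#include <stdint.h>

/* Exact dot product, accumulated in 64 bits so it cannot overflow. */
int64_t dot3_i16_wide(int16_t ax, int16_t ay, int16_t az,
                      int16_t bx, int16_t by, int16_t bz)
{
    return (int64_t)ax * bx + (int64_t)ay * by + (int64_t)az * bz;
}

/* Returns 1 if the exact result fits in a signed 32-bit integer, else 0. */
int dot3_fits_i32(int16_t ax, int16_t ay, int16_t az,
                  int16_t bx, int16_t by, int16_t bz)
{
    int64_t exact = dot3_i16_wide(ax, ay, az, bx, by, bz);
    return exact >= INT32_MIN && exact <= INT32_MAX;
}
```

With all six inputs set to -32768 the exact result is 3221225472 (0xC0000000), which is above INT32_MAX, so dot3_fits_i32 returns 0.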
Another way: We could use 16-bit IMUL instructions, so we can use it with a memory operand instead of having to separately load with sign-extension. This is a lot less convenient if you do want the full 32-bit result, though, so I'll just illustrate using the low half only.
mov ax, A_X
imul B_X ; DX:AX = ax * B_X
mov cx, ax ; save the low half of the result somewhere else so we can do another imul B_Y and add cx, ax
;or
mov cx, A_X
imul cx, B_X ; result in cx
Stop reading here, the rest of this is not useful for beginners.
The fun way: SSE4.1 has a SIMD horizontal dot-product instruction.
; Assuming A_X, A_Y, and A_Z are stored contiguously, and same for B_XYZ
pmovsxwd xmm0, qword ptr [A_X] ; also gets Y and Z, and a high element of garbage
pmovsxwd xmm1, qword ptr [B_X] ; sign-extend from 16-bit elements to 32
cvtdq2ps xmm0, xmm0 ; convert in-place from signed int32 to float
cvtdq2ps xmm1, xmm1
dpps xmm0, xmm1, 0b01110001 ; top 4 bits: sum the first 3 elements, ignore the top one. Low 4 bits: put the result only in the low element
cvtss2si eax, xmm0 ; convert back to signed 32-bit integer
; eax = dot product = a_x*b_x + a_y*b_y + a_z*b_z.
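The immediate operand of DPPS is easy to misread, so here is a small C model of how its two nibbles are used. The function name dpps_model is invented for this sketch, and it models only the arithmetic, not exceptions or rounding-mode details.

```c
#include <stdint.h>

/* Model of `dpps dst, src, imm8` on four-float vectors.
   High nibble of imm8: which element products take part in the sum.
   Low nibble of imm8: which output elements receive the sum (others get 0). */
void dpps_model(const float a[4], const float b[4], uint8_t imm8, float out[4])
{
    float prod[4];
    for (int i = 0; i < 4; i++)
        prod[i] = (imm8 & (1u << (4 + i))) ? a[i] * b[i] : 0.0f;

    /* Pairwise summation of the selected products. */
    float sum = (prod[0] + prod[1]) + (prod[2] + prod[3]);

    for (int i = 0; i < 4; i++)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With imm8 = 0x71 (binary 01110001), only elements 0 to 2 are multiplied and summed, and the result lands in element 0 alone, so whatever garbage was loaded into the top element is ignored.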
This may actually be slower than the scalar imul code, especially on CPUs that can do two loads per clock and have fast integer multiply (e.g. Intel SnB-family has imul r32, r32 latency of 3 cycles, with 1 per cycle throughput). The scalar version has lots of instruction-level parallelism: the loads and multiplies are independent, only the adds to combine the results are dependent on each other.
DPPS is slow (4 uops and 13c latency on Skylake, but still one per 1.5c throughput).
Integer SIMD dot-product (only requiring SSE2):
;; SSE2
movq xmm0, qword ptr [A_X] ; also gets Y and Z, and a high element of garbage
pslldq xmm0, 2 ; shift the unwanted garbage out into the next element. [ 0 x y z garbage 0 0 0 ]
movq xmm1, qword ptr [B_X] ; [ x y z garbage 0 0 0 0 ]
pslldq xmm1, 2
;; The low 64 bits of xmm0 and xmm1 hold the xyz vectors, with a zero element
pmaddwd xmm0, xmm1 ; vertical 16b*16b => 32b multiply, and horizontal add of pairs. [ 0*0+ax*bx ay*by+az*bz garbage garbage ]
pshufd xmm1, xmm0, 0b00010001 ; swap the low two 32-bit elements, so ay*by+az*bz is at the bottom of xmm1
paddd xmm0, xmm1
movd eax, xmm0
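The lane bookkeeping in that sequence can be followed with a portable scalar model. This C sketch deliberately avoids intrinsics; the names wrap_add32, pmaddwd_low_model and dot3_pmaddwd_model are assumptions for illustration, and only the low four 16-bit lanes are modelled.

```c
#include <stdint.h>

/* 32-bit addition with wraparound, like the SIMD integer adds. */
static int32_t wrap_add32(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Model of PMADDWD on the low four 16-bit lanes:
   multiply matching lanes, then add each adjacent pair of 32-bit products. */
void pmaddwd_low_model(const int16_t a[4], const int16_t b[4], int32_t out[2])
{
    out[0] = wrap_add32((int32_t)a[0] * b[0], (int32_t)a[1] * b[1]);
    out[1] = wrap_add32((int32_t)a[2] * b[2], (int32_t)a[3] * b[3]);
}

/* Dot product of two xyz vectors using the same lane layout as the asm. */
int32_t dot3_pmaddwd_model(const int16_t a_xyz[3], const int16_t b_xyz[3])
{
    /* Layout after the 2-byte PSLLDQ: [ 0 x y z ] */
    int16_t a[4] = { 0, a_xyz[0], a_xyz[1], a_xyz[2] };
    int16_t b[4] = { 0, b_xyz[0], b_xyz[1], b_xyz[2] };
    int32_t pair[2];

    pmaddwd_low_model(a, b, pair);      /* [ 0*0+ax*bx, ay*by+az*bz ] */
    return wrap_add32(pair[0], pair[1]); /* the PSHUFD + PADDD step */
}
```

For a = (1, 2, 3) and b = (4, 5, 6) the two pair sums are 4 and 28, and the function returns 32.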
If you could guarantee that the 2 bytes after A_Z and after B_Z were zero, you could leave out the PSLLDQ byte-shift instructions.
If you don't have to shift a word of garbage out of the low 64 bits, you could usefully do it in an MMX register instead of needing a MOVQ load to get 64 bits zero-extended into a 128-bit register. Then you could PMADDWD with a memory operand. But then you need EMMS. Also, MMX is obsolete, and Skylake has lower throughput for pmaddwd mm, mm than for pmaddwd xmm, xmm (or 256b ymm).
Everything here is one-cycle latency on recent Intel, except 5 cycles for PMADDWD. (MOVD is 2 cycles, but you could store directly to memory. The loads obviously have latency too, but they're from fixed addresses so there's no input dependency.)