Write x86 asm functions portably (win/linux/osx), without a build-depend on yasm/nasm?


Question

par2 has a small and fairly clean C++ codebase, which I think builds fine on GNU/Linux, OS X, and Windows (with MSVC++).

I'd like to incorporate an x86-64 asm version of the one function that takes nearly all the CPU time. (mailing list posts with more details. My implementation/benchmark here.)

Intrinsics would be the obvious solution, but gcc doesn't generate good enough code for getting one byte at a time from a 64-bit register for use as an index into a LUT. I might also take the time to schedule instructions so each uop cache line holds a multiple of 4 uops, since uop throughput is the bottleneck even when the input/output buffer is a decent size.

I'd prefer not to introduce a build-dependency on yasm, since many people have gcc installed, but not yasm.

Is there a way to write a function in asm in a separate file that gcc / clang and MSVC can assemble? The goals are:

  • no extra software as a build-dep. (no YASM).
  • only one version of each asm function. (no maintaining MASM & AT&T versions of the same code.)

Par2cmdline's build system is autoconf/automake for Unix, and an MSVC .sln for Windows.

I know the GNU assembler has a .intel_syntax noprefix directive, but that only changes instruction formats, not other assembler directives, e.g. .align 16 vs. align 16. My code is fairly simple and small, so it would be OK to work around the different directives with C-preprocessor #defines, if that can work.

I'm assuming that doing CPU-detection and setting a function pointer based on the result shouldn't be a problem in C++, even if I have to use some #ifdef conditional compilation for that.
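A minimal sketch of that CPU-detection idea, under stated assumptions: the routine signature process_fn is hypothetical, and SSSE3 is only an assumed requirement (the question does not say which extension the asm version needs). The names process_generic, process_ssse3, cpu_has_ssse3 and select_process are illustrative, and process_ssse3 is a C++ stand-in marking where the asm symbol would go.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#endif

// Hypothetical signature for the hot routine; the real par2 function differs.
using process_fn = void (*)(std::uint8_t* dst, const std::uint8_t* src, std::size_t len);

// Portable fallback: XOR src into dst one byte at a time.
static void process_generic(std::uint8_t* dst, const std::uint8_t* src, std::size_t len) {
    assert(dst != nullptr && src != nullptr);
    for (std::size_t i = 0; i < len; ++i) dst[i] ^= src[i];
}

// Stand-in for the asm routine. In a real build this would be an
// extern "C" declaration resolved by the assembled object file.
static void process_ssse3(std::uint8_t* dst, const std::uint8_t* src, std::size_t len) {
    process_generic(dst, src, len);
}

// Assumption: the asm version requires SSSE3.
static bool cpu_has_ssse3() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];                        // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[2] & (1 << 9)) != 0;   // CPUID leaf 1, ECX bit 9 = SSSE3
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;                       // non-x86 or unknown compiler: use the fallback
#endif
}

// Pick the implementation once; callers keep the returned pointer.
process_fn select_process() {
    return cpu_has_ssse3() ? process_ssse3 : process_generic;
}
```

The caller would run select_process() once at startup and call through the stored pointer in the hot loop, so the #ifdef stays confined to the detection function.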

If there isn't a solution to what I'm hoping for, I'll probably introduce a build-depend on yasm and have a ./configure --no-asm option to disable asm speedups for people building on x86 without yasm available.

My preferred plan for handling the different calling conventions in the Windows and Linux ABIs was to use __attribute__((sysv_abi)) on the C prototypes for my asm functions. Then I only have to write the function prologue for the SysV ABI. Does MSVC have anything like that, which would put args into regs according to the SysV ABI for certain functions? (BTW, this tickled a compiler bug, so be careful with this idea if you want your code to work with current gcc.)
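A sketch of what such a prototype could look like, assuming a hypothetical function asm_sum_bytes and a hypothetical macro ASM_SYSV_ABI (neither is from par2). The attribute is only applied for GCC/clang on x86-64; as far as I know MSVC has no equivalent, so under MSVC the macro expands to nothing and the function would use the Microsoft x64 convention. The C++ body is a stand-in so the sketch links without an assembler.

```cpp
#include <cstddef>
#include <cstdint>

// Apply the SysV calling convention only where the compiler supports it.
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#define ASM_SYSV_ABI __attribute__((sysv_abi))
#else
#define ASM_SYSV_ABI   /* MSVC or non-x86-64: default convention */
#endif

// Prototype the C++ code would see. With the attribute in effect, callers
// pass buf in rdi and len in rsi, even when targeting Windows with GCC/clang.
extern "C" ASM_SYSV_ABI std::uint64_t asm_sum_bytes(const std::uint8_t* buf, std::size_t len);

// Stand-in definition; a real build would supply this symbol from the asm file.
extern "C" ASM_SYSV_ABI std::uint64_t asm_sum_bytes(const std::uint8_t* buf, std::size_t len) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += buf[i];
    return sum;
}
```

Keeping the attribute behind one macro means the prototypes stay identical on every platform, and only the macro definition has to change if a compiler misbehaves with sysv_abi.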

Answer

While I have no good solution for removing the dependency on a particular assembler, I do have a suggestion on how to deal with the two different 64-bit calling conventions: Microsoft x64 versus the SysV ABI.

The lowest common denominator is the Microsoft x64 calling convention, since it passes only the first four values in registers. So if you limit yourself to four arguments and use macros to define the registers, you can easily make your code build for both Unix (Linux/BSD/OSX) and Windows.

For example, look at the file strcat64.asm in Agner Fog's asmlib:

%IFDEF  WINDOWS
%define Rpar1   rcx                    ; function parameter 1
%define Rpar2   rdx                    ; function parameter 2
%define Rpar3   r8                     ; function parameter 3
%ENDIF
%IFDEF  UNIX
%define Rpar1   rdi                    ; function parameter 1
%define Rpar2   rsi                    ; function parameter 2
%define Rpar3   rdx                    ; function parameter 3
%ENDIF

        push    Rpar1                  ; dest
        push    Rpar2                  ; src
        call    A_strlen               ; length of dest
        push    rax                    ; strlen(dest)
        mov     Rpar1, [rsp+8]         ; src
        call    A_strlen               ; length of src
        pop     Rpar1                  ; strlen(dest)
        pop     Rpar2                  ; src
        add     Rpar1, [rsp]           ; dest + strlen(dest)
        lea     Rpar3, [rax+1]         ; strlen(src)+1
        call    A_memcpy               ; copy
        pop     rax                    ; return dest
        ret

;A_strcat ENDP

I don't think four registers is really a limitation. If you're writing something in assembly, it's because you want the best efficiency, in which case the function-call overhead should be negligible compared to the work done in the function itself. So pushing or popping a few values to or from the stack when calling the function, if you need to, should not make a difference in performance.
