Test case for adcx and adox


Question

I'm testing the Intel ADX add-with-carry and add-with-overflow instructions to pipeline additions on large integers. I'd like to see what the expected code generation should look like. Based on the question "_addcarry_u64 and _addcarryx_u64 with MSVC and ICC", I thought this would be a suitable test case:

#include <stdint.h>
#include <x86intrin.h>
#include "immintrin.h"

int main(int argc, char* argv[])
{
    #define MAX_ARRAY 100
    uint8_t c1 = 0, c2 = 0;
    uint64_t a[MAX_ARRAY]={0}, b[MAX_ARRAY]={0}, res[MAX_ARRAY];
    for(unsigned int i=0; i< MAX_ARRAY; i++){ 
        c1 = _addcarryx_u64(c1, res[i], a[i], (unsigned long long int*)&res[i]);
        c2 = _addcarryx_u64(c2, res[i], b[i], (unsigned long long int*)&res[i]);
    }
    return 0;
}

When I examine the code GCC 6.1 generates with -O3 and -madx, it shows serialized adc instructions. -O1 and -O2 produce similar results:

main:
        subq    $688, %rsp
        xorl    %edi, %edi
        xorl    %esi, %esi
        leaq    -120(%rsp), %rdx
        xorl    %ecx, %ecx
        leaq    680(%rsp), %r8
.L2:
        movq    (%rdx), %rax
        addb    $-1, %sil
        adcq    %rcx, %rax
        setc    %sil
        addb    $-1, %dil
        adcq    %rcx, %rax
        setc    %dil
        movq    %rax, (%rdx)
        addq    $8, %rdx
        cmpq    %r8, %rdx
        jne     .L2
        xorl    %eax, %eax
        addq    $688, %rsp
        ret
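
To make the loop body above easier to read, here is a minimal C sketch of what each `addb $-1` / `adcq` / `setc` triple appears to do: the carry is kept in a byte register between additions, reloaded into CF by adding 0xFF to that byte, consumed by `adcq`, and written back with `setc`. The function names (`carry_from_saved_byte`, `adc_then_setc`, `chained_step`) are hypothetical and exist only for this illustration; this models the listing, it is not compiler output.

```c
#include <stdint.h>

/* Models `addb $-1, %sil`: adding 0xFF to the saved byte carries out of
   bit 7 exactly when the byte is nonzero, so CF is reloaded from it. */
static unsigned carry_from_saved_byte(uint8_t saved)
{
    return ((unsigned)saved + 0xFFu) >> 8;
}

/* Models `adcq src, dst` followed by `setc`: dst += src + cf, and the
   return value is the new carry flag as a byte (0 or 1). */
static uint8_t adc_then_setc(unsigned cf, uint64_t src, uint64_t *dst)
{
    uint64_t partial = *dst + src;      /* wraps modulo 2^64 */
    uint64_t total = partial + cf;      /* cf is 0 or 1 */
    uint8_t carry = (uint8_t)((partial < src) | (total < partial));
    *dst = total;
    return carry;
}

/* One restore/add/save round trip; the listing does this twice per
   loop iteration, once for each saved carry byte. */
static uint8_t chained_step(uint8_t saved, uint64_t src, uint64_t *dst)
{
    return adc_then_setc(carry_from_saved_byte(saved), src, dst);
}
```

Because both triples go through the same CF flag, each one has to save and restore its carry around the other, which is why the two chains run back to back here rather than being interleaved.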

So I'm guessing the test case is not quite hitting the mark, or I am doing something wrong, or I am using something incorrectly.

If I am parsing Intel's docs on _addcarryx_u64 correctly, I believe the C code should generate the pipeline. So I'm guessing I am doing something wrong:

Description

Add unsigned 64-bit integers a and b with unsigned 8-bit carry-in c_in (carry or overflow flag), and store the unsigned 64-bit result in out, and the carry-out in dst (carry or overflow flag).
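
As a rough illustration of the contract described above, here is a portable C model of the add-with-carry step, plus a loop that chains it across the limbs of a large integer. `model_addcarry_u64` and `model_add_limbs` are hypothetical names for this sketch; they are not the intrinsic and say nothing about which instruction a compiler selects. Treating any nonzero `c_in` as a carry of 1 is an assumption made here, not something taken from Intel's text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable model of the documented contract:
   *out = a + b + carry-in (mod 2^64), return value = carry-out.
   Assumption: any nonzero c_in counts as a carry of 1. */
static uint8_t model_addcarry_u64(uint8_t c_in, uint64_t a, uint64_t b,
                                  uint64_t *out)
{
    uint64_t partial = a + b;                 /* wraps modulo 2^64 */
    uint64_t total = partial + (c_in != 0);
    *out = total;
    /* At most one of the two additions can wrap. */
    return (uint8_t)((partial < a) | (total < partial));
}

/* Adds two n-limb integers (least significant limb first) along a
   single carry chain and returns the final carry-out. */
static uint8_t model_add_limbs(uint64_t *res, const uint64_t *a,
                               const uint64_t *b, size_t n)
{
    uint8_t carry = 0;
    for (size_t i = 0; i < n; i++)
        carry = model_addcarry_u64(carry, a[i], b[i], &res[i]);
    return carry;
}
```

For example, adding the two-limb values {UINT64_MAX, 1} and {1, 2} with `model_add_limbs` carries out of the low limb into the high one, giving {0, 4} with a final carry of 0.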

How can I generate the pipelined add with carry/add with overflow (adcx/adox)?


I've actually got a 5th generation Core i7 ready for testing (notice the adx cpu flag):

$ cat /proc/cpuinfo | grep adx
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni
pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt
...

Answer

This does look like a good test-case. It assembles to correct working code, right? It's useful for a compiler to support the intrinsic in that sense, even if it doesn't yet support making optimal code. It lets people start using the intrinsic. This is necessary for compatibility.

Next year or whenever the compiler's backend support for adcx/adox is done, the same code will compile to faster binaries with no source modification.

I assume that's what's going on for gcc.


clang 3.8.1's implementation is more literal, but it ends up doing a terrible job: flag-saving with sahf and push/pop of eax. See it on Godbolt.

I think there's even a bug in the asm source output, since mov eax, ch won't assemble. (Unlike gcc, clang/LLVM uses a built-in assembler and doesn't actually go through a text representation of asm on the way from LLVM IR to machine code.) The disassembly of the machine code shows mov eax, ebp there. I think that's also a bug, because bpl (or the rest of the register) doesn't have a useful value at that point. Probably it wanted mov al, ch or movzx eax, ch.
