在两个ASM GCC内联块之间传播进位 [英] propagate carry bit between two ASM GCC inline block

查看:42
本文介绍了在两个ASM GCC内联块之间传播进位的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

亲爱的Assembly/C ++开发人员,

Dear Assembly/C++ dev,

问题是:即使在两个ASM块之间传播进位(或任何标志)是否现实或完全疯狂?

The question is: Does propagate the carry (or any flag) between two ASM block is realistic or totally insane, even if it works ?

几年前,我开发了一个整数库,用于小于512位(在编译时)的大型算术运算.我目前未使用GMP,因为对于这种规模,由于内存分配和模型为二进制表示形式选择工作台.

A few years ago I developed an integer library for large arithmetic lower than 512 bits (at compile time). I did not use GMP at this time because for this scale, GMP become slower due to the memory allocation and the model choose for the binary representation bench.

我必须承认我使用 BOOST_PP 创建了我的ASM(字符串块),它不是非常光荣(出于好奇,请看一下

I must confess I created my ASM ( the string block) using BOOST_PP, it is not very glorious (for curious have a look on it vli). The library was working well.

但是,我这次说的是,不可能在两个ASM内联块之间传播状态寄存器的进位标志.这是合乎逻辑的,因为对于编译器在两个块之间生成的任何助记符,都会重置寄存器( mov 指令除外(据我的汇编知识)).

However I remark at this time it was impossible to propagate the carry flag of the state register between two ASM inline block. This is logical because for any mnemonic generated by the compiler between two blocks, the register is reset ( except for the mov instruction (from my assembly knowledge)).

昨天,我有了一个在两个ASM块之间传播进位的想法,这有点棘手(使用递归算法).它正在工作,但我认为我很幸运.

Yesterday I get an idea to propagate the carry between two ASM block a bit tricky (using a recursive algo). It is working but I think I am lucky.

#include <iostream>
#include <array>
#include <cassert>
#include <algorithm>

//forward declaration
template<std::size_t NumBits>
struct integer;


//helper using object function, partial specialization  is forbiden on functions
template <std::size_t NumBits, std::size_t W, bool K = W == integer<NumBits>::numwords>
struct helper {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        helper<NumBits, integer<NumBits>::numwords>::add(a,b);
    }
};

// first addition (call first)
template<std::size_t NumBits, std::size_t W>
struct helper<NumBits, W, 1> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        __asm__ (
                              "movq %1, %%rax \n"
                              "addq %%rax, %0 \n"
                              : "+m"(a[0]) // output
                              : "m" (b[0]) // input only
                              : "rax", "cc", "memory");
        helper<NumBits,W-1>::add(a,b);
    }
};

//second and more propagate the carry (call next)
template<std::size_t NumBits, std::size_t W>
struct helper<NumBits, W, 0> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){
        __asm__ (
                              "movq %1, %%rax \n"
                              "adcq %%rax, %0 \n"
                              : "+m"(a[integer<NumBits>::numwords-W])
                              : "m" (b[integer<NumBits>::numwords-W])
                              : "rax", "cc", "memory");
        helper<NumBits,W-1>::add(a,b);
    }
};

//nothing end reccursive process (call last)
template<std::size_t NumBits>
struct helper<NumBits, 0, 0> {
    static inline void add(integer<NumBits> &a, const integer<NumBits> &b){};
};

// tiny integer class
template<std::size_t NumBits>
struct integer{
    typedef uint64_t      value_type;
    static const std::size_t numbits = NumBits;
    static const std::size_t numwords = (NumBits+std::numeric_limits<value_type>::digits-1)/std::numeric_limits<value_type>::digits;
    using container = std::array<uint64_t, numwords>;

    typedef typename container::iterator             iterator;

    iterator begin() { return data_.begin();}
    iterator end() { return data_.end();}

    explicit integer(value_type num = value_type()){
        assert( -1l >> 1 == -1l );
        std::fill(begin(),end(),value_type());
        data_[0] = num;
    }

    inline value_type& operator[](std::size_t n){ return data_[n];}
    inline const value_type& operator[](std::size_t n) const { return data_[n];}

    integer& operator+=(const integer& a){
        helper<numbits,numwords>::add(*this,a);
        return *this;
    }

    integer& operator~(){
        std::transform(begin(),end(),begin(),std::bit_not<value_type>());
        return *this;
    }

    void print_raw(std::ostream& os) const{
        os << "(" ;
        for(std::size_t i = numwords-1; i > 0; --i)
            os << data_[i]<<" ";
        os << data_[0];
        os << ")";
    }

    void print(std::ostream& os) const{
        assert(false && " TO DO ! \n");
    }

private:
    container data_;
};

template <std::size_t NumBits>
std::ostream& operator<< (std::ostream& os, integer<NumBits> const& i){
    if(os.flags() & std::ios_base::hex)
        i.print_raw(os);
    else
        i.print(os);
    return os;
}

int main(int argc, const char * argv[]) {
    integer<256> a; // 0
    integer<256> b(1);

    ~a; //all the 0 become 1

    std::cout << " a: " << std::hex << a << std::endl;
    std::cout << " ref: (ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff) " <<  std::endl;

    a += b; // should propagate the carry

    std::cout << " a+=b: " << a << std::endl;
    std::cout << " ref: (0 0 0 0) " <<  std::endl; // it works but ...

    return 0;
}

我得到正确的结果(必须在-O2或-O3版本中编译!)并且ASM是正确的(在使用clang ++的Mac上:Apple LLVM版本9.0.0(clang-900.0.39.2))

I get a correct result (it must be compiled in release -O2 or -O3!) and the ASM is right (on my Mac with clang++: Apple LLVM version 9.0.0 (clang-900.0.39.2))

    movq    -96(%rbp), %rax
    addq    %rax, -64(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -88(%rbp), %rax
    adcq    %rax, -56(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -80(%rbp), %rax
    adcq    %rax, -48(%rbp)

    ## InlineAsm End
    ## InlineAsm Start
    movq    -72(%rbp), %rax
    adcq    %rax, -40(%rbp)

我相信它是有效的,因为在优化过程中,编译器会删除ASM块之间的所有无用指令(在调试模式下会失败).

I am conscient it is working because during the optimization the compiler remove all useless instruction between the ASM block (in debug mode it failed).

您怎么看?绝对不安全?编译器人员是否知道它将稳定多少?

What do you think ? Definitively unsafe ? Does a compiler guy know how much it will be stable ?

总结:我只是为了好玩而已:)是的,GMP是大型算术的解决方案!

To conclude: I am just doing that for fun :) Yes, GMP is the solution for large arithmetic !

推荐答案

使用 __ volatile __ 是一种滥用.

__ volatile __ 的目的是强制编译器在书面位置发出汇编代码,而不是依靠数据流分析来解决这一问题.如果要在用户空间中对数据进行常规操作,通常不应该使用 __ volatile __ ,并且如果需要 __ volatile __ 来使代码正常工作,则几乎总是意味着您的操作数指定不正确.

The purpose of __volatile__ is to force the compiler to emit the assembly code at the written location, rather than relying on data flow analysis to figure this out. If you are doing ordinary manipulations of data in user space, usually you should not be using __volatile__, and if you need __volatile__ to get your code to work it almost always means that your operands are specified incorrectly.

是的,操作数指定不正确.让我们看一下第一个块.

And yes, the operands are specified incorrectly. Let's look at the first block.

__asm__ __volatile__ (
                      "movq %1, %%rax \n"
                      "addq %%rax, %0 \n"
                      : "=m"(a[0]) // output
                      : "m" (b[0]) // input only
                      : "rax", "memory");

这里有两个错误.

  • 对输出"= m"(a [0])的约束不正确.回想一下 addq 的目标既是输入又是输出,因此正确的约束是+,因此请使用"+ m"(a [0]).如果您告诉编译器仅输出 a [0] ,则编译器可能会安排 a [0] 包含垃圾值(通过消除死存储),即不是你想要的.

  • The constraint on the output "=m"(a[0]) is incorrect. Recall that the destination for addq is both an input and an output, so the correct constraint is +, so use "+m"(a[0]). If you tell the compiler that a[0] is output only, the compiler may arrange a[0] to contain a garbage value (through dead store elimination), which is not what you want.

程序集规范中缺少这些标志.在不告知编译器修改了标志的情况下,编译器可能会假设在整个汇编块中保留了这些标志,这将导致编译器在其他地方生成不正确的代码.

The flags are missing from the assembly specification. Without telling the compiler that the flags are modified, the compiler may assume that the flags are preserved across the assembly block, which will cause the compiler to generate incorrect code elsewhere.

不幸的是,这些标志仅可用作汇编块的输出或破坏者操作数,而不能用作输入.因此,毕竟在正确地指定操作数之后大惊小怪,因此您不必使用 __ volatile __ ......事实证明,仍然没有一种好的方法来指定您的操作数!

Unfortunately, the flags are only available as output or clobber operands to assembly blocks, and are not available as inputs. So after all that fuss over specifying operands correctly so you don't use __volatile__... it turns out that there isn't a good way to specify your operands anyway!

因此,这里的建议是至少 修复您可以修复的操作数,并指定"cc" 作为对象.但是有一些更好的解决方案根本不需要 __ volatile __ ...

So the recommendation here is that you should at least fix the operands you can fix, and specify "cc" as a clobber. But there are a couple better solutions that do not require __volatile__ at all...

用于添加的 mpn _ 函数不会分配内存. mpz _ 函数是 mpn _ 函数的包装,带有一些额外的逻辑和内存分配.

The mpn_ functions for addition do not allocate memory. The mpz_ functions are wrappers around the mpn_ functions with some extra logic and memory allocations.

如果将整个循环写在一个汇编块中,则不必担心在块之间保留标志.您可以使用程序集宏来执行此操作.麻烦您,我不是汇编程序员:

If you write the entire loop in one assembly block, you don't have to worry about preserving flags between blocks. You can use assembly macros to do this. Excuse the mess, I'm not much of an assembly programmer:

template <int N>
void add(unsigned long long *dest, unsigned long long *src) {
  __asm__(
      "movq (%1), %%rax"
      "\n\taddq %%rax, (%0)"
      "\n.local add_offset"
      "\n.set add_offset,0"
      "\n.rept %P2" // %P0 means %0 but without the $ in front
      "\n.set add_offset,add_offset+8"
      "\n\tmovq add_offset(%1), %%rax"
      "\n\tadcq %%rax, add_offset(%0)"
      "\n.endr"
      :
      : "r"(dest), "r"(src), "n"(N-1)
      : "cc", "memory", "rax");
}   

这是使用 .rept 汇编指令评估循环的方法.尽管如果您查看带有 -S 的GCC的汇编输出,您最终将获得1个 addq 副本和N-1个 adcq 副本.您只会看到其中一个.汇编程序本身将创建副本,从而消除循环.

What this does is evaluate the loop using the .rept assembly directive. You will ultimately get 1 copy of addq and N-1 copies of adcq, although if you look at the assembly output of GCC with -S you will only see one of each. The assembler itself will create the copies, unwinding the loop.

请参见要点: https://gist.github.com/depp/966fc1f4d535e31d9725cc71d97daf91

这篇关于在两个ASM GCC内联块之间传播进位的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆