large integer addition with CUDA


Problem description


I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large-integer addition. Large integers are represented in the usual way as a bunch of 32-bit words.


For example, we can use one thread to add two 32-bit words. For simplicity, let's assume that the numbers to be added are of the same length, and that the number of threads per block == the number of words. Then:

__global__ void add_kernel(int *C, const int *A, const int *B) {
     int x = A[threadIdx.x];
     int y = B[threadIdx.x];
     int z = x + y;
     int carry = (z < x);
     /** do carry propagation in parallel somehow ? */
     ............

     z = z + newcarry; // update the resulting words after carry propagation
     C[threadIdx.x] = z;
 }


I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure, but I could not figure it out.


I had a look at the CUDA Thrust extensions, but a big-integer package does not seem to be implemented yet. Perhaps someone can give me a hint on how to do this in CUDA?

Answer


You are right: carry propagation can be done via a prefix-sum computation, but it is a bit tricky to define the binary function for this operation and prove that it is associative (which is needed for a parallel prefix sum). As a matter of fact, this algorithm is (in theory) the one used in a carry-lookahead adder.


Suppose we have two large integers a[0..n-1] and b[0..n-1]. Then we compute (i = 0..n-1):

s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);

We define two functions:

generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);


with quite intuitive meanings: generate[i] == 1 means that a carry is generated at position i, while propagate[i] == 1 means that a carry arriving at position i (from position i - 1) will be propagated on to position i + 1. Our goal is to compute the function carryout[0..n-1], which is used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:

carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = generate[0]


Here carryout[i] == 1 if a carry is generated at position i, or if it was generated at some earlier position and propagated up to position i. Finally, we update the resulting sum:

s[i] = s[i] + carryout[i-1];  for i = 1..n-1
carry = carryout[n-1];
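
For reference, here is a minimal sequential sketch of the scheme above (plain host-side C; the helper name add_ref and the exact form are mine, not part of the original answer):

#include <stdint.h>

// Sequential reference: add two n-word numbers using generate/propagate carries.
uint32_t add_ref(uint32_t *s, const uint32_t *a, const uint32_t *b, int n) {
    uint32_t carryout_prev = 0;                  // carryout[i-1]; no carry enters word 0
    for (int i = 0; i < n; i++) {
        uint32_t sum = a[i] + b[i];              // s[i] before carry propagation
        uint32_t generate  = (sum < a[i]);       // carryin[i] in the notation above
        uint32_t propagate = (sum == 0xffffffffu);
        s[i] = sum + carryout_prev;              // s[i] = s[i] + carryout[i-1]
        carryout_prev = generate | (propagate & carryout_prev);  // carryout[i]
    }
    return carryout_prev;                        // carry = carryout[n-1]
}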


Now it is quite straightforward to prove that the carryout function is indeed binary associative, and hence a parallel prefix-sum computation applies. To implement this on CUDA, we can merge the flags 'generate' and 'propagate' into a single variable, since they are mutually exclusive, i.e.:

cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];

In other words,

cy[i] = 0xffffffff  if propagate[i]
cy[i] = 1           if generate[i]
cy[i] = 0           otherwise

Then, one can verify that the following formula, applied for all k < i, computes the prefix sum of the carryout function:

cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];
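
To make the operator explicit, the update above can be written as a standalone combine function (a sketch; the name carry_op is mine, not from the code below). A propagate word (0xffffffff) simply forwards whatever flag has accumulated so far, a generate word (1) forces a 1, and a plain word (0) forces a 0, which is exactly the carryout recurrence:

// Combine the flag accumulated over words [0..i-1] ('prev') with word i's own flag ('cur').
__host__ __device__ unsigned carry_op(unsigned prev, unsigned cur) {
    int m = ((int)cur > (int)prev) ? (int)cur : (int)prev;  // signed max, as in the formula
    return (unsigned)m & cur;
}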

The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:

// add & output carry flag
#define UADDO(c, a, b) \
     asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
     asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {

extern __shared__ unsigned shared[];
unsigned *r = shared; 

const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;

uint4 a = ((const uint4 *)g_A)[thid],
      b = ((const uint4 *)g_B)[thid];

UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out

// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
        49 * (thid / 64)) + ((thid / 32) & 1);

scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
    // this indicates that carry will propagate through the number
    cf = -1u;

// "Hillis-and-Steele-style" reduction 
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;

int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
    postscan[thid >> 5] = cf;

__syncthreads();

if(thid < N_THIDS / 32) {
    volatile int *t = (volatile int *)postscan + thid;
    t[-8] = -1; // load identity symbol
    cf = t[0];
    cf = max((int)cf, (int)t[-1]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-2]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-4]) & cf;
    t[0] = cf;
}
__syncthreads();

cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];

if(thid_in_warp == 0)
    cf = ps;
if((int)cf < 0)
    cf = 0;

UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}
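
The answer does not show the launch code. Below is a rough host-side sketch of how the kernel above might be invoked for one 2048-word addition (one block of 512 threads, each thread loading one uint4, i.e. 4 words). The shared-memory size is my own generous over-estimate of the scan/postscan layout used inside the kernel, not a figure from the original:

#include <cuda_runtime.h>

int main() {
    const int n_words = 2048;                       // 2048 x 32-bit words per operand
    const size_t bytes = n_words * sizeof(unsigned);

    unsigned *d_A, *d_B, *d_R;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_R, bytes);
    // ... copy the two operands into d_A and d_B (least-significant word first) ...

    const size_t shmem = 8192;                      // generous upper bound for the shared scan buffers
    bignum_add<<<1, 512, shmem>>>(d_R, d_A, d_B);   // 512 threads * 4 words each = 2048 words
    cudaDeviceSynchronize();

    // ... copy d_R back; note the kernel does not return the top-level carry out ...
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_R);
    return 0;
}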


Note that macros UADDO / UADDC might not be necessary anymore since CUDA 4.0 has corresponding intrinsics (however I am not entirely sure).


Also note that, though the parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use some redundant representation (as was suggested in the comments above), i.e., first accumulate the results of the additions in 64-bit words, and then perform one carry propagation at the very end, in a single sweep.
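
A rough sketch of that redundant-representation idea, in plain C on the host for clarity (the function names are mine; on the GPU the accumulation loop would be one thread per word):

#include <stdint.h>

// Fold another operand into 64-bit accumulators; no carries are propagated here,
// so roughly 2^32 additions can be absorbed before the high halves could overflow.
void accumulate(uint64_t *acc, const uint32_t *x, int n) {
    for (int i = 0; i < n; i++)
        acc[i] += x[i];
}

// One final sweep converts the redundant form back into ordinary 32-bit words.
uint32_t resolve(uint32_t *out, const uint64_t *acc, int n) {
    uint64_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t t = acc[i] + carry;
        out[i] = (uint32_t)t;          // low 32 bits become the word
        carry  = t >> 32;              // everything above propagates to the next word
    }
    return (uint32_t)carry;            // carry out of the most significant word
}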

