Large integer addition with CUDA


Question

I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large integer addition. Large integers are represented in the usual way as an array of 32-bit words.

For example, we can use one thread to add two 32-bit words. For simplicity, assume that the numbers to be added have the same length and that the number of threads per block equals the number of words. Then:

__global__ void add_kernel(unsigned *C, const unsigned *A, const unsigned *B) {
    unsigned x = A[threadIdx.x];
    unsigned y = B[threadIdx.x];
    unsigned z = x + y;
    unsigned carry = (z < x);   // unsigned wrap-around means a carry was generated
    /** do carry propagation in parallel somehow ? */
    ............

    z = z + newcarry; // update the resulting words after carry propagation
    C[threadIdx.x] = z;
}

I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure, but I could not figure it out.

I had a look at the CUDA Thrust extensions, but a big-integer package does not seem to be implemented yet. Perhaps someone can give me a hint on how to do this in CUDA?

Answer

You are right: carry propagation can be done via a prefix-sum computation, but it is a bit tricky to define the binary function for this operation and prove that it is associative (which is needed for a parallel prefix sum). As a matter of fact, this algorithm is used (theoretically) in the carry-lookahead adder.

Suppose we have two large integers a[0..n-1] and b[0..n-1]. Then we compute (for i = 0..n-1):

s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);

We define two functions:

generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);

with quite intuitive meanings: generate[i] == 1 means that a carry is generated at position i, while propagate[i] == 1 means that a carry will be propagated from position (i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1], which is used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:

carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = 0

Here carryout[i] == 1 if a carry is generated at position i, OR if it is generated somewhere earlier AND propagated up to position i. Finally, we update the resulting sum:

s[i] = s[i] + carryout[i-1];  for i = 1..n-1
carry = carryout[n-1];
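
As a sanity check, here is a minimal sequential, host-side sketch of the scheme above (the helper name add_reference is mine, not part of the original answer); it mirrors the generate/propagate/carryout definitions word for word and is handy for verifying a parallel implementation:

#include <stdint.h>
#include <stddef.h>

// Returns the final carry out of the n-word addition s = a + b.
uint32_t add_reference(uint32_t *s, const uint32_t *a, const uint32_t *b, size_t n) {
    uint32_t carry = 0;                              // carry into position 0 is 0
    for (size_t i = 0; i < n; ++i) {
        uint32_t sum = a[i] + b[i];
        uint32_t generate  = (sum < a[i]);           // carryin[i]: carry generated here
        uint32_t propagate = (sum == 0xffffffffu);   // an incoming carry would ripple through
        s[i]  = sum + carry;                         // s[i] += carryout[i-1]
        carry = generate | (propagate & carry);      // carryout[i]
    }
    return carry;
}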

Now it is quite straightforward to prove that the carryout function is indeed binary associative, and hence the parallel prefix-sum computation applies. To implement this on CUDA, we can merge both flags 'generate' and 'propagate' into a single variable, since they are mutually exclusive, i.e.:

cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];

In other words,

cy[i] = 0xffffffff  if propagate[i]
cy[i] = 1           if generate[i]
cy[i] = 0           otherwise

Then one can verify that the following formula computes the prefix sum for the carryout function:

cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];

for all k < i.
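
Since generate and propagate are mutually exclusive, the merged flag takes only the three values 0, 1 and 0xffffffff, and the formula above acts as a binary operator on them. Here is a tiny host-side sketch (the helper name carry_op is hypothetical) that brute-forces associativity over all 27 triples:

#include <cstdio>
#include <algorithm>

// Combine the flag of an earlier segment (l) with a later one (r):
// r wins unless r is "propagate" (0xffffffff), in which case l passes through.
static unsigned carry_op(unsigned l, unsigned r) {
    return (unsigned)std::max((int)r, (int)l) & r;
}

int main() {
    const unsigned vals[3] = { 0u, 1u, 0xffffffffu };  // neither / generate / propagate
    for (unsigned a : vals)
        for (unsigned b : vals)
            for (unsigned c : vals)
                if (carry_op(carry_op(a, b), c) != carry_op(a, carry_op(b, c)))
                    std::printf("not associative at %u %u %u\n", a, b, c);
    std::printf("done\n");  // no output above means the operator is associative
    return 0;
}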

The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:

// add & output carry flag
#define UADDO(c, a, b) \
    asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
    asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A, const unsigned *g_B) {

    extern __shared__ unsigned shared[];
    unsigned *r = shared;

    const unsigned N_THIDS = 512;
    unsigned thid = threadIdx.x, thid_in_warp = thid & (WS - 1);
    unsigned cf;

    uint4 a = ((const uint4 *)g_A)[thid],
          b = ((const uint4 *)g_B)[thid];

    UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
    UADDC(a.y, a.y, b.y)
    UADDC(a.z, a.z, b.z)
    UADDC(a.w, a.w, b.w)
    UADDC(cf, 0, 0) // save carry-out

    // memory consumption: 49 * N_THIDS / 64
    // use "alternating" data layout for each pair of warps
    volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
            49 * (thid / 64)) + ((thid / 32) & 1);

    scan[-32] = -1; // put identity element
    if (a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
        // this indicates that carry will propagate through the number
        cf = -1u;

    // "Hillis-and-Steele-style" reduction
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-2]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-4]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-8]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-16]) & cf;
    scan[0] = cf;
    cf = max((int)cf, (int)scan[-32]) & cf;
    scan[0] = cf;

    int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
    if (thid_in_warp == WS - 1) // scan leading carry-outs once again
        postscan[thid >> 5] = cf;

    __syncthreads();

    if (thid < N_THIDS / 32) {
        volatile int *t = (volatile int *)postscan + thid;
        if (thid < 8)
            t[-8] = -1; // identity elements go below postscan[0] only,
                        // so the warp carry-outs in postscan[0..7] are preserved
        cf = t[0];
        cf = max((int)cf, (int)t[-1]) & cf;
        t[0] = cf;
        cf = max((int)cf, (int)t[-2]) & cf;
        t[0] = cf;
        cf = max((int)cf, (int)t[-4]) & cf;
        t[0] = cf;
        cf = max((int)cf, (int)t[-8]) & cf; // final step so the scan spans all 16 warp flags
        t[0] = cf;
    }
    __syncthreads();

    cf = scan[0];
    int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals -1
    scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
    cf = scan[-2];

    if (thid_in_warp == 0)
        cf = ps;
    if ((int)cf < 0)
        cf = 0;

    UADDO(a.x, a.x, cf) // propagate carry flag if needed
    UADDC(a.y, a.y, 0)
    UADDC(a.z, a.z, 0)
    UADDC(a.w, a.w, 0)
    ((uint4 *)g_R)[thid] = a;
}
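
For completeness, here is a hypothetical host-side launch sketch for this kernel. The shared-memory size is my reading of the layout comments above (16 words of padding, 49 words per pair of warps for the warp-level scans, plus N_THIDS/32 words of post-scan storage), so treat it as an assumption and round up if in doubt:

#include <cuda_runtime.h>

int main() {
    const int n = 2048;                        // words per big integer (512 threads x uint4)
    unsigned *dA, *dB, *dR;
    cudaMalloc(&dA, n * sizeof(unsigned));
    cudaMalloc(&dB, n * sizeof(unsigned));
    cudaMalloc(&dR, n * sizeof(unsigned));
    // ... fill dA / dB with the operands (cudaMemcpy from host buffers) ...

    // assumed shared-memory footprint for 512 threads: 16 + 49 * (512/64) + 512/32 words
    size_t shmem = (16 + 49 * (512 / 64) + 512 / 32) * sizeof(unsigned);
    bignum_add<<<1, 512, shmem>>>(dR, dA, dB); // one block adds one pair of 2048-word numbers
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dR);
    return 0;
}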

Note that the macros UADDO / UADDC might not be necessary anymore, since CUDA 4.0 has corresponding intrinsics (however, I am not entirely sure).

Also remark that, although the parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use a redundant representation (as was suggested in the comments above): first accumulate the results of the additions in 64-bit words, and then perform a single carry propagation at the very end, in one sweep.
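
A minimal sketch of that redundant-representation idea (host-side for clarity; the helper names accumulate and normalize are mine): each 64-bit word holds a 32-bit digit plus accumulated carries in its high half, and one final sweep folds everything back into 32-bit words:

#include <stdint.h>
#include <stddef.h>

// Add one more n-word operand into the 64-bit accumulator; the high 32 bits
// absorb carries, so up to 2^32 additions are safe before normalization.
void accumulate(uint64_t *acc, const uint32_t *x, size_t n) {
    for (size_t i = 0; i < n; ++i)
        acc[i] += x[i];
}

// One carry sweep at the very end: emit the low 32 bits of each word and
// push everything above into the next position.
uint64_t normalize(uint32_t *out, const uint64_t *acc, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t t = acc[i] + carry;
        out[i] = (uint32_t)t;
        carry = t >> 32;
    }
    return carry; // overflow beyond the top word, if any
}

This costs one extra pass over the data but turns k big-integer additions into k cheap word-wise accumulations plus a single carry propagation.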
