large integer addition with CUDA


Problem Description



I've been developing a cryptographic algorithm on the GPU and am currently stuck on an algorithm to perform large-integer addition. Large integers are represented in the usual way as a bunch of 32-bit words.

For example, we can use one thread to add two 32-bit words. For simplicity, let's assume that the numbers to be added are of the same length and that the number of threads per block == the number of words. Then:

__global__ void add_kernel(int *C, const int *A, const int *B) {
     int x = A[threadIdx.x];
     int y = B[threadIdx.x];
     int z = x + y;
     int carry = (z < x);
     /** do carry propagation in parallel somehow ? */
     ............

     z = z + newcarry; // update the resulting words after carry propagation
     C[threadIdx.x] = z;
 }

I am pretty sure that there is a way to do carry propagation via some tricky reduction procedure, but I could not figure it out.

I had a look at the CUDA Thrust extensions, but a big-integer package does not seem to be implemented yet. Perhaps someone can give me a hint on how to do that in CUDA?

Solution

You are right, carry propagation can be done via a prefix-sum computation, but it is a bit tricky to define the binary function for this operation and to prove that it is associative (which parallel prefix sum requires). As a matter of fact, this algorithm is used (theoretically) in a carry-lookahead adder.

Suppose we have two large integers a[0..n-1] and b[0..n-1]. Then we compute (i = 0..n-1):

s[i] = a[i] + b[i];
carryin[i] = (s[i] < a[i]);

We define two functions:

generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);

with quite intuitive meaning: generate[i] == 1 means that the carry is generated at position i while propagate[i] == 1 means that the carry will be propagated from position (i - 1) to (i + 1). Our goal is to compute the function carryout[0..n-1] used to update the resulting sum s[0..n-1]. carryout can be computed recursively as follows:

carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = generate[0]

Here carryout[i] == 1 if the carry is generated at position i OR it was generated somewhere earlier AND propagated up to position i. Finally, we update the resulting sum:

s[i] = s[i] + carryout[i-1];  for i = 1..n-1
carry = carryout[n-1];

Now it is quite straightforward to prove that the carryout operator is indeed associative, and hence parallel prefix-sum computation applies. To implement this in CUDA, we can merge the flags 'generate' and 'propagate' into a single variable, since they are mutually exclusive, i.e.:

cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];

In other words,

cy[i] = 0xffffffff  if propagate[i]
cy[i] = 1           if generate[i]
cy[i] = 0           otherwise

Then one can verify that the following formula computes the prefix sum for the carryout function:

cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];

for all k < i. The example code below shows large addition for 2048-word integers. Here I used CUDA blocks with 512 threads:

// add & output carry flag
#define UADDO(c, a, b) \
     asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \
     asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a), "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) {

extern __shared__ unsigned shared[];
unsigned *r = shared; 

const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & (WS - 1);
unsigned cf;

uint4 a = ((const uint4 *)g_A)[thid],
      b = ((const uint4 *)g_B)[thid];

UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out

// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
        49 * (thid / 64)) + ((thid / 32) & 1);

scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
    // this indicates that carry will propagate through the number
    cf = -1u;

// "Hillis-and-Steele-style" reduction 
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;

int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
    postscan[thid >> 5] = cf;

__syncthreads();

if(thid < N_THIDS / 32) {
    volatile int *t = (volatile int *)postscan + thid;
    t[-8] = -1; // load identity symbol
    cf = t[0];
    cf = max((int)cf, (int)t[-1]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-2]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-4]) & cf;
    t[0] = cf;
}
__syncthreads();

cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];

if(thid_in_warp == 0)
    cf = ps;
if((int)cf < 0)
    cf = 0;

UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;
}

Note that the UADDO / UADDC macros might not be necessary anymore, since CUDA 4.0 has corresponding intrinsics (however, I am not entirely sure).

Also note that, although the parallel reduction is quite fast, if you need to add several large integers in a row, it might be better to use a redundant representation (as suggested in the comments above): first accumulate the results of the additions in 64-bit words, and then perform one carry propagation at the very end, in a single sweep.
