在Intel cpu上的SIMD前缀总和 [英] SIMD prefix sum on Intel cpu
问题描述
我需要实现一个前缀和算法,并需要它尽可能快。
例如:
[3,1,7,0,4,1,6,3]
应该给予
[3,4,11,11,15,16,22,25]
有没有办法使用SSE / mmx / SIMD cpu指令?
我的第一个想法是递归地并行递归,直到所有的和计算如下!
//并行do
for(int i = 0; i < i ++){
z [i] = x [i <1] + x [(i <
}
为了使算法更清楚,z输出
,但用于计算输出
int [] w = computePrefixSum(z);
for(int i = 1; i ouput [i] =(i%2 == 0)? (x [i] + ouput [i-1]):w [(i-1)> 1]
}
我知道是并行运行两次通过的和,并在第二次通过中使用SSE。
在第一次通过中,并行计算部分和,存储每个部分和的总和。在第二遍中,将来自前面的部分和的总和加到下一个部分和。您可以使用多线程(例如使用OpenMP)并行运行这两个传递。
假设 n
第二遍你也可以使用SIMD,因为常数值被添加到每个部分和。数组的元素, m
核心和SIMD宽度 w
时间成本应为
n / m + n /(m * w)= n / m *(1 + 1 / w)
由于第一遍不使用SIMD,时间成本总是大于 n / m
例如,对于SIMD_width为4的四个内核(四个32位浮点数与SSE),成本为 5n / 16
。或者比具有
n
的时间成本的顺序代码快约3.2倍。使用超线程,加快速度会更快。
在特殊情况下,也可以在第一遍使用SIMD。那么时间成本就是
2 * n /(m * w)
我发布了使用OpenMP用于SSE代码的线程和内在函数的一般情况的代码,并在以下链接讨论有关特殊情况的详细信息
parallel-prefix-cumulative-sum-with-sse p>
编辑:
我设法找到一个SIMD版本的第一次通过,大约是顺序代码的两倍。
编辑:
For更大的数组一个问题是,第一次通过后,大多数值已从高速缓存中逐出。我想出了一个解决方案,在一个块内并行运行,但连续运行每个块。chunk_size
是应该调整的值。例如我设置为1MB = 256K浮点。现在,当值仍在第2级缓存内时,完成第二遍。这对大型数组有很大的改进。
这是SSE的代码。 AVX代码是大约相同的速度,所以我没有在这里张贴。执行前缀和的函数是
scan_omp_SSEp2_SSEp1_chunk
。传递一个数组a
的浮点数,并用累积和填充数组s
。__ m128 scan_SSE(__ m128 x){
x = _mm_add_ps(x,_mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x),4)
x = _mm_add_ps(x,_mm_shuffle_ps(_mm_setzero_ps(),x,0x40));
return x;
}
float pass1_SSE(float * a,float * s,const int n){
__m128 offset = _mm_setzero_ps();
#pragma omp for schedule(static)nowait
for(int i = 0; i__m128 x = _mm_load_ps(& a [4 * i ]);
__m128 out = scan_SSE(x);
out = _mm_add_ps(out,offset);
_mm_store_ps(& s [4 * i],out);
offset = _mm_shuffle_ps(out,out,_MM_SHUFFLE(3,3,3,3));
}
float tmp [4];
_mm_store_ps(tmp,offset);
return tmp [3];
}
void pass2_SSE(float * s,__m128 offset,const int n){
#pragma omp for schedule(static)
for(int i = 0 ; i__m128 tmp1 = _mm_load_ps(& s [4 * i]);
tmp1 = _mm_add_ps(tmp1,offset);
_mm_store_ps(& s [4 * i],tmp1);
}
}
void scan_omp_SSEp2_SSEp1_chunk(float a [],float s [],int n){
float * suma;
const int chunk_size = 1<< 18;
const int nchunks = n%chunk_size == 0? n / chunk_size:n / chunk_size + 1;
// printf(nchunks%d\\\
,nchunks);
#pragma omp parallel
{
const int ithread = omp_get_thread_num();
const int nthreads = omp_get_num_threads();
#pragma omp single
{
suma = new float [nthreads + 1];
suma [0] = 0;
}
float offset2 = 0.0f;
for(int c = 0; cconst int start = c * chunk_size;
const int chunk =(c + 1)* chunk_size< n? chunk_size:n - c * chunk_size;
suma [ithread + 1] = pass1_SSE(& a [start],& s [start],chunk)
#pragma omp barrier
#pragma omp single
{
float tmp = 0;
for(int i = 0; i <(nthreads + 1); i ++){
tmp + = suma [i];
suma [i] = tmp;
}
}
__m128 offset = _mm_set1_ps(suma [ithread] + offset2);
pass2_SSE(& s [start],offset,chunk);
#pragma omp barrier
offset2 = s [start + chunk-1];
}
}
delete [] suma;
}
I need to implement a prefix sum algorithm and would need it to be as fast as possible. Ex:
[3, 1, 7, 0, 4, 1, 6, 3] should give [3, 4, 11, 11, 15, 16, 22, 25]
Is there a way to do this using SSE/mmx/SIMD cpu instruction?
My first idea is to sum each pair in parallel recursively until all sum have been computed like below!
//in parallel do for (int i = 0; i<z.length; i++){ z[i] = x[i<<1] + x[(i<<1)+1]; }
To make the algorithm a little bit more clear "z" is not the final ouput
but instead used to compute the ouput
int[] w = computePrefixSum(z); for (int i = 1; i<ouput.length; i++){ ouput[i] = (i%2==0) ? (x[i] + ouput[i-1]) : w[(i-1)>>1]; }
解决方案The fastest parallel prefix sum algorithm I know of is to run over the sum in two passes in parallel and use SSE as well in the second pass.
In the first pass you calculate partial sums in parallel and store the total sum for each partial sum. In the second pass you add the total sum from the preceding partial sum to the next partial sum. You can run both passes in parallel using multiple threads (e.g. with OpenMP). The second pass you can also use SIMD since a constant value is being added to each partial sum.
Assuming
n
elements of an array,m
cores, and a SIMD width ofw
the time cost should ben/m + n/(m*w) = n/m*(1+1/w)
Since the fist pass does not use SIMD the time cost will always be greater than
n/m
For example for four cores with a SIMD_width of 4 (four 32bit floats with SSE) the cost would be
5n/16
. Or about 3.2 times faster than sequential code which has a time cost ofn
. Using hyper threading the speed up will be greater still.In special cases it's possible to use SIMD on the first pass as well. Then the time cost is simply
2*n/(m*w)
I posted the code for the general case which uses OpenMP for the threading and intrinsics for the SSE code and discuss details about the special case at the following link parallel-prefix-cumulative-sum-with-sse
Edit: I managed to find a SIMD version for the first pass which is about twice as fast as sequential code. Now I get a total boost of about 7 on my four core ivy bridge system.
Edit: For larger arrays one problem is that after the first pass most values have been evicted from the cache. I came up with a solution which runs in parallel inside a chunk but runs each chunk serially. The
chunk_size
is a value that should be tuned. For example I set it to 1MB = 256K floats. Now the second pass is done while the values are still inside the level-2 cache. Doing this gives a big improvement for large arrays.Here is the code for SSE. The AVX code is about the same speed so I did not post it here. The function which does the prefix sum is
scan_omp_SSEp2_SSEp1_chunk
. Pass it an arraya
of floats and it fills the arrays
with the cumulative sum.__m128 scan_SSE(__m128 x) { x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40)); return x; } float pass1_SSE(float *a, float *s, const int n) { __m128 offset = _mm_setzero_ps(); #pragma omp for schedule(static) nowait for (int i = 0; i < n / 4; i++) { __m128 x = _mm_load_ps(&a[4 * i]); __m128 out = scan_SSE(x); out = _mm_add_ps(out, offset); _mm_store_ps(&s[4 * i], out); offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3)); } float tmp[4]; _mm_store_ps(tmp, offset); return tmp[3]; } void pass2_SSE(float *s, __m128 offset, const int n) { #pragma omp for schedule(static) for (int i = 0; i<n/4; i++) { __m128 tmp1 = _mm_load_ps(&s[4 * i]); tmp1 = _mm_add_ps(tmp1, offset); _mm_store_ps(&s[4 * i], tmp1); } } void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) { float *suma; const int chunk_size = 1<<18; const int nchunks = n%chunk_size == 0 ? n / chunk_size : n / chunk_size + 1; //printf("nchunks %d\n", nchunks); #pragma omp parallel { const int ithread = omp_get_thread_num(); const int nthreads = omp_get_num_threads(); #pragma omp single { suma = new float[nthreads + 1]; suma[0] = 0; } float offset2 = 0.0f; for (int c = 0; c < nchunks; c++) { const int start = c*chunk_size; const int chunk = (c + 1)*chunk_size < n ? chunk_size : n - c*chunk_size; suma[ithread + 1] = pass1_SSE(&a[start], &s[start], chunk); #pragma omp barrier #pragma omp single { float tmp = 0; for (int i = 0; i < (nthreads + 1); i++) { tmp += suma[i]; suma[i] = tmp; } } __m128 offset = _mm_set1_ps(suma[ithread]+offset2); pass2_SSE(&s[start], offset, chunk); #pragma omp barrier offset2 = s[start + chunk-1]; } } delete[] suma; }
这篇关于在Intel cpu上的SIMD前缀总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!