Fast(est) way to write a sequence of integers to global memory?


Problem description

The task is very simple: write a sequence of integers to memory.

Original code:

for (size_t i=0; i<1000*1000*1000; ++i)
{
   data[i]=i;
}

Parallel code:

    size_t stepsize=len/N;

#pragma omp parallel num_threads(N)
    {
        int threadIdx=omp_get_thread_num();

        // Each thread fills its own contiguous chunk; the last thread
        // also takes the remainder when len is not divisible by N.
        size_t istart=stepsize*threadIdx;
        size_t iend=threadIdx==N-1?len:istart+stepsize;
#pragma simd
        for (size_t i=istart; i<iend; ++i)
            x[i]=i;
    }

The performance sucks: it takes 1.6 s to write 1G uint64 variables (which is equal to 5 GB/s). Simple parallelization (OpenMP parallel) of the above code speeds it up a bit, but performance is still poor: 1.4 s with 4 threads and 1.35 s with 6 threads on an i7 3970.

The theoretical memory bandwidth of my rig (i7 3970 / 64 GB DDR3-1600) is 51.2 GB/s. For the above example, the achieved memory bandwidth is only about 1/10 of the theoretical bandwidth, even though the application is pretty much memory-bandwidth-bound.

Does anyone know how to improve the code?

I have written a lot of memory-bound code on GPUs, and it is pretty easy for a GPU to take full advantage of its device memory bandwidth (e.g. 85%+ of theoretical bandwidth).

The code is compiled with Intel ICC 13.1 to a 64-bit binary, with maximum optimization (O3), the AVX code path enabled, and auto-vectorization on.

Update:

I tried all the code below (thanks to Paul R); nothing special happened. I believe the compiler is fully capable of doing this kind of SIMD/vectorization optimization on its own.

As for why I want to fill the numbers there, long story short:

It is part of a high-performance heterogeneous computing algorithm. The device side of the algorithm is so efficient, and the multi-GPU setup so fast, that the performance bottleneck turns out to be the CPU writing several sequences of numbers to memory.

Of course, knowing that the CPU sucks at filling numbers (in contrast, the GPU can fill a sequence of numbers at a speed very close to the theoretical bandwidth of its global memory: 238 GB/s out of 288 GB/s on a GK110, versus a pathetic 5 GB/s out of 51.2 GB/s on the CPU), I could change my algorithm a bit, but what makes me wonder is why the CPU is so bad at filling sequences of numbers here.

As for the memory bandwidth of my rig, I believe the figure (51.2 GB/s) is about right: based on my memcpy() test, the achieved bandwidth is about 80%+ of the theoretical bandwidth (>40 GB/s).
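For reference, a minimal sketch of the kind of memcpy() test I mean looks roughly like this; the buffer size, iteration count, and use of std::chrono here are illustrative, not my actual harness:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    const size_t n = size_t(1) << 30;   // 1 GiB per buffer (assumed size)
    char *src = (char *)malloc(n);
    char *dst = (char *)malloc(n);
    memset(src, 1, n);                  // touch the pages so they are really mapped
    memset(dst, 1, n);

    const int iters = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        memcpy(dst, src, n);
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    // memcpy reads n bytes and writes n bytes, so count 2*n per iteration.
    printf("memcpy: %.2f GB/s\n", 2.0 * n * iters / secs / 1e9);

    free(src);
    free(dst);
    return 0;
}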

Answer

Assuming this is x86, and that you are not already saturating your available DRAM bandwidth, you can try using SSE2 or AVX2 to write 2 or 4 elements at a time:

SSE2:

#include "emmintrin.h"

const __m128i v2 = _mm_set1_epi64x(2);
__m128i v = _mm_set_epi64x(1, 0);

for (size_t i=0; i<1000*1000*1000; i += 2)
{
    _mm_stream_si128((__m128i *)&data[i], v);
    v = _mm_add_epi64(v, v2);
}

AVX2:

#include "immintrin.h"

const __m256i v4 = _mm256_set1_epi64x(4);
__m256i v = _mm256_set_epi64x(3, 2, 1, 0);

for (size_t i=0; i<1000*1000*1000; i += 4)
{
    _mm256_stream_si256((__m256i *)&data[i], v);
    v = _mm256_add_epi64(v, v4);
}

Note that data needs to be suitably aligned (on a 16-byte or 32-byte boundary).
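One way to get such alignment is sketched below; _mm_malloc is just one option (aligned_alloc or posix_memalign would work equally well), and the element count simply mirrors the loops above:

#include <immintrin.h>   // _mm_malloc / _mm_free
#include <cstdint>

int main()
{
    // 32-byte alignment satisfies both the 16-byte SSE2 and the
    // 32-byte AVX2 streaming stores shown above.
    const size_t len = 1000ULL * 1000 * 1000;
    uint64_t *data = static_cast<uint64_t *>(_mm_malloc(len * sizeof(uint64_t), 32));

    // ... run one of the fill loops above on data ...

    _mm_free(data);
    return 0;
}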

AVX2 is only available on Intel Haswell and later, but SSE2 is pretty much universal these days.
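If the same binary has to run on both pre-Haswell and Haswell+ machines, one option is a simple runtime dispatch; here is a minimal sketch using GCC/Clang's __builtin_cpu_supports, where fill_sse2/fill_avx2 are hypothetical wrappers around the two loops above:

#include <cstddef>
#include <cstdint>

// Hypothetical wrappers around the SSE2 and AVX2 loops shown above.
void fill_sse2(uint64_t *data, size_t len);
void fill_avx2(uint64_t *data, size_t len);

void fill(uint64_t *data, size_t len)
{
    // __builtin_cpu_supports is available in GCC 4.8+ and Clang.
    if (__builtin_cpu_supports("avx2"))
        fill_avx2(data, len);
    else
        fill_sse2(data, len);
}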

FWIW, I put together a test harness with a scalar loop and the above SSE and AVX loops, compiled it with clang, and tested it on a Haswell MacBook Air (1600 MHz LPDDR3 DRAM). I got the following results:

# sequence_scalar: t = 0.870903 s = 8.76033 GB / s
# sequence_SSE: t = 0.429768 s = 17.7524 GB / s
# sequence_AVX: t = 0.431182 s = 17.6941 GB / s
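The harness itself isn't shown here, but a rough sketch of its scalar part could look like this; the timing method, buffer size, and GB/s formula are illustrative assumptions rather than the exact code used:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main()
{
    const size_t len = 1000ULL * 1000 * 1000;          // 1G uint64 elements (assumed)
    uint64_t *data = (uint64_t *)malloc(len * sizeof(uint64_t));

    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < len; ++i)                   // scalar reference loop
        data[i] = i;
    auto t1 = std::chrono::steady_clock::now();

    double t = std::chrono::duration<double>(t1 - t0).count();
    double gbps = len * sizeof(uint64_t) / t / 1e9;    // bytes written / elapsed time
    printf("# sequence_scalar: t = %g s = %g GB / s\n", t, gbps);

    free(data);
    return 0;
}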

I also tried it on a Linux desktop PC with a 3.6 GHz Haswell, compiling with gcc 4.7.2, and got the following:

# sequence_scalar: t = 0.816692 s = 9.34183 GB / s
# sequence_SSE: t = 0.39286 s = 19.4201 GB / s
# sequence_AVX: t = 0.392545 s = 19.4357 GB / s

So it looks like the SIMD implementations give a 2x or better improvement over 64-bit scalar code (although 256-bit SIMD doesn't seem to improve on 128-bit SIMD), and typical throughput should be a lot faster than 5 GB/s.

My guess is that there is something wrong with the OP's system or benchmarking code which is resulting in the apparently reduced throughput.
