Measuring memory bandwidth from the dot product of two arrays

Problem Description

The dot product of two arrays

for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}

does not reuse data so it should be a memory bound operation. Therefore, I should be able to measure the memory bandwidth from the dot product.
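
As a quick check on that: each iteration reads two 4-byte floats and performs one multiply and one add, so the arithmetic intensity of the loop is

    2 flops / 8 bytes = 0.25 flop/byte

which is far below what a single core can execute per byte of DRAM traffic, so the loop should indeed be limited by memory rather than by compute.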

Using the code at why-vectorizing-the-loop-does-not-have-performance-improvement I get a bandwidth of 9.3 GB/s for my system. However, when I attempt to calculate the bandwidth using the dot product I get over twice the rate for a single thread and over three times the rate using multiple threads (my system has four cores/eight hyper-threads). This makes no sense to me, since a memory-bound operation should not benefit from multiple threads. Here is the output from the code below:

Xeon E5-1620, GCC 4.9.0, Linux kernel 3.13
dot 1 thread:      1.0 GB, sum 191054.81, time 4.98 s, 21.56 GB/s, 5.39 GFLOPS
dot_avx 1 thread:  1.0 GB, sum 191043.33, time 5.16 s, 20.79 GB/s, 5.20 GFLOPS
dot_avx 2 threads: 1.0 GB, sum 191045.34, time 3.44 s, 31.24 GB/s, 7.81 GFLOPS
dot_avx 8 threads: 1.0 GB, sum 191043.34, time 3.26 s, 32.91 GB/s, 8.23 GFLOPS
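
For reference, the GB/s figures come from the rate formula in the code below: each call touches size = 2*sizeof(float)*LEN = 2^30 bytes, repeated 100 times, so for dot with one thread

    rate = 100 * 2^30 bytes / 4.98 s ≈ 21.56 GB/s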

Can somebody please explain to me why I get over twice the bandwidth for one thread and over three times the bandwidth using more than one thread?

Here is the code I used:

//g++ -O3 -fopenmp -mavx -ffast-math dot.cpp
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>
#include <omp.h>

extern "C" inline float horizontal_add(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);
    __m256 t2 = _mm256_hadd_ps(t1,t1);
    __m128 t3 = _mm256_extractf128_ps(t2,1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
    return _mm_cvtss_f32(t4);
}

extern "C" float dot_avx(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps();
        __m256 sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps();
        __m256 x8, y8;
        #pragma omp for
        for(int i=0; i<n; i+=32) {
            x8 = _mm256_loadu_ps(&x[i]);
            y8 = _mm256_loadu_ps(&y[i]);
            sum1 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum1);
            x8 = _mm256_loadu_ps(&x[i+8]);
            y8 = _mm256_loadu_ps(&y[i+8]);
            sum2 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum2);
            x8 = _mm256_loadu_ps(&x[i+16]);
            y8 = _mm256_loadu_ps(&y[i+16]);
            sum3 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum3);
            x8 = _mm256_loadu_ps(&x[i+24]);
            y8 = _mm256_loadu_ps(&y[i+24]);
            sum4 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum4);
        }
        sum += horizontal_add(_mm256_add_ps(_mm256_add_ps(sum1,sum2),_mm256_add_ps(sum3,sum4)));
    }
    return sum; 
}

extern "C" float dot(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    for(int i=0; i<n; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

int main(){
    uint64_t LEN = 1 << 27;
    float *x = (float*)_mm_malloc(sizeof(float)*LEN,64);
    float *y = (float*)_mm_malloc(sizeof(float)*LEN,64);
    for(uint64_t i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5;}

    uint64_t size = 2*sizeof(float)*LEN;

    volatile float sum = 0;
    double dtime, rate, flops;  
    int repeat = 100;

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;
    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);

    sum = 0;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot_avx(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;

    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS\n", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);
}

I just downloaded, compiled, and ran STREAM as suggested by Jonathan Dursi and here are the results:

One thread

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       14292.1657       0.0023       0.0022       0.0023
Scale:      14286.0807       0.0023       0.0022       0.0023
Add:        14724.3906       0.0033       0.0033       0.0033
Triad:      15224.3339       0.0032       0.0032       0.0032

Eight threads

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       24501.2282       0.0014       0.0013       0.0021
Scale:      23121.0556       0.0014       0.0014       0.0015
Add:        25263.7209       0.0024       0.0019       0.0056
Triad:      25817.7215       0.0020       0.0019       0.0027

Solution

There are a few things going on here, which come down to:

  • You have to work fairly hard to get every last bit of performance out of the memory subsystem; and
  • Different benchmarks measure different things.

The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and taking advantage of it will often require some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding: while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.
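
As a concrete way to see that scaling, the thread count can be swept from inside the same program. Here is a minimal sketch that assumes the dot_avx, x, y, size and repeat definitions from the question's code; omp_set_num_threads and omp_get_wtime are standard OpenMP calls:

// Sketch: time dot_avx with 1, 2, 4 and 8 threads and report the implied bandwidth.
void sweep_threads(float *x, float *y, int n, uint64_t size, int repeat) {
    for (int t = 1; t <= 8; t *= 2) {
        omp_set_num_threads(t);   // sets the team size used by the parallel region in dot_avx
        volatile float sum = 0;
        double t0 = omp_get_wtime();
        for (int i = 0; i < repeat; i++) sum += dot_avx(x, y, n);
        double dt = omp_get_wtime() - t0;
        printf("%d threads: %.2f GB/s (sum %f)\n", t, 1.0*repeat*size/dt*1E-9, sum);
    }
}

Called after the arrays are initialized in main, this prints one bandwidth figure per thread count, which makes the scaling from latency hiding visible in a single run.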

The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).
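
To make the prefetching point concrete, here is a purely illustrative sketch (meant to be dropped into the question's source file, since it reuses its includes and horizontal_add) of the same inner loop with explicit software prefetch via _mm_prefetch. The distance of 128 floats (512 bytes) ahead is an arbitrary example value, and on a predictable streaming access like this the hardware prefetcher is usually already doing the equivalent for you:

// Illustrative only: single-accumulator dot product with explicit software prefetch.
// Assumes n is a multiple of 8. The prefetch is just a hint, so the addresses past
// the end of the arrays touched in the last iterations cannot fault.
float dot_prefetch(const float * __restrict x, const float * __restrict y, const int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        _mm_prefetch((const char*)&x[i + 128], _MM_HINT_T0);
        _mm_prefetch((const char*)&y[i + 128], _MM_HINT_T0);
        __m256 x8 = _mm256_loadu_ps(&x[i]);
        __m256 y8 = _mm256_loadu_ps(&y[i]);
        acc = _mm256_add_ps(_mm256_mul_ps(x8, y8), acc);
    }
    return horizontal_add(acc);
}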

As to the question about comparing with memset, mbw or STREAM, comparing across benchmarks will always cause headaches, even between benchmarks that claim to be measuring the same thing. In particular, "memory bandwidth" isn't a single number - performance varies quite a bit depending on the operations. Both mbw and STREAM do some version of a copy operation, with STREAM's operations spelled out here (taken straight from the web page; all operands are double-precision floating-point values):

------------------------------------------------------------------
name        kernel                  bytes/iter      FLOPS/iter
------------------------------------------------------------------
COPY:       a(i) = b(i)                 16              0
SCALE:      a(i) = q*b(i)               16              1
SUM:        a(i) = b(i) + c(i)          24              1
TRIAD:      a(i) = b(i) + q*c(i)        24              2
------------------------------------------------------------------

so roughly 1/2-1/3 of the memory operations in these cases are writes (and everything's a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes, because of course you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example, which is essentially all reads, is going to be about the best-possible case for pegging the needle on memory bandwidth.
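
For contrast with the all-read dot product, here is a sketch of a TRIAD-style kernel in the spirit of the table above (an illustration, not the official STREAM source): it reads two arrays and writes one, so roughly a third of its memory traffic is stores.

#include <stddef.h>

// TRIAD-like kernel: per element, 16 bytes read + 8 bytes written, 2 flops.
void triad(double * __restrict a, const double * __restrict b,
           const double * __restrict c, double q, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q*c[i];
}

That write stream is part of why the STREAM numbers above sit below the dot-product numbers, even though both are "measuring memory bandwidth".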

In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.
