Measuring memory bandwidth from the dot product of two arrays

Question

The dot product of two arrays

for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}

does not reuse data, so it should be a memory-bound operation. Therefore, I should be able to measure the memory bandwidth from the dot product.

Using the code at why-vectorizing-the-loop-does-not-have-performance-improvement I get a bandwidth of 9.3 GB/s for my system. However, when I attempt to calculate the bandwidth using the dot product I get over twice that rate for a single thread and over three times the rate using multiple threads (my system has four cores/eight hyper-threads). This makes no sense to me, since a memory-bound operation should not benefit from multiple threads. Here is the output from the code below:

Xeon E5-1620, GCC 4.9.0, Linux kernel 3.13
dot 1 thread:      1.0 GB, sum 191054.81, time 4.98 s, 21.56 GB/s, 5.39 GFLOPS
dot_avx 1 thread   1.0 GB, sum 191043.33, time 5.16 s, 20.79 GB/s, 5.20 GFLOPS
dot_avx 2 threads: 1.0 GB, sum 191045.34, time 3.44 s, 31.24 GB/s, 7.81 GFLOPS
dot_avx 8 threads: 1.0 GB, sum 191043.34, time 3.26 s, 32.91 GB/s, 8.23 GFLOPS
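
For reference, these numbers follow directly from the arithmetic in main() below: each array holds 2^27 floats, so size = 2*4*2^27 bytes = 2^30 bytes, and the single-threaded run works out to rate = 100*2^30 bytes / 4.98 s ≈ 21.56 GB/s and flops = 2*100*2^27 / 4.98 s ≈ 5.39 GFLOPS.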

Can anyone explain to me why I get more than twice the bandwidth with one thread and more than three times the bandwidth using more than one thread?

Here is the code I used:

//g++ -O3 -fopenmp -mavx -ffast-math dot.cpp
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>
#include <omp.h>

extern "C" inline float horizontal_add(__m256 a) {
    __m256 t1 = _mm256_hadd_ps(a,a);
    __m256 t2 = _mm256_hadd_ps(t1,t1);
    __m128 t3 = _mm256_extractf128_ps(t2,1);
    __m128 t4 = _mm_add_ss(_mm256_castps256_ps128(t2),t3);
    return _mm_cvtss_f32(t4);
}

extern "C" float dot_avx(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    #pragma omp parallel reduction(+:sum)
    {
        __m256 sum1 = _mm256_setzero_ps();
        __m256 sum2 = _mm256_setzero_ps();
        __m256 sum3 = _mm256_setzero_ps();
        __m256 sum4 = _mm256_setzero_ps();
        __m256 x8, y8;
        #pragma omp for
        for(int i=0; i<n; i+=32) {
            x8 = _mm256_loadu_ps(&x[i]);
            y8 = _mm256_loadu_ps(&y[i]);
            sum1 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum1);
            x8 = _mm256_loadu_ps(&x[i+8]);
            y8 = _mm256_loadu_ps(&y[i+8]);
            sum2 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum2);
            x8 = _mm256_loadu_ps(&x[i+16]);
            y8 = _mm256_loadu_ps(&y[i+16]);
            sum3 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum3);
            x8 = _mm256_loadu_ps(&x[i+24]);
            y8 = _mm256_loadu_ps(&y[i+24]);
            sum4 = _mm256_add_ps(_mm256_mul_ps(x8,y8),sum4);
        }
        sum += horizontal_add(_mm256_add_ps(_mm256_add_ps(sum1,sum2),_mm256_add_ps(sum3,sum4)));
    }
    return sum; 
}

extern "C" float dot(float * __restrict x, float * __restrict y, const int n) {
    x = (float*)__builtin_assume_aligned (x, 32);
    y = (float*)__builtin_assume_aligned (y, 32);
    float sum = 0;
    for(int i=0; i<n; i++) {
        sum += x[i]*y[i];
    }
    return sum;
}

int main(){
    uint64_t LEN = 1 << 27;
    float *x = (float*)_mm_malloc(sizeof(float)*LEN,64);
    float *y = (float*)_mm_malloc(sizeof(float)*LEN,64);
    for(uint64_t i=0; i<LEN; i++) { x[i] = 1.0*rand()/RAND_MAX - 0.5; y[i] = 1.0*rand()/RAND_MAX - 0.5;}

    uint64_t size = 2*sizeof(float)*LEN;

    volatile float sum = 0;
    double dtime, rate, flops;  
    int repeat = 100;

    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;
    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS
", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);

    sum = 0;
    dtime = omp_get_wtime();
    for(int i=0; i<repeat; i++) sum += dot_avx(x,y,LEN);
    dtime = omp_get_wtime() - dtime;
    rate = 1.0*repeat*size/dtime*1E-9;
    flops = 2.0*repeat*LEN/dtime*1E-9;

    printf("%f GB, sum %f, time %f s, %.2f GB/s, %.2f GFLOPS
", 1.0*size/1024/1024/1024, sum, dtime, rate,flops);
}

I just downloaded, compiled, and ran STREAM as suggested by Jonathan Dursi, and here are the results:

One thread

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       14292.1657       0.0023       0.0022       0.0023
Scale:      14286.0807       0.0023       0.0022       0.0023
Add:        14724.3906       0.0033       0.0033       0.0033
Triad:      15224.3339       0.0032       0.0032       0.0032

Eight threads

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       24501.2282       0.0014       0.0013       0.0021
Scale:      23121.0556       0.0014       0.0014       0.0015
Add:        25263.7209       0.0024       0.0019       0.0056
Triad:      25817.7215       0.0020       0.0019       0.0027

Answer

There are a few things going on here that come down to:

  • You have to be working quite hard to get the full performance out of the memory subsystem; and
  • Different benchmarks measure different things.

The first helps explain why you need multiple threads to saturate the available memory bandwidth. There is a lot of concurrency in the memory system, and taking advantage of it usually requires some concurrency in your CPU code. One big reason that multiple threads of execution help is latency hiding: while one thread is stalled waiting for data to arrive, another thread may be able to take advantage of some other data that has just become available.

The hardware helps you a lot on a single thread in this case - because the memory access is so predictable, the hardware can prefetch the data ahead of when you need it, giving you some of the advantage of latency hiding even with one thread; but there are limits to what prefetch can do. The prefetcher won't take it upon itself to cross page boundaries, for instance. The canonical reference for much of this is What Every Programmer Should Know About Memory by Ulrich Drepper, which is now old enough that some gaps are starting to show (Intel's Hot Chips overview of your Sandy Bridge processor is here - note in particular the tighter integration of the memory management hardware with the CPU).
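
Where the hardware prefetcher gives up (at page boundaries, for example), explicit software prefetch hints are one way to keep requests in flight. The following is only a sketch, not part of the original code: dot_avx_prefetch and PF_DIST are illustrative names, and the prefetch distance is a guess that would need tuning on real hardware.

//g++ -O3 -mavx dot_prefetch_sketch.cpp   (sketch only)
#include <x86intrin.h>

// Illustrative: issue prefetch hints PF_DIST floats ahead of the current loads.
// Assumes n is a multiple of 8; prefetching past the end is harmless (prefetch does not fault).
float dot_avx_prefetch(const float *x, const float *y, int n) {
    const int PF_DIST = 512;                       // ~2 KB ahead per array: a guess, needs tuning
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        _mm_prefetch((const char*)(x + i + PF_DIST), _MM_HINT_T0);
        _mm_prefetch((const char*)(y + i + PF_DIST), _MM_HINT_T0);
        __m256 x8 = _mm256_loadu_ps(x + i);
        __m256 y8 = _mm256_loadu_ps(y + i);
        acc = _mm256_add_ps(_mm256_mul_ps(x8, y8), acc);
    }
    float tmp[8];                                  // simple horizontal reduction
    _mm256_storeu_ps(tmp, acc);
    float sum = 0;
    for (int k = 0; k < 8; k++) sum += tmp[k];
    return sum;
}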

As to the question about comparing with memset, mbw, or STREAM: comparing across benchmarks will always cause headaches, even benchmarks that claim to measure the same thing. In particular, "memory bandwidth" isn't a single number; performance varies quite a bit depending on the operations. Both mbw and STREAM perform some version of a copy operation, with STREAM's operations spelled out here (taken straight from the web page; all operands are double-precision floating point):

------------------------------------------------------------------
name        kernel                  bytes/iter      FLOPS/iter
------------------------------------------------------------------
COPY:       a(i) = b(i)                 16              0
SCALE:      a(i) = q*b(i)               16              1
SUM:        a(i) = b(i) + c(i)          24              1
TRIAD:      a(i) = b(i) + q*c(i)        24              2
------------------------------------------------------------------

so roughly 1/2 to 1/3 of the memory operations in these cases are writes (and everything is a write in the case of memset). While individual writes can be a little slower than reads, the bigger issue is that it's much harder to saturate the memory subsystem with writes, because of course you can't do the equivalent of prefetching a write. Interleaving the reads and writes helps, but your dot-product example, which is essentially all reads, is going to be about the best possible case for pegging the needle on memory bandwidth.
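
For concreteness, a minimal TRIAD-style measurement might look like the sketch below. This is an illustrative rewrite, not STREAM's actual source; the array length and the single timed pass are arbitrary simplifications (STREAM reports the best of several repetitions).

//g++ -O3 -fopenmp triad_sketch.cpp   (sketch only, not STREAM's source)
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
    const long N = 1L << 24;                       // assumed length, well beyond the caches
    double *a = (double*)malloc(N*sizeof(double));
    double *b = (double*)malloc(N*sizeof(double));
    double *c = (double*)malloc(N*sizeof(double));
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }  // touch all pages first

    const double q = 3.0;
    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q*c[i];                      // TRIAD: 24 bytes and 2 FLOPs per iteration
    t = omp_get_wtime() - t;

    printf("TRIAD: %.2f GB/s (a[0] = %f)\n", 3.0*sizeof(double)*N/t*1e-9, a[0]);
    free(a); free(b); free(c);
    return 0;
}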

In addition, the STREAM benchmark is (intentionally) written completely portably, with only some compiler pragmas to suggest vectorization, so beating the STREAM benchmark isn't necessarily a warning sign, especially when what you're doing is two streaming reads.
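
As a rough illustration of what such a vectorization-hint pragma looks like (not necessarily the exact pragma STREAM uses), an OpenMP simd hint on the scalar dot loop might be written as:

// Sketch: portable scalar loop with only a pragma hinting at vectorization.
// Assumes an OpenMP 4.0 compiler; build with e.g. g++ -O2 -fopenmp-simd.
float dot_simd(const float * __restrict x, const float * __restrict y, int n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i]*y[i];
    return sum;
}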
