Parallel programming using Haswell architecture

Problem description

I want to learn about parallel programming using Intel's Haswell CPU microarchitecture, specifically using SIMD: SSE4.2 and AVX2 in asm/C/C++ (or any other language). Can you recommend books, tutorials, internet resources, or courses?

Thanks!

Answer

It sounds to me like you need to learn about parallel programming in general on the CPU. I started looking into this about 10 months ago, before I had ever used SSE, OpenMP, or intrinsics, so let me give a brief summary of some important concepts I have learned and some useful resources.

There are several parallel computing techniques that can be employed: MIMD, SIMD, instruction-level parallelism, multi-level caches, and FMA. With Haswell there is also computing on the IGP.

I recommend picking a topic such as matrix multiplication or the Mandelbrot set; both can benefit from all of these techniques.

MIMD

By MIMD I am referring to computing using multiple physical cores. I recommend OpenMP for this. Go through this tutorial, http://bisqwit.iki.fi/story/howto/openmp/#Abstract, and then use this as a reference: https://computing.llnl.gov/tutorials/openMP/. Two of the most common problems when using MIMD are race conditions and false sharing. Follow the OpenMP tag on SO regularly.

SIMD

Many compilers can do auto-vectorization, so I would look into that. MSVC's auto-vectorization is quite primitive, but GCC's is really good.

Learn intrinsics. The best resource for finding out what an intrinsic does is the Intel Intrinsics Guide: http://software.intel.com/sites/landingpage/IntrinsicsGuide/

Another great resource is Agner Fog's vectorclass. 95% of the questions on SO about SSE/AVX can be answered by looking at the vectorclass source code. On top of that, you can use the vectorclass for most SIMD work and still get full speed while skipping intrinsics.

A lot of people use SIMD inefficiently. Read about Array of Structs (AoS), Struct of Arrays (SoA), and Array of Structs of Arrays (AoSoA). Also look into Intel's strip mining, and see the question "Calculating matrix product is much slower with SSE than with straight-forward-algorithm".

See Ingo Wald's PhD thesis for an interesting way to implement SIMD in ray tracing. I used the same idea for the Mandelbrot set, calculating 4 (8) pixels at once using SSE (AVX).

Also read the paper "Extending a C-like Language for Portable SIMD Programming" (http://www.cdl.uni-saarland.de/papers/leissa_vecimp_tr.pdf), co-authored by Wald, to get a better idea of how to use SIMD.

FMA

FMA3 is new with Haswell. It's so new that there is not much discussion of it on SO yet. But this answer (to my question) is good: How to use Fused Multiply-Add (FMA) instructions with SSE/AVX. FMA3 doubles the peak FLOPS, so matrix multiplication is potentially twice as fast on Haswell compared to Ivy Bridge.

According to this answer, the most important aspect of FMA is not the fact that it's one instruction instead of two to do multiplication and addition, but the "(virtually) infinite precision of the intermediate result." For example, implementing double-double multiplication without FMA takes 6 multiplications and several additions, whereas with FMA it takes only two operations.

指令级并行性

Haswell has 8 ports to which it can issue μ-ops (though not every port can take the same micro-op; see this AnandTech review). This means Haswell can do, for example, two 256-bit loads, one 256-bit store, two 256-bit FMA operations, one scalar addition, and a conditional jump at the same time (six μ-ops per clock cycle).

For the most part you don't have to worry about this, since it's handled by the CPU. However, there are cases where your code can limit the potential instruction-level parallelism. The most common is a loop-carried dependency. The following code has a loop-carried dependency:

for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}

The way to fix this is to unroll the loop and compute partial sums:

for(int i=0; i<n; i+=2) {
    sum1 += x[i]*y[i];
    sum2 += x[i+1]*y[i+1];
}
sum = sum1 + sum2;

Multi-level caches

Haswell has up to four levels of cache. Writing your code to take optimal advantage of the cache is, in my opinion, by far the most difficult challenge. It's the topic I still struggle with the most and feel most ignorant about, but in many cases improving cache usage gives better performance than any of the other techniques. I don't have many recommendations for this.

You need to learn about sets and cache lines (and the critical stride), and on NUMA systems about pages. To learn a little about sets and the critical stride, see Agner Fog's http://www.agner.org/optimize/optimizing_cpp.pdf and the question "Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?"

Another very useful topic for the cache is loop blocking, or tiling. See my answer (the one with the highest votes) to "What is the fastest way to transpose a matrix in C++?" for an example.

Computing on the IGP with Iris Pro

All Haswell consumer processors have an IGP (Haswell-E is not out yet). The IGP takes up from at least 30% to over 50% of the silicon. That's enough for at least 2 more x86 cores. This is wasted computing potential for most programmers. The only way to program the IGP is with OpenCL. Intel does not have OpenCL Iris Pro drivers for Linux, so you can only do this on Windows (I'm not sure how good Apple's implementation is). See Programming Intel IGP (e.g. Iris Pro 5200) hardware without OpenCL.

One advantage of the Iris Pro compared to Nvidia and AMD is that double-precision floating point is only one quarter the speed of single precision on the Iris Pro (however, fp64 is only enabled in Direct Compute and not with OpenCL). Nvidia and AMD (recently) cripple double-precision floating point so much that GPGPU double-precision computing is not very effective on their consumer cards.
