In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?

Question

I have blocks of memory that can be quite large (larger than the L2 cache), and sometimes I must set them to all zeros. memset is good in serial code, but what about parallel code? Does anybody have experience with whether calling memset from concurrent threads actually speeds things up for large arrays? Or even with using simple OpenMP parallel for loops?
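For concreteness, the "simple OpenMP parallel for loop" variant I have in mind is roughly the following (an illustrative sketch only; the function name is made up):

#include <stddef.h>

/* Naive zero-fill with a plain OpenMP worksharing loop (compile with -fopenmp). */
void zero_with_parallel_for(char *buf, size_t len)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < len; i++)
        buf[i] = 0;
}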

Answer

People in HPC usually say that one thread is not enough to saturate a single memory link, and the same is usually true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter I wrote for you that fills 2 GiB of memory with zeros twice. And here are the results using GCC 4.7 with different numbers of threads on different architectures (maximum values from several runs are reported):
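The memsetter source itself is not reproduced on this page, so below is a minimal reconstruction of what such a benchmark might look like; the chunking scheme, timing, and output format are assumptions made for illustration, not the author's original code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

#define SIZE (2UL * 1024 * 1024 * 1024)   /* 2 GiB, as in the text */

/* Zero the buffer in contiguous per-thread chunks using memset(). */
static void parallel_zero(char *buf, size_t len)
{
    #pragma omp parallel
    {
        size_t nt    = (size_t)omp_get_num_threads();
        size_t id    = (size_t)omp_get_thread_num();
        size_t chunk = (len + nt - 1) / nt;          /* ceiling division */
        size_t start = id * chunk;
        if (start < len)
            memset(buf + start, 0,
                   (start + chunk > len) ? len - start : chunk);
    }
}

int main(void)
{
    char *buf = malloc(SIZE);
    if (!buf) return 1;

    /* 1st touch: the OS maps physical pages into the malloc'ed range here. */
    double t1 = omp_get_wtime();
    parallel_zero(buf, SIZE);
    t1 = omp_get_wtime() - t1;

    /* rewrite: pages are already mapped, so this measures raw bandwidth. */
    double t2 = omp_get_wtime();
    parallel_zero(buf, SIZE);
    t2 = omp_get_wtime() - t2;

    printf("1st touch: %10.3f MB/s\n", SIZE / (1024.0 * 1024.0) / t1);
    printf("rewrite:   %10.3f MB/s\n", SIZE / (1024.0 * 1024.0) / t2);

    free(buf);
    return 0;
}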

GCC 4.7, code compiled with -O3 -mtune=native -fopenmp:

Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus

Single socket

threads   1st touch      rewrite
1         1452.223 MB/s  3279.745 MB/s
2         1541.130 MB/s  3227.216 MB/s
3         1502.889 MB/s  3215.992 MB/s
4         1468.931 MB/s  3201.481 MB/s

(The first touch is slow because the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3).)

One thread already saturates the memory bandwidth of a single CPU <-> NB link. (NB = North Bridge)

One thread per socket

threads   1st touch      rewrite
1         1455.603 MB/s  3273.959 MB/s
2         2824.883 MB/s  5346.416 MB/s
3         3979.515 MB/s  5301.140 MB/s
4         4128.784 MB/s  5296.082 MB/s

Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.

Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)

Single socket

threads   1st touch      rewrite
1         1469.897 MB/s  3435.087 MB/s
2         2801.953 MB/s  6527.076 MB/s
3         3805.691 MB/s  9297.412 MB/s
4         4647.067 MB/s  10816.266 MB/s
5         5159.968 MB/s  11220.991 MB/s
6         5330.690 MB/s  11227.760 MB/s

At least 5 threads are necessary in order to saturate the bandwidth of one memory link.

One thread per socket

threads   1st touch      rewrite
1         1460.012 MB/s  3436.950 MB/s
2         2928.678 MB/s  6866.857 MB/s
3         4408.359 MB/s  10301.129 MB/s
4         5859.548 MB/s  13712.755 MB/s
5         7276.209 MB/s  16940.793 MB/s
6         8760.900 MB/s  20252.937 MB/s

Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations one could say that at least 40 threads distributed as 5 threads per socket would be necessary in order to saturate all of the eight memory links.

The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a given page executes. Thread pinning (binding to specific CPU cores) is essential on such systems, since thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC with its libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, etc. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
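As a quick illustration (my own sketch, not part of the original benchmark), the effect of pinning can be checked on Linux/glibc by printing which core each thread lands on after setting GOMP_CPU_AFFINITY, KMP_AFFINITY, or the OpenMP 4.0 OMP_PLACES/OMP_PROC_BIND variables:

#define _GNU_SOURCE
#include <sched.h>   /* sched_getcpu() - Linux/glibc specific */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread reports the CPU it is currently running on;
           with proper pinning the mapping stays stable across runs. */
        printf("thread %d/%d on CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

For example, running with GOMP_CPU_AFFINITY="0 4 8 12" OMP_NUM_THREADS=4 ./a.out would place one thread per socket on a four-socket machine (the core numbering here is hypothetical and machine dependent).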

Edit: For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build).

threads   1st touch      rewrite
1         2257.699 MB/s  7659.678 MB/s
2         3282.500 MB/s  8157.528 MB/s
3         4109.371 MB/s  8157.335 MB/s
4         4591.780 MB/s  8141.439 MB/s

Why such high speed even with a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len), and that an SSE4.2-enabled vectorised version by the name of bzero$VARIANT$sse42 is provided by libc.dylib and used at run time. It uses the MOVDQA instruction to zero 16 bytes of memory at once. That's why even with one thread the memory bandwidth is almost saturated. A single-threaded AVX-enabled version using VMOVDQA can zero 32 bytes at once and would probably saturate the memory link.
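For illustration only (my sketch, not the libc implementation), such a vectorised zero-fill corresponds roughly to the following SSE2 intrinsics loop, where _mm_store_si128 compiles to an aligned MOVDQA store; the buffer is assumed to be 16-byte aligned and the length a multiple of 16:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Zero `len` bytes at `buf`; buf must be 16-byte aligned and len a
   multiple of 16 (simplifying assumptions for this sketch). */
void zero_sse2(char *buf, size_t len)
{
    __m128i zero = _mm_setzero_si128();               /* all-zero 128-bit value */
    for (size_t i = 0; i < len; i += 16)
        _mm_store_si128((__m128i *)(buf + i), zero);  /* aligned MOVDQA store */
}

With AVX, the same pattern using _mm256_store_si256 (an aligned VMOVDQA store) would zero 32 bytes per iteration.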

The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.
