In an OpenMP parallel code, would there be any benefit for memset to be run in parallel?


Question

I have blocks of memory that can be quite large (larger than the L2 cache), and sometimes I must set them all to zero. memset is fine in serial code, but what about parallel code? Does anybody have experience with whether calling memset from concurrent threads actually speeds things up for large arrays? Or even with using simple OpenMP parallel for loops?

Answer

People in HPC usually say that one thread is usually not enough to saturate a single memory link, the same usually being true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter I wrote for you that fills 2 GiB of memory with zeros twice. Here are the results using GCC 4.7 with different numbers of threads on different architectures (the maximum values from several runs are reported):
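The memsetter itself is not reproduced in this copy of the answer. The following is a minimal sketch of what such a benchmark could look like, assuming a 2 GiB buffer that is zeroed twice so that both the first-touch and the rewrite bandwidth can be measured; the function name parallel_zero and the timing details are illustrative, not the author's original code.

    // Sketch of an OpenMP-enabled memsetter benchmark (illustrative, not the
    // original code): each thread zeroes its own contiguous chunk of the buffer,
    // and the bandwidth of the first touch and of a rewrite are reported.
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>

    // Zero `size` bytes of `buf`, splitting the work among the OpenMP threads.
    static void parallel_zero(char *buf, size_t size)
    {
        #pragma omp parallel
        {
            int    nth   = omp_get_num_threads();
            int    tid   = omp_get_thread_num();
            size_t chunk = (size + nth - 1) / nth;      // ceiling division
            size_t start = (size_t)tid * chunk;
            if (start < size) {
                size_t len = (start + chunk > size) ? size - start : chunk;
                memset(buf + start, 0, len);
            }
        }
    }

    int main(void)
    {
        const size_t size = 2ULL << 30;                 // 2 GiB
        char *buf = malloc(size);
        if (!buf) return 1;

        double t0 = omp_get_wtime();
        parallel_zero(buf, size);                       // 1st touch: pages get mapped here
        double t1 = omp_get_wtime();
        parallel_zero(buf, size);                       // rewrite: pages already mapped
        double t2 = omp_get_wtime();

        printf("1st touch: %8.3f MB/s\n", size / (1024.0 * 1024.0) / (t1 - t0));
        printf("rewrite:   %8.3f MB/s\n", size / (1024.0 * 1024.0) / (t2 - t1));

        free(buf);
        return 0;
    }

Built with gcc -O3 -mtune=native -fopenmp, the number of threads can then be varied through OMP_NUM_THREADS.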

GCC 4.7, code compiled with -O3 -mtune=native -fopenmp:

Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus

Single socket

threads   1st touch      rewrite
1         1452.223 MB/s  3279.745 MB/s
2         1541.130 MB/s  3227.216 MB/s
3         1502.889 MB/s  3215.992 MB/s
4         1468.931 MB/s  3201.481 MB/s

(1st touch is slow as the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3))

One thread already saturates the memory bandwidth of a single CPU <-> NB link (NB = North Bridge).

1 thread per socket

threads   1st touch      rewrite
1         1455.603 MB/s  3273.959 MB/s
2         2824.883 MB/s  5346.416 MB/s
3         3979.515 MB/s  5301.140 MB/s
4         4128.784 MB/s  5296.082 MB/s

Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.

Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)

Single socket

threads   1st touch      rewrite
1         1469.897 MB/s  3435.087 MB/s
2         2801.953 MB/s  6527.076 MB/s
3         3805.691 MB/s  9297.412 MB/s
4         4647.067 MB/s  10816.266 MB/s
5         5159.968 MB/s  11220.991 MB/s
6         5330.690 MB/s  11227.760 MB/s

At least 5 threads are necessary in order to saturate the bandwidth of one memory link.

1 thread per socket

threads   1st touch      rewrite
1         1460.012 MB/s  3436.950 MB/s
2         2928.678 MB/s  6866.857 MB/s
3         4408.359 MB/s  10301.129 MB/s
4         5859.548 MB/s  13712.755 MB/s
5         7276.209 MB/s  16940.793 MB/s
6         8760.900 MB/s  20252.937 MB/s

Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations, one could say that at least 40 threads, distributed as 5 threads per socket, would be necessary in order to saturate all eight memory links.

The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a specific page executes. Thread pinning (binding to specific CPU cores) is essential on such systems, as thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC with its libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, etc. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
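As an illustration (the exact settings below are assumptions, not part of the original answer), the runs above could be pinned one thread per socket either through the runtime-specific environment variables or through the vendor-neutral OpenMP 4.0 mechanism:

    // Run-time pinning alternatives (illustrative CPU numbers):
    //   GOMP_CPU_AFFINITY="0 4 8 12" OMP_NUM_THREADS=4 ./memsetter   (GCC/libgomp)
    //   KMP_AFFINITY=scatter ./memsetter                             (Intel)
    //   OMP_PLACES=sockets OMP_PROC_BIND=spread ./memsetter          (OpenMP 4.0)
    // The same intent can also be expressed in code with the proc_bind clause:
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel proc_bind(spread)   // spread threads across the available places
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }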

For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build).

threads   1st touch      rewrite
1         2257.699 MB/s  7659.678 MB/s
2         3282.500 MB/s  8157.528 MB/s
3         4109.371 MB/s  8157.335 MB/s
4         4591.780 MB/s  8141.439 MB/s

Why this high speed even with a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len), and that an SSE4.2-enabled vectorised version by the name of bzero$VARIANT$sse42 is provided by libc.dylib and used at run-time. It uses the MOVDQA instruction to zero 16 bytes of memory at once. That's why even with one thread the memory bandwidth is almost saturated. A single-threaded AVX-enabled version using VMOVDQA can zero 32 bytes at once and probably saturate the memory link.
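For illustration only (this sketch is not from the original answer), zeroing 32 bytes per iteration with AVX intrinsics looks roughly like this; the aligned store compiles to VMOVDQA:

    #include <immintrin.h>
    #include <stddef.h>

    // Assumes buf is 32-byte aligned and len is a multiple of 32.
    void avx_zero(char *buf, size_t len)
    {
        __m256i zero = _mm256_setzero_si256();               // 32 bytes of zeros in a YMM register
        for (size_t i = 0; i < len; i += 32)
            _mm256_store_si256((__m256i *)(buf + i), zero);  // aligned 32-byte store (VMOVDQA)
    }

Compile with -mavx (or -march=native on an AVX-capable machine).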

The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.

