False sharing in OpenMP loop array access

Question

I would like to take advantage of OpenMP to make my task parallel.

I need to subtract the same quantity from every element of an array and write the result to another array. Both arrays are dynamically allocated with malloc, and the first one is filled with values read from a file. Each element is of type uint64_t.

#pragma omp parallel for
for (uint64_t i = 0; i < size; ++i) {
    new_vec[i] = vec[i] - shift;
}

Here shift is the fixed value I want to subtract from every element of vec, and size is the length of both vec and new_vec, which is approximately 200k.

I compile the code with g++ -fopenmp on Arch Linux. I'm on an Intel Core i7-6700HQ, and I use 8 threads. The running time is 5 to 6 times longer with the OpenMP version, even though I can see that all the cores are working while it runs.

I think this might be caused by a false sharing issue, but I can't find it.

Recommended answer

You should adjust how the iterations are split among the threads. You can do so with schedule(static,chunk_size).

Try chunk_size values that are multiples of 64/sizeof(uint64_t), so that you avoid the false sharing that arises when chunk boundaries fall inside cache lines:

[ cache line n   ][ cache line n+1 ]
[ chunk 0  ][ chunk 1  ][ chunk 2  ]

and achieve this instead:

[ cache line n   ][ cache line n+1 ][ cache line n+2 ][...]
[ chunk 0                          ][ chunk 1             ]
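
As a minimal sketch of that layout, the loop from the question could be given an explicit chunk size. The 64-byte cache line, the helper name subtract_shift, and the lines_per_chunk parameter are assumptions made for illustration; they do not come from the question or the answer. The chunks only line up with cache lines if new_vec itself starts on a cache-line boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes (64 is common on x86-64, but it is an assumption).
constexpr std::size_t kCacheLineBytes = 64;
// Number of uint64_t elements that fit in one cache line: 64 / 8 = 8.
constexpr std::size_t kElemsPerLine = kCacheLineBytes / sizeof(std::uint64_t);

// Hypothetical helper: writes vec[i] - shift into new_vec[i] for every i < size.
// Each thread receives chunks that span a whole number of cache lines.
void subtract_shift(const std::uint64_t *vec, std::uint64_t *new_vec,
                    std::uint64_t size, std::uint64_t shift,
                    int lines_per_chunk) {
    assert(lines_per_chunk > 0);
    // The chunk size is a multiple of the number of elements per cache line.
    const int chunk = lines_per_chunk * static_cast<int>(kElemsPerLine);
    #pragma omp parallel for schedule(static, chunk)
    for (std::uint64_t i = 0; i < size; ++i) {
        new_vec[i] = vec[i] - shift;
    }
}
```

Calling subtract_shift(vec, new_vec, size, shift, 1) hands out one cache line per chunk; larger values of lines_per_chunk give each thread longer contiguous runs. Without -fopenmp the pragma is ignored and the loop runs serially with the same result.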

You should also allocate your vectors so that they are aligned to cache lines. That way you ensure that the first and all subsequent chunks are properly aligned as well.

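#include <stdlib.h>  /* aligned_alloc */
#include <unistd.h>  /* sysconf; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may yield 0 or -1 */
#define CACHE_LINE_SIZE sysconf(_SC_LEVEL1_DCACHE_LINESIZE)
uint64_t *vec = (uint64_t *)aligned_alloc(CACHE_LINE_SIZE /*alignment*/, 200000 * sizeof(uint64_t) /*size*/); /* cast needed under g++ */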
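
A slightly more defensive version of that allocation might look like the sketch below. The helper names cache_line_size and alloc_aligned_u64 and the 64-byte fallback are assumptions for illustration. The size is rounded up to a multiple of the alignment because C11 required that of aligned_alloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // aligned_alloc, free
#include <unistd.h>  // sysconf

// Hypothetical helper: returns the L1 data cache line size in bytes,
// falling back to 64 (an assumed default) when the system does not report it.
std::size_t cache_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    const long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (reported > 0) {
        return static_cast<std::size_t>(reported);
    }
#endif
    return 64;
}

// Hypothetical helper: allocates `count` uint64_t elements starting on a
// cache-line boundary. Returns nullptr on failure; release with free().
std::uint64_t *alloc_aligned_u64(std::size_t count) {
    const std::size_t align = cache_line_size();
    // aligned_alloc expects a power-of-two alignment.
    assert(align != 0 && (align & (align - 1)) == 0);
    // Round the byte count up to a multiple of the alignment.
    const std::size_t bytes =
        (count * sizeof(std::uint64_t) + align - 1) / align * align;
    return static_cast<std::uint64_t *>(aligned_alloc(align, bytes));
}
```

Both vec and new_vec would be obtained this way, for example alloc_aligned_u64(200000), and released with free once they are no longer needed.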

Your problem is very similar to what the STREAM Triad benchmark represents. Check out how that benchmark is optimized and you will be able to map those optimizations almost exactly onto your code.
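
For comparison, a Triad-style kernel has the shape sketched below: one pass over the arrays with a single arithmetic operation per element, which is why its tuning carries over to the subtraction loop. The function name triad and the use of plain static scheduling are assumptions for illustration, not the benchmark's own source.

```cpp
#include <cassert>
#include <cstddef>

// Triad-style kernel: a[i] = b[i] + scalar * c[i].
// Like the subtraction loop, it streams through memory with very little
// arithmetic per element.
void triad(double *a, const double *b, const double *c,
           double scalar, std::size_t n) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + scalar * c[i];
    }
}
```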
