Cache lines, false sharing and alignment

Question

I wrote the following short C++ program to reproduce the false sharing effect as described by Herb Sutter:

Say we want to perform a total of WORKLOAD integer operations, equally distributed across a number (PARALLEL) of threads. For the purpose of this test, each thread will increment its own dedicated variable from an array of integers, so the process is ideally parallelizable.

#include <thread>
using std::thread;

#define WORKLOAD 100000000  // total increments (example value)
#define PARALLEL 4          // thread count (example value)
#define PADDING  16         // ints per thread slot; varied in the experiment

void thread_func(int* ptr)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
    {
        (*ptr)++;
    }
}

int main()
{
    int arr[PARALLEL * PADDING];
    thread threads[PARALLEL];

    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * PADDING]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}
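As an aside, a program like this should build with a command along the following lines; the source file name is illustrative, -std=c++11 (or later) is required for std::thread, and the threading flag may differ between toolchains:

g++ -std=c++11 -O2 -pthread false_sharing.cpp -o false_sharing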

I think the idea is easy to grasp. If you set

#define PADDING 16

every thread will work on a separate cache line (assuming a cache line length of 64 bytes). So the speedup should grow linearly until PARALLEL exceeds the number of cores. If, on the other hand, PADDING is set to any value below 16, one should encounter severe contention, since at least two threads are then likely to operate on the same cache line, which is effectively protected by a built-in hardware mutex. In that case we would expect the speedup not only to be sublinear, but even to stay below 1, because of the invisible lock convoy.
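As a sanity check on the number 16: with 4-byte ints, 16 elements cover exactly one 64-byte line. C++17 later standardized a portable name for this quantity; the sketch below assumes a standard library that actually provides std::hardware_destructive_interference_size (the g++ 4.8.1 toolchain mentioned at the end of the question predates it), so the query is guarded by its feature-test macro.

#include <iostream>
#include <new>  // std::hardware_destructive_interference_size (C++17)

int main()
{
    // 16 ints of 4 bytes each span one assumed 64-byte cache line.
    std::cout << "16 * sizeof(int) = " << 16 * sizeof(int) << " bytes\n";
#ifdef __cpp_lib_hardware_interference_size
    std::cout << "destructive interference size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
    return 0;
}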

Now, my first attempts nearly satisfied these expectations, yet the minimum value of PADDING needed to avoid false sharing was around 8, not 16. I was quite puzzled for about half an hour until I came to the obvious conclusion that there is no guarantee my array is aligned exactly to the beginning of a cache line in main memory. The actual alignment may vary depending on many conditions, including the size of the array.

In this example there is of course no need to align the array in any special way, because we can just leave PADDING at 16 and everything works out fine. But one could imagine cases where it does make a difference whether a certain structure is aligned to a cache line or not. Hence, I added a few lines of code to get some information about the actual alignment of my array.

#include <cstdint>  // std::uintptr_t

int main()
{
    int arr[PARALLEL * 16];
    thread threads[PARALLEL];
    int offset = 0;

    // Advance until &arr[offset] sits on a 64-byte boundary. Casting the
    // pointer through std::uintptr_t (instead of int) stays valid on 64-bit.
    while (reinterpret_cast<std::uintptr_t>(&arr[offset]) % 64) ++offset;
    for (unsigned i = 0; i < PARALLEL; ++i)
    {
        threads[i] = thread(thread_func, &(arr[i * 16 + offset]));
    }
    for (auto& th : threads)
    {
        th.join();
    }
    return 0;
}
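For reference, the standard library has a facility that performs the same bump-the-pointer computation as the while loop above: std::align from <memory> (C++11 on paper, though some older libstdc++ versions do not ship it). A minimal sketch; the buffer size is illustrative:

#include <cstddef>
#include <memory>  // std::align (C++11; missing from some older libstdc++)

int main()
{
    int arr[64 + 16];                  // raw storage with slack for alignment
    void*       p     = arr;
    std::size_t space = sizeof(arr);

    // Advance p to the first 64-byte boundary that still leaves 64 bytes
    // of room in the buffer; std::align returns nullptr on failure.
    if (std::align(64, 64, p, space))
    {
        int* aligned = static_cast<int*>(p);  // cache-line-aligned slot
        (void)aligned;
    }
    return 0;
}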

Although this manual-offset solution worked out fine for me in this case, I'm not sure whether it would be a good approach in general. So here is my question:

Is there any common way to have objects in memory aligned to cache lines other than what I did in the above example?

(using g++ MinGW Win32 x86 v.4.8.1 posix dwarf rev3)

Answer

You should be able to request the required alignment from the compiler:

alignas(64) int arr[PARALLEL * PADDING]; // align the array to a 64 byte line
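A common variant of the same idea is to attach the alignment to a type instead of one particular array, so that every element automatically starts on its own line. A minimal sketch; the PaddedCounter name and the constant values are illustrative, not from the original question:

#include <thread>

constexpr unsigned PARALLEL = 4;          // example thread count
constexpr unsigned WORKLOAD = 100000000;  // example total increments

struct alignas(64) PaddedCounter          // hypothetical helper type
{
    int value = 0;                        // padded out to a full 64-byte line
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

PaddedCounter counters[PARALLEL];         // each element gets its own line

void thread_func(PaddedCounter* c)
{
    for (unsigned i = 0; i < WORKLOAD / PARALLEL; ++i)
        ++c->value;
}

int main()
{
    std::thread threads[PARALLEL];
    for (unsigned i = 0; i < PARALLEL; ++i)
        threads[i] = std::thread(thread_func, &counters[i]);
    for (auto& th : threads)
        th.join();
    return 0;
}

Since the size of an over-aligned struct is rounded up to a multiple of its alignment, sizeof(PaddedCounter) is exactly 64 here, and the array needs no manual PADDING factor.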
