Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?


Question

I've tried to measure the asymmetric memory access effects of NUMA, and failed.

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte in the array, and measure the elapsed time for the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

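// Pin this thread to `core`, allocate N bytes on that core's NUMA node with
// numa_alloc_local, time M read/write passes over the buffer, and hand the
// buffer back to the caller through *x.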
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

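// Pin this thread to `core` and time M read/write passes over the buffer x
// that thread1 allocated on core 0's NUMA node.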
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    // numa_available() must be called before any other libnuma function.
    std::cout << "numa_available() " << numa_available() << std::endl;
    int numcpus = numa_num_task_cpus();
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<static_cast<size_t>(numcpus);++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
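
Roughly, the fenced variant of the inner loops looked like this (a sketch, not necessarily the exact code I ran; placing the fence after every byte is an assumption):

// Same byte-by-byte read/write pass as above, but with a full memory fence
// after every byte via the GCC builtin __sync_synchronize(). The fence orders
// memory operations; it does not stop the hardware prefetcher.
for (size_t i(0); i < M; ++i)
    for (size_t j(0); j < N; ++j)
    {
        char c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();
    }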

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

Answer

Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;
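
In context, the modified inner loops look roughly like this (a sketch of the change applied to the loops above, with N and M as in the code above):

// Stride through the buffer by the prime 1009 instead of walking it sequentially,
// so the hardware prefetcher cannot predict the next access. Because 1009 is prime
// and does not divide N, (j * 1009) % N still visits every index once per pass.
for (size_t i(0); i < M; ++i)
    for (size_t j(0); j < N; ++j)
    {
        *(((char*)x) + ((j * 1009) % N)) += 1;
    }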

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks, Mysticial!

Conclusion: ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node (in the run above, roughly 1.21 s on cores of the remote node versus roughly 0.90 s on cores of the local node, a ratio of about 1.34).
