Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?


The Question


I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The Experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times and read and write each byte in the array. Measure the elapsed time to do the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to a single core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Pin to 'core', allocate N bytes on its local NUMA node, and time M read/write passes over the array.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Pin to 'core' and time M read/write passes over an array x that was allocated on core 0's NUMA node.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

The Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which has about 12 GB
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
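
For reference, a rough sketch of what the fenced inner loop might look like (a hypothetical reconstruction, not necessarily the exact code that was run; it assumes the same x, N, and M as in thread2 of numatest.cpp above):

char c;
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();  // full memory fence after each read/write pair
    }

The fence orders every byte access, but it does nothing to stop the hardware prefetcher from streaming in the next cache lines ahead of the sequential walk.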

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

Solution

Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;
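
In context, the modified timing loop (inside thread2) might look roughly like this sketch; since 1009 is prime and shares no factor with N = 10,000,000, the index (j * 1009) % N still visits every byte exactly once per pass, just in an order the prefetcher cannot follow:

for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        // Scatter the accesses with a large prime stride so consecutive
        // iterations land on different cache lines (and soon on different
        // pages), which defeats the hardware prefetcher.
        *(((char*)x) + ((j * 1009) % N)) += 1;
    }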

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.
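
For example, using the numbers above: a typical remote-node pass (core 1, ~1.216 s) divided by a typical local-node pass (core 2, ~0.909 s) gives 1.216 / 0.909 ≈ 1.34, which is where the ~133% figure comes from.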
