Measuring NUMA (Non-Uniform Memory Access). No observable asymmetry. Why?


The Question


I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The Experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of size 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times and read and write each byte in the array. Measure the elapsed time to do the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time to do 50 iterations of reading and writing to every byte in array x.

Array x is large to minimize cache effects. We want to measure the speed when the CPU has to go all the way to RAM to load and store, not when caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity on the same node in which array x is allocated to have faster read/write speed. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with > 8-12 cores, as I've seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to a single core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Pin to 'core', allocate N bytes on its local NUMA node, and time M read/write passes over the array.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Pin to 'core' and time M read/write passes over an array x that was allocated on core 0's NUMA node.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

The Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which has about 12 GB
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

Doing 50 iterations reading and writing over array x takes about 1.7 seconds, no matter which core is doing the reading and writing.

Update:

The cache size on my CPUs is 8 MB, so maybe a 10 MB array x is not big enough to eliminate cache effects. I tried a 100 MB array x, and I've tried issuing a full memory fence with __sync_synchronize() inside my innermost loops. It still doesn't reveal any asymmetry between NUMA nodes.
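
For reference, a rough sketch of what the fenced inner loop might look like (a hypothetical reconstruction, not necessarily the exact code that was run; it assumes the same x, N, and M as in thread2 of numatest.cpp above):

char c;
for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();  // full memory fence after each read/write pair
    }

The fence orders every byte access, but it does nothing to stop the hardware prefetcher from streaming in the next cache lines ahead of the sequential walk.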

Update 2:

I've tried reading and writing to array x with __sync_fetch_and_add(). Still nothing.

Solution

Ah hah! Mysticial is right! Somehow, hardware pre-fetching is optimizing my read/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

but that doesn't make any difference. What does make a difference is multiplying my iterator index by prime 1009 to defeat the pre-fetching optimization:

*(((char*)x) + ((j * 1009) % N)) += 1;
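
In context, the modified timing loop (inside thread2) might look roughly like this sketch; since 1009 is prime and shares no factor with N = 10,000,000, the index (j * 1009) % N still visits every byte exactly once per pass, just in an order the prefetcher cannot follow:

for (size_t i = 0; i < M; ++i)
    for (size_t j = 0; j < N; ++j)
    {
        // Scatter the accesses with a large prime stride so consecutive
        // iterations land on different cache lines (and soon on different
        // pages), which defeats the hardware prefetcher.
        *(((char*)x) + ((j * 1009) % N)) += 1;
    }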

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to a local node.
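
For example, using the numbers above: a typical remote-node pass (core 1, ~1.216 s) divided by a typical local-node pass (core 2, ~0.909 s) gives 1.216 / 0.909 ≈ 1.34, which is where the ~133% figure comes from.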
