Malloc performance in a multithreaded environment


Question

I've been running some experiments with the OpenMP framework and found some odd results I'm not sure how to explain.

My goal is to create a huge matrix and then fill it with values. I made some parts of my code, such as this loop, parallel in order to gain performance from my multithreaded environment. I'm running this on a machine with two quad-core Xeon processors, so I can safely run up to 8 concurrent threads.

Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes the loop take longer, and with 8 threads it actually takes more time than it does with only one.

Here is my parallel loop:

 int width = 11;
 int height = 39916800;
 int chunk = 1000;  /* chunk size for the dynamic schedule */
 int i;
 vector<vector<int> > matrix;
 matrix.resize(height);
 #pragma omp parallel shared(matrix, width, height, chunk) private(i) num_threads(3)
 {
   #pragma omp for schedule(dynamic, chunk)
   for(i = 0; i < height; i++){
     matrix[i].resize(width);
   }
 } /* End of parallel block */

This made me wonder: is there a known performance problem with calling malloc (which I suppose is what the resize method of the vector template class actually calls) in a multithreaded environment? I found some articles about performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space, as in this case.

Just to give an example, below is a graph of the time it takes for the loop to finish as a function of the number of threads, both for the allocation loop and for a normal loop that just reads data from this huge matrix later on.

Both times were measured using the gettimeofday function and seem to return very similar, repeatable results across different execution instances. So, does anyone have a good explanation?

Answer

You are right that vector::resize() internally calls malloc. Implementation-wise, malloc is fairly complicated, and I can see multiple places where it can lead to contention in a multithreaded environment:

  1. malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations that reduce the number of times this global data structure is accessed; I don't know how far Ubuntu's allocator has come along.

  2. malloc allocates address space, so when you actually begin to touch the allocated memory you go through a "soft page fault", a page fault that lets the OS kernel allocate the backing RAM for the allocated address space. This can be expensive because of the trip into the kernel, and it requires the kernel to take some global locks to access its own global RAM resource data structures.

  3. The userspace allocator probably keeps some already-allocated space from which to hand out new allocations. However, once that space runs out, the allocator needs to go back to the kernel and allocate more address space. This is also expensive: it requires a trip into the kernel, and the kernel takes some global locks to access its address-space management data structures.

Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This involves allocating it and then touching all of it (all from a single thread), so that you can later use that memory from all your threads without running into lock contention at the user or kernel level.

