Malloc performance in a multithreaded environment


Question

I've been running some experiments with the OpenMP framework and found some odd results I'm not sure how to explain.

My goal is to create a huge matrix and then fill it with values. I made some parts of my code, such as this loop, parallel in order to gain performance from my multithreaded environment. I'm running this on a machine with two quad-core Xeon processors, so I can safely run up to 8 concurrent threads.

Everything works as expected, but for some reason the for loop that actually allocates the rows of my matrix has an odd performance peak when running with only 3 threads. From there on, adding more threads just makes the loop take longer, and with 8 threads it actually takes more time than it does with only one.

Here is my parallel loop:

 int width = 11;
 int height = 39916800;
 int chunk = 1000;  /* chunk size for the dynamic schedule */
 int i;
 vector<vector<int> > matrix;
 matrix.resize(height);
 #pragma omp parallel shared(matrix, width, height, chunk) private(i) num_threads(3)
 {
   #pragma omp for schedule(dynamic, chunk)
   for(i = 0; i < height; i++){
     matrix[i].resize(width);
   }
 } /* End of parallel block */

This made me wonder: is there a known performance problem with calling malloc (which I suppose is what the resize method of the vector template class actually calls) in a multithreaded environment? I found some articles about performance loss when freeing heap space in a multithreaded environment, but nothing specific about allocating new space, as in this case.

Just to give an example, below is a graph of the time it takes for the loop to finish as a function of the number of threads, both for the allocation loop and for a normal loop that just reads data from this huge matrix later on.

Both times were measured using the gettimeofday function and seem to return very similar, repeatable results across different execution instances. So, does anyone have a good explanation?

Answer

You are right that vector::resize() internally calls malloc. Implementation-wise, malloc is fairly complicated, and I can see multiple places where it can lead to contention in a multithreaded environment:

  1. malloc probably keeps a global data structure in userspace to manage the user's heap address space. This global data structure would need to be protected against concurrent access and modification. Some allocators have optimizations that reduce the number of times this global data structure is accessed; I don't know how far Ubuntu's allocator has come along.

  2. malloc allocates address space, so when you actually begin to touch the allocated memory you go through a "soft page fault", a page fault that lets the OS kernel allocate the backing RAM for the allocated address space. This can be expensive because of the trip into the kernel, and it requires the kernel to take some global locks to access its own global RAM resource data structures.

  3. The userspace allocator probably keeps some already-allocated space from which to hand out new allocations. However, once that space runs out, the allocator needs to go back to the kernel and allocate more address space. This is also expensive: it requires a trip into the kernel, and the kernel takes some global locks to access its address-space management data structures.

Bottom line: these interactions can be fairly complicated. If you are running into these bottlenecks, I would suggest that you simply "pre-allocate" your memory. This involves allocating it and then touching all of it (all from a single thread), so that you can later use that memory from all your threads without running into lock contention at the user or kernel level.

