Does multithreading emphasize memory fragmentation?

Description

When allocating and deallocating randomly sized memory chunks with 4 or more threads using OpenMP's parallel for construct, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. Its memory consumption grows from 1050 MB to 1500 MB or more without the program actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak actually is an emphasized effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256 KB (from 1 MB), the effect gets weaker.

Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?

Test Program Description

The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All this work is done for each thread generated by OpenMP.

This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).
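The bookkeeping behind that estimate can be sketched without touching the heap at all. The following is a minimal simulation, assuming the same policy as the test (oldest chunks released first, the limit checked only on every fifth allocation, everything released after the last one); the function name peakLiveBytes and its parameters are hypothetical and not part of the original program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulates the test's bookkeeping only: chunk sizes are queued in allocation
// order, and on every `checkEvery`-th allocation the oldest chunks are dropped
// until the live total is at or below `limit`. On the last allocation
// everything is released. Returns the highest live byte count observed.
std::uint64_t peakLiveBytes(const std::vector<std::size_t>& sizes,
                            std::uint64_t limit,
                            std::size_t checkEvery)
{
    assert(checkEvery > 0);
    std::deque<std::size_t> live;   // sizes of chunks still "allocated", oldest first
    std::uint64_t current = 0;
    std::uint64_t peak = 0;

    for (std::size_t j = 0; j < sizes.size(); ++j) {
        live.push_back(sizes[j]);
        current += sizes[j];
        if (current > peak) peak = current;

        const bool last = (j + 1 == sizes.size());
        if (j % checkEvery == 0 || last) {
            const std::uint64_t keep = last ? 0 : limit;
            while (!live.empty() && current > keep) {
                current -= live.front();
                live.pop_front();
            }
        }
    }
    return peak;
}
```

Because the limit is only enforced on every fifth allocation, the live total can overshoot 256 MB by up to about five chunks of at most 1 MB each before the next check, which fits the rough figure of 260 MB per thread.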

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from Dropbox.

When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>   // uint64_t
#include <iostream>
#include <utility>    // std::pair, std::make_pair
#include <vector>
#include <deque>

#include <omp.h>
#include <math.h>

void runParallelAllocTest()
{
    // constants
    const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
    const int  NUM_THREADS = 4;       // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetitions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;  // debug output

    // pre store allocation sizes
    const int  NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;   // x MB per thread
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;   // use up to x MB chunks
    }


    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;
        uint64_t currAllocBytes = 0;

        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));

            totalAllocBytes += allocSize;
            currAllocBytes  += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }


            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt       = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();

    return 0;
}

The Test-System

From what I have seen so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips

Testing

Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest &
top -b | grep ompmemtest

This yields quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, but it grew to 1500 MB over time:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest                                                                              
11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest                                                                              
11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest                                                                              
11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest                                                                              
11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest                                                                              
11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest                                                                              
11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest                                                                              
11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest                                                                              
11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest                                                                              
11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest                                                                              
11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest                                                                              
11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
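The measurements above come from top's RES column. The same quantity should also be readable from inside the process, which makes it possible to log it at chosen points of the test. The following is a minimal, Linux-only sketch that parses the VmRSS line of /proc/self/status; the helper names parseVmRssKb and residentSetKb are hypothetical and not part of the original program.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Extracts the value of the "VmRSS:" line (in kB) from text in the format of
// /proc/<pid>/status. Returns -1 if the line is missing or malformed.
long parseVmRssKb(std::istream& status)
{
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = 0;
            if (fields >> kb) return kb;
            return -1;
        }
    }
    return -1;
}

// Resident set size of the calling process in kB, or -1 if unavailable.
long residentSetKb()
{
    std::ifstream status("/proc/self/status");
    return parseVmRssKb(status);
}
```

Calling residentSetKb() once per outer iteration, or every few hundred allocations, would give a per-thread view of the growth that top only shows for the whole process.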

Solution

OK, I'll take the bait.

This is on a system with

Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
4x5666.59 bogomips

Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux

gcc version 4.4.5

             total       used       free     shared    buffers     cached
Mem:       8127172    4220560    3906612          0     374328    2748796
-/+ buffers/cache:    1097436    7029736
Swap:            0          0          0

Naive run

I just ran it

time ./ompmemtest 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 2 about to release all memory: 257.043 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 2 done, total alloc'ed -1569.96MB 

real    0m13.429s
user    0m44.619s
sys     0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1:

Vmstat raw data

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 0  0      0   3892    364   2669    0    0    24     0  701 1487  2  1 97  0
 4  0      0   3421    364   2669    0    0     0     0 1317 1953 53  7 40  0
 4  0      0   2858    364   2669    0    0     0     0 2715 5030 79 16  5  0
 4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0
 4  0      0   2853    364   2669    0    0     0     0 4845 8617 77 13 10  0
 4  0      0   2848    364   2669    0    0     0     0 3782 7084 79 13  8  0
 5  0      0   2842    364   2669    0    0     0     0 3723 6120 81 12  7  0
 4  0      0   2835    364   2669    0    0     0     0 3477 4943 84  9  7  0
 4  0      0   2834    364   2669    0    0     0     0 3273 4950 81 10  9  0
 5  0      0   2828    364   2669    0    0     0     0 3226 4812 84 11  6  0
 4  0      0   2823    364   2669    0    0     0     0 3250 4889 83 10  7  0
 4  0      0   2826    364   2669    0    0     0     0 3023 4353 85 10  6  0
 4  0      0   2817    364   2669    0    0     0     0 3176 4284 83 10  7  0
 4  0      0   2823    364   2669    0    0     0     0 3008 4063 84 10  6  0
 0  0      0   3893    364   2669    0    0     0     0 4023 4228 64 10 26  0

Does that information mean anything to you?
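A side note on the negative totals in the run output above: they are what a byte count of roughly 2523 MB looks like after being squeezed into a signed 32-bit integer, since 4096 MB - 1572.7 MB = 2523.3 MB, which is about what 5000 allocations averaging half a megabyte add up to. That the counter really ended up as a 32-bit signed value on this i686 build is an assumption, not something the output proves. The sketch below only illustrates the arithmetic; wrapToInt32 and toMb are hypothetical helpers.

```cpp
#include <cstdint>

// Keeps the low 32 bits of a byte count and reinterprets them as a signed
// 32-bit value, mimicking a counter that is only 32 bits wide.
std::int32_t wrapToInt32(std::uint64_t bytes)
{
    const std::uint32_t low = static_cast<std::uint32_t>(bytes); // value modulo 2^32
    // Modular on common compilers; guaranteed by the standard since C++20.
    return static_cast<std::int32_t>(low);
}

// Converts a byte count to MB the same way the test program prints it.
double toMb(std::int32_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For example, a true total of exactly 3 GB (3072 MB) would come out as 3072 - 4096 = -1024 MB under this assumption.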

Google Thread Caching Malloc

Now for real fun, add a little spice

time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 

real    0m11.663s
user    0m44.255s
sys     0m1.028s

Looks faster, doesn't it?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 4  0      0   3562    364   2684    0    0     0     0 1041 1676 28  7 64  0
 4  2      0   2806    364   2684    0    0     0   172 1641 1843 84 14  1  0
 4  0      0   2758    364   2685    0    0     0     0 1520 1009 98  2  1  0
 4  0      0   2747    364   2685    0    0     0     0 1504  859 98  2  0  0
 5  0      0   2745    364   2685    0    0     0     0 1575 1073 98  2  0  0
 5  0      0   2739    364   2685    0    0     0     0 1415  743 99  1  0  0
 4  0      0   2738    364   2685    0    0     0     0 1526  981 99  2  0  0
 4  0      0   2731    364   2685    0    0     0   684 1536  927 98  2  0  0
 4  0      0   2730    364   2685    0    0     0     0 1584 1010 99  1  0  0
 5  0      0   2730    364   2685    0    0     0     0 1461  917 99  2  0  0
 4  0      0   2729    364   2685    0    0     0     0 1561 1036 99  1  0  0
 4  0      0   2729    364   2685    0    0     0     0 1406  756 100  1  0  0
 0  0      0   3819    364   2685    0    0     0     4 1159 1476 26  3 71  0

In case you wanted to compare vmstat outputs
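One way to compare the two vmstat listings numerically is to average a single column over all data rows. The following is a minimal sketch; columnMean is a hypothetical helper, and the column positions it would be used with (index 11 for context switches cs, index 13 for system time sy, counting from 0) assume the usual vmstat layout r b swpd free buff cache si so bi bo in cs us sy id wa, whose second header line is not shown in the listings.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Averages one whitespace-separated column over all lines that contain at
// least `column + 1` leading numeric fields. Header lines are skipped because
// their first field does not parse as a number. Returns 0 if no row matches.
double columnMean(const std::string& text, std::size_t column)
{
    std::istringstream lines(text);
    std::string line;
    double sum = 0.0;
    std::size_t rows = 0;

    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::vector<double> values;
        double value = 0.0;
        while (fields >> value) values.push_back(value);
        if (values.size() > column) {
            sum += values[column];
            ++rows;
        }
    }
    return rows == 0 ? 0.0 : sum / rows;
}
```

Even without computing averages, the rows above show the trend: with the default allocator sy sits between 9 and 16 percent while the test is running, whereas for most of the tcmalloc run it is 1 to 2 percent, and cs drops from several thousand per second to roughly one thousand. This matches the sys time reported by time (6.000s versus 1.028s).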

Valgrind --tool massif

This is the head of output from ms_print after valgrind --tool=massif ./ompmemtest (default malloc):

--------------------------------------------------------------------------------
Command:            ./ompmemtest
Massif arguments:   (none)
ms_print arguments: massif.out.beforetcmalloc
--------------------------------------------------------------------------------


    GB
1.009^                                                                     :  
     |       ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   264.0

Number of snapshots: 63
 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE

Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools:

gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest

time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
Starting tracking the heap
Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)

real    0m11.981s
user    0m44.455s
sys     0m1.124s

Contact me for full logs/details

Update

In response to the comments, I updated the program:

--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
 void runParallelAllocTest()
 {
    // constants
-   const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
-   const int  NUM_THREADS = 4;       // how many threads?
+   const int  NUM_ALLOCATIONS = 55000; // alloc's per thread
+   const int  NUM_THREADS = 8;        // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)

It ran for over 5m3s. Close to the end, a screenshot of htop shows that the resident set (RES) is indeed slightly higher, going towards 2.3 GB:

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Tasks: 125 total, 2 running
  2  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Load average: 8.09 5.24 2.37 
  3  [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%]     Uptime: 01:54:22
  4  [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]
  Mem[|||||||||||||||||||||||||||||||             3055/7936MB]
  Swp[                                                  0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

Comparing results with a tcmalloc run: it took 4m12s, and the top stats are similar, with minor differences. The big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism has increased: all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc:

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 172 total, 2 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 7.39 2.92 1.11 
  3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 11:12:25
  4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||              3278/7936MB]
  Swp[                                                                0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest
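The three steps quoted from the tcmalloc description above can be sketched in a few lines. This is a simplified illustration of the idea, not tcmalloc's actual code: the class names CentralFreeList and ThreadCache, the single size class, the batch size and the use of global operator new as backing store are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Central free list for one size class, shared by all threads (hypothetical).
class CentralFreeList {
public:
    explicit CentralFreeList(std::size_t objectSize) : objectSize_(objectSize) {}

    ~CentralFreeList()
    {
        for (void* object : free_) ::operator delete(object);
    }

    // Step (1): hand out `count` objects while holding the shared lock.
    std::vector<void*> fetchBatch(std::size_t count)
    {
        std::lock_guard<std::mutex> guard(lock_);
        ++lockedFetches_;
        std::vector<void*> batch;
        while (batch.size() < count) {
            if (free_.empty()) {
                batch.push_back(::operator new(objectSize_)); // stand-in backing store
            } else {
                batch.push_back(free_.back());
                free_.pop_back();
            }
        }
        return batch;
    }

    // Takes objects back from a thread cache that is going away.
    void giveBack(std::vector<void*>& objects)
    {
        std::lock_guard<std::mutex> guard(lock_);
        free_.insert(free_.end(), objects.begin(), objects.end());
        objects.clear();
    }

    // Number of times the shared lock was taken to refill a thread cache.
    std::size_t lockedFetches() const
    {
        std::lock_guard<std::mutex> guard(lock_);
        return lockedFetches_;
    }

private:
    const std::size_t objectSize_;
    mutable std::mutex lock_;
    std::vector<void*> free_;
    std::size_t lockedFetches_ = 0;
};

// Per-thread cache; one instance is meant to be owned by each thread.
class ThreadCache {
public:
    explicit ThreadCache(CentralFreeList& central, std::size_t batchSize = 32)
        : central_(central), batchSize_(batchSize == 0 ? 1 : batchSize) {}
    ThreadCache(const ThreadCache&) = delete;
    ThreadCache& operator=(const ThreadCache&) = delete;

    ~ThreadCache() { central_.giveBack(local_); }

    void* allocate()
    {
        if (local_.empty()) {
            local_ = central_.fetchBatch(batchSize_); // steps (1) and (2)
        }
        assert(!local_.empty());
        void* object = local_.back();                 // step (3)
        local_.pop_back();
        return object;
    }

    // Freed objects go back to the local list without taking any lock.
    void deallocate(void* object) { local_.push_back(object); }

private:
    CentralFreeList& central_;
    const std::size_t batchSize_;
    std::vector<void*> local_;
};
```

With a batch size of 32, only one allocation in 32 has to take the shared lock in this sketch, and frees never do. That is the kind of reduction in lock traffic that would explain the higher CPU utilisation seen in the tcmalloc run.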
