Pthreads unexplained segmentation fault


Question


I implemented a parallel merge sort algorithm from Cormen's well-known text. I wrote it in C using pthreads, and compiled with MinGW on Win7 x64 (also tested later with GCC in Ubuntu, with the same results). My first approach at the parallelization was naïve... I spawned a new thread at every recursion level (which is actually what Cormen's pseudocode implies). However, this usually ends up either taking way too long or crashing due to a segmentation fault (I can assume there is some hard limit to how many threads the system can handle). This seems to be a common newbie mistake for recursive parallelization; in fact, I found a similar DISCUSSION on this site. So I instead used the recommendation in that thread, namely setting a threshold for problem size: if the function that spawns new threads is given a set smaller than the threshold (say 10,000 elements), it just operates on the elements directly rather than creating a new thread for such a small set.

Now everything seemed to be working fine. I tabulated some of my results below. N is the problem size (a set of integers [1, 2, 3, ..., N] thoroughly scrambled) and threshold is the value below which my parallel sort and parallel merge functions refuse to spawn new threads. The first table shows sort time in ms, the second shows how many sort/merge worker threads were spawned in each case. Looking at the N=1E6 and N=1E7 rows in the bottom table, you can see that any time I lower the threshold such that more than ~8000 merge workers are allowed, I get a segmentation fault. Again, I assume that is due to some limit the system places on threads, and I'd be happy to hear more about that, but it's not my main question.

The main question is why the final row gets a segfault when trying to use a fairly high threshold, which would have spawned an expected 15/33 worker threads (following the pattern from the previous rows). Surely this is not too many threads for my system to handle. The one instance which did complete (lower right cell in the table) used about 1.2GB RAM (my system has 6GB), and the threaded versions never seem to take more RAM than the ones with 0 threads at the right of each row.

  • I don't think I am hitting any sort of heap limit... tons of RAM available and it should only take ~1GB even if it was allowed to spawn the 15/33 threads.
  • I also don't think it is a stack problem. I designed the program to use minimal stack, and I don't think the footprint of each thread would be related to the problem size N at all, only the heap. I'm pretty inexperienced with this... but I did a core dump stack backtrace in gdb and the addresses from top to bottom of the stack seem close enough to rule out overflow there.
  • I tried reading the return values of pthread_create (roughly as in the sketch after this list)... in Windows I got a value of 11 a few times before the crash (but it didn't seem to trigger the crash, since there were a few 11's, then a few 0's, i.e. no error, then another 11). That error code is EAGAIN, resources unavailable... but I am not sure what it really means here. Moreover, in Ubuntu the error code was 0 every time, even right up to the crash.
  • I tried Valgrind and got a lot of messages about memory leaks, but I am not sure those are legit since I know Valgrind requires extra resources, and I was able to get those types of errors on other problem set sizes that worked fine without Valgrind.
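
For concreteness, the pthread_create check mentioned in the third bullet would look roughly like this inside p_merge_sort (an illustrative sketch only; falling back to a direct call is one possible way to handle EAGAIN, not necessarily what the timed runs below did):

int rc = pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
if (rc != 0) {
    /* rc is an error number; the 11 seen on Windows corresponds to EAGAIN,
       i.e. insufficient resources to create another thread */
    fprintf(stderr, "pthread_create failed with code %d\n", rc);
    p_merge_sort(&pmsa_l);   /* fall back to doing the work in this thread */
}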

It's pretty obvious it's related to problem size and system resources... I'm hoping there's some piece of general knowledge I'm missing that makes the answer really clear.

Any ideas? Sorry for the long wall of text... thanks if you've read this far! I can post the source if it seems relevant.

EDIT: Source added for reference:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>      /* time(), used to seed the RNG */
#include <sys/time.h>
#include <pthread.h>

const int               N = 100000000;
const int  SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;

int  sort_thread_count = 0;
int merge_thread_count = 0;

typedef struct s_pmergesort_args {
    int *vals_in, p, r, *vals_out, s;
} pmergesort_args;

typedef struct s_pmerge_args {
    int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;

void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);

int main() {
    int *values, i, rand1, rand2, temp, *sorted;
    long long rand1a, rand1b, rand2a, rand2b;
    struct timeval start, end;

    /* allocate values on heap and initialize */
    values = malloc(N * sizeof(int));
    sorted = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        values[i] = i + 1;
        sorted[i] = 0;
    }

    /* scramble
     *  - complicated logic to maximize swapping
     *  - lots of testing (not shown) was done to verify optimal swapping */
    srand(time(NULL));
    for (i = 0; i < N/10; i++) {
        rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1 = (int)((rand1a * rand1b + rand()) % N);
        rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2 = (int)((rand2a * rand2b + rand()) % N);
        temp = values[rand1];
        values[rand1] = values[rand2];
        values[rand2] = temp;
    }

    /* set up args for p_merge_sort */
    pmergesort_args pmsa;
    pmsa.vals_in = values;
    pmsa.p = 0;
    pmsa.r = N-1;
    pmsa.vals_out = sorted;
    pmsa.s = 0;

    /* sort */
    gettimeofday(&start, NULL);
    p_merge_sort(&pmsa);
    gettimeofday(&end, NULL);

    /* verify sorting */
    for (i = 1; i < N; i++) {
        if (sorted[i] < sorted[i-1]) {
            fprintf(stderr, "Error: array is not sorted.\n");
            exit(0);
        }
    }
    printf("Success: array is sorted.\n");
    printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));

    free(values);
    free(sorted);

    printf("(  sort threads created: %d )\n", sort_thread_count);
    printf("( merge threads created: %d )\n", merge_thread_count);

    return 0;
}

void *p_merge_sort(void *v_pmsa) {
    pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
    int *vals_in = pmsa.vals_in;
    int p = pmsa.p;
    int r = pmsa.r;
    int *vals_out = pmsa.vals_out;
    int s = pmsa.s;

    int n = r - p + 1;
    pthread_t worker;

    if (n > SORT_THRESHOLD) {
        sort_thread_count++;
    }

    if (n == 1) {
        vals_out[s] = vals_in[p];
    } else {
        int *temp = malloc(n * sizeof(int));
        int q = (p + r) / 2;
        int q_ = q - p + 1;

        pmergesort_args pmsa_l;
        pmsa_l.vals_in = vals_in;
        pmsa_l.p = p;
        pmsa_l.r = q;
        pmsa_l.vals_out = temp;
        pmsa_l.s = 0;

        pmergesort_args pmsa_r;
        pmsa_r.vals_in = vals_in;
        pmsa_r.p = q+1;
        pmsa_r.r = r;
        pmsa_r.vals_out = temp;
        pmsa_r.s = q_;

        if (n > SORT_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
        } else {
            p_merge_sort(&pmsa_l);
        }
        p_merge_sort(&pmsa_r);

        if (n > SORT_THRESHOLD) {
            pthread_join(worker, NULL);
        }

        pmerge_args pma;
        pma.temp = temp;
        pma.p1 = 0;
        pma.r1 = q_ - 1;
        pma.p2 = q_;
        pma.r2 = n - 1;
        pma.vals_out = vals_out;
        pma.p3 = s;
        p_merge(&pma);
        free(temp);
    }
    return NULL;
}

void *p_merge(void *v_pma) {
    pmerge_args pma = *((pmerge_args *) v_pma);
    int *temp = pma.temp;
    int p1 = pma.p1;
    int r1 = pma.r1;
    int p2 = pma.p2;
    int r2 = pma.r2;
    int *vals_out = pma.vals_out;
    int p3 = pma.p3;

    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    int q1, q2, q3, t;
    pthread_t worker;

    if (n1 < n2) {
        t = p1; p1 = p2; p2 = t;
        t = r1; r1 = r2; r2 = t;
        t = n1; n1 = n2; n2 = t;
    }
    if (n1 > MERGE_THRESHOLD) {
        merge_thread_count++;
    }

    if (n1 == 0) {
        return NULL;
    } else {

        q1 = (p1 + r1) / 2;
        q2 = binary_search(temp[q1], temp, p2, r2);
        q3 = p3 + (q1 - p1) + (q2 - p2);
        vals_out[q3] = temp[q1];

        pmerge_args pma_l;
        pma_l.temp = temp;
        pma_l.p1 = p1;
        pma_l.r1 = q1-1;
        pma_l.p2 = p2;
        pma_l.r2 = q2-1;
        pma_l.vals_out = vals_out;
        pma_l.p3 = p3;

        if (n1 > MERGE_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge, &pma_l);
        } else {        
            p_merge(&pma_l);
        }        

        pmerge_args pma_r;
        pma_r.temp = temp;
        pma_r.p1 = q1+1;
        pma_r.r1 = r1;
        pma_r.p2 = q2;
        pma_r.r2 = r2;
        pma_r.vals_out = vals_out;
        pma_r.p3 = q3+1;

        p_merge(&pma_r);

        if (n1 > MERGE_THRESHOLD) {
            pthread_join(worker, NULL);
        }
    }
    return NULL;
}

int binary_search(int val, int *temp, int p, int r) {
    int low = p;
    int mid;
    int high = (p > r+1)? p : r+1;

    while (low < high) {
        mid = (low + high) / 2;
        if (val <= temp[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return high;
}

EDIT 2: Added new image below showing "max" and "total" RAM used by each version (max meaning highest simultaneous allocation/usage and total meaning the sum of all allocation requests through the program's life). These suggest that with N=1E8 and threshold=1E7 I should get a max usage of 3.2GB (which my system should be able to support). But again... I guess it is related to some other limitation in the pthread library... not my actual system resources.

Solution

Looks like it is running out of memory. In your example, if the code is run sequentially, then the most memory it has allocated at one time is 1.6GB. When using threads, it is using more than 3GB. I put some wrappers around the malloc/free functions, and got this result:

Allocation of 12500000 bytes failed with 3074995884 bytes already allocated.

It's easy to see why the memory usage is higher when threaded: in that case, it is simultaneously sorting both the left and right halves of the overall array, and allocating two large temp buffers to do it. When run sequentially, the temp buffer for the left half is freed before the right half is sorted.
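
A rough back-of-envelope count makes the difference concrete (assuming 4-byte ints, N = 1E8, threshold = 1E7; the per-level figures come from the temp buffers allocated in p_merge_sort):

values[] + sorted[]                     : 2 x 400 MB  =  800 MB
sequential temps (one live chain)       : 400 + 200 + 100 + ... ~  800 MB  -> peak ~ 1.6 GB
threaded temps (levels above threshold) : 4 levels x 400 MB     ~ 1600 MB
threaded temps (16 sequential chains)   : 16 x ~50 MB           ~  800 MB  -> peak ~ 3.2 GB

That lines up with the ~1.6GB sequential figure above and the 3.2GB "max" figure from EDIT 2.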

Here are the wrappers I used:

#include <assert.h>

/* Running totals of heap usage, protected by a mutex since the sort is threaded. */
static size_t total_allocated = 0;
static size_t max_allocated = 0;
static pthread_mutex_t total_allocated_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *allocate(size_t n)
{
  void *result = NULL;
  pthread_mutex_lock(&total_allocated_mutex);
  result = malloc(n);
  if (!result) {
    fprintf(stderr, "Allocation of %zu bytes failed with %zu bytes already allocated\n",
            n, total_allocated);
  }
  assert(result);
  total_allocated += n;
  if (total_allocated > max_allocated) {
    max_allocated = total_allocated;
  }
  pthread_mutex_unlock(&total_allocated_mutex);
  return result;
}

static void deallocate(void *p, size_t n)
{
  pthread_mutex_lock(&total_allocated_mutex);
  total_allocated -= n;
  free(p);
  pthread_mutex_unlock(&total_allocated_mutex);
}
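
To collect those numbers, the allocation sites in the sort were swapped for these wrappers; in p_merge_sort that looks roughly like the following (illustrative; the byte count passed to deallocate has to match the one passed to allocate):

int *temp = allocate(n * sizeof(int));   /* instead of malloc(n * sizeof(int)) */
/* ... recursive sorts and p_merge into vals_out ... */
deallocate(temp, n * sizeof(int));       /* instead of free(temp) */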
