Pthreads unexplained segmentation fault
Problem description
I implemented a parallel merge sort algorithm from Cormen's well-known text. I wrote it in C using pthreads and compiled it with MinGW on Win7 x64 (it was also tested later with GCC on Ubuntu, with the same results).

My first approach to the parallelization was naive: I spawned a new thread at every recursion level (which is actually what Cormen's pseudocode implies). However, this usually ends up either taking far too long or crashing due to a segmentation fault (I assume there is some hard limit on how many threads the system can handle). This seems to be a common newbie mistake for recursive parallelization; in fact, I found a similar DISCUSSION on this site. So I instead used the recommendation in that thread, namely setting a threshold for problem size: if the function that spawns new threads is given a set smaller than the threshold (say, 10,000 elements), it operates on the elements directly rather than creating a new thread for such a small set.

Now everything seemed to be working fine. I tabulated some of my results below. N is the problem size (a set of integers [1, 2, 3, ..., N], thoroughly scrambled) and threshold is the value below which my parallel sort and parallel merge functions refuse to spawn new threads. The first table shows sort time in ms; the second shows how many sort/merge worker threads were spawned in each case.
Looking at the N=1E6 and N=1E7 rows in the bottom table, you can see that any time I lower the threshold enough to allow more than ~8000 merge workers, I get a segmentation fault. Again, I assume that is due to some limit the system places on threads, and I'd be happy to hear more about that, but it's not my main question. The main question is why the final row segfaults when using a fairly high threshold, which would have spawned an expected 15/33 worker threads (following the pattern from the previous rows). Surely that is not too many threads for my system to handle. The one instance that did complete (the lower-right cell in the table) used about 1.2GB of RAM (my system has 6GB), and the threaded versions never seem to take more RAM than the ones with 0 threads at the right of each row.

It's pretty obvious it's related to problem size and system resources, and I'm hoping there's some piece of general knowledge I'm missing that makes the answer really clear. Any ideas? Sorry for the long wall of text... thanks if you've read this far! I can post the source if it seems relevant.

EDIT: Source added for reference (below).

EDIT 2: Added a new image below showing the "max" and "total" RAM used by each version ("max" meaning the highest simultaneous allocation/usage, "total" meaning the sum of all allocation requests through the program's life). These suggest that with N=1E8 and threshold=1E7 I should get a max usage of 3.2GB (which my system should be able to support). But again... I guess it is related to some other limitation in the pthread library, not my actual system resources.

Answer

Looks like it is running out of memory. In your example, if the code is run sequentially, then the most memory it has allocated at one time is 1.6GB. When using threads, it is using more than 3GB. I put some wrappers around the malloc/free functions and got this result: it's easy to see that the memory usage would be more when threaded.
In that case, it would be simultaneously sorting both the left and right sides of the overall array, and allocating two large temp buffers to do it. When run sequentially, the temp buffer for the left half would be freed before sorting the right half. The failing allocation reported:

Allocation of 12500000 bytes failed with 3074995884 bytes already allocated.

Here are the wrappers I used:
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static size_t total_allocated = 0;
static size_t max_allocated = 0;
static pthread_mutex_t total_allocated_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *allocate(int n)
{
    void *result = 0;
    pthread_mutex_lock(&total_allocated_mutex);
    result = malloc(n);
    if (!result) {
        fprintf(stderr, "Allocation of %d bytes failed with %zu bytes already allocated\n",
                n, total_allocated);
    }
    assert(result);
    total_allocated += n;
    if (total_allocated > max_allocated) {
        max_allocated = total_allocated;
    }
    pthread_mutex_unlock(&total_allocated_mutex);
    return result;
}

static void deallocate(void *p, int n)
{
    pthread_mutex_lock(&total_allocated_mutex);
    total_allocated -= n;
    free(p);
    pthread_mutex_unlock(&total_allocated_mutex);
}
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>    /* for time(), used to seed srand() */
#include <pthread.h>

const int N = 100000000;
const int SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;
/* incremented from multiple threads without synchronization, so the
   totals are only approximate (a benign data race in instrumentation) */
int sort_thread_count = 0;
int merge_thread_count = 0;
typedef struct s_pmergesort_args {
    int *vals_in, p, r, *vals_out, s;
} pmergesort_args;

typedef struct s_pmerge_args {
    int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;
void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);
int main() {
    int *values, i, rand1, rand2, temp, *sorted;
    long long rand1a, rand1b, rand2a, rand2b;
    struct timeval start, end;

    /* allocate values on heap and initialize */
    values = malloc(N * sizeof(int));
    sorted = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        values[i] = i + 1;
        sorted[i] = 0;
    }

    /* scramble
     * - complicated logic to maximize swapping
     * - lots of testing (not shown) was done to verify optimal swapping */
    srand(time(NULL));
    for (i = 0; i < N/10; i++) {
        rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1 = (int)((rand1a * rand1b + rand()) % N);
        rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2 = (int)((rand2a * rand2b + rand()) % N);
        temp = values[rand1];
        values[rand1] = values[rand2];
        values[rand2] = temp;
    }

    /* set up args for p_merge_sort */
    pmergesort_args pmsa;
    pmsa.vals_in = values;
    pmsa.p = 0;
    pmsa.r = N-1;
    pmsa.vals_out = sorted;
    pmsa.s = 0;

    /* sort */
    gettimeofday(&start, NULL);
    p_merge_sort(&pmsa);
    gettimeofday(&end, NULL);

    /* verify sorting */
    for (i = 1; i < N; i++) {
        if (sorted[i] < sorted[i-1]) {
            fprintf(stderr, "Error: array is not sorted.\n");
            exit(0);
        }
    }
    printf("Success: array is sorted.\n");
    printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));

    free(values);
    free(sorted);
    printf("( sort threads created: %d )\n", sort_thread_count);
    printf("( merge threads created: %d )\n", merge_thread_count);
    return 0;
}
void *p_merge_sort(void *v_pmsa) {
    pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
    int *vals_in = pmsa.vals_in;
    int p = pmsa.p;
    int r = pmsa.r;
    int *vals_out = pmsa.vals_out;
    int s = pmsa.s;
    int n = r - p + 1;
    pthread_t worker;
    if (n > SORT_THRESHOLD) {
        sort_thread_count++;
    }
    if (n == 1) {
        vals_out[s] = vals_in[p];
    } else {
        /* result is never checked; if this malloc fails under memory
           pressure, the writes into temp below are what segfault */
        int *temp = malloc(n * sizeof(int));
        int q = (p + r) / 2;
        int q_ = q - p + 1;
        pmergesort_args pmsa_l;
        pmsa_l.vals_in = vals_in;
        pmsa_l.p = p;
        pmsa_l.r = q;
        pmsa_l.vals_out = temp;
        pmsa_l.s = 0;
        pmergesort_args pmsa_r;
        pmsa_r.vals_in = vals_in;
        pmsa_r.p = q+1;
        pmsa_r.r = r;
        pmsa_r.vals_out = temp;
        pmsa_r.s = q_;
        if (n > SORT_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
        } else {
            p_merge_sort(&pmsa_l);
        }
        p_merge_sort(&pmsa_r);
        if (n > SORT_THRESHOLD) {
            pthread_join(worker, NULL);
        }
        pmerge_args pma;
        pma.temp = temp;
        pma.p1 = 0;
        pma.r1 = q_ - 1;
        pma.p2 = q_;
        pma.r2 = n - 1;
        pma.vals_out = vals_out;
        pma.p3 = s;
        p_merge(&pma);
        free(temp);
    }
    return NULL;
}
void *p_merge(void *v_pma) {
    pmerge_args pma = *((pmerge_args *) v_pma);
    int *temp = pma.temp;
    int p1 = pma.p1;
    int r1 = pma.r1;
    int p2 = pma.p2;
    int r2 = pma.r2;
    int *vals_out = pma.vals_out;
    int p3 = pma.p3;
    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    int q1, q2, q3, t;
    pthread_t worker;
    if (n1 < n2) {
        t = p1; p1 = p2; p2 = t;
        t = r1; r1 = r2; r2 = t;
        t = n1; n1 = n2; n2 = t;
    }
    if (n1 > MERGE_THRESHOLD) {
        merge_thread_count++;
    }
    if (n1 > 0) {
        q1 = (p1 + r1) / 2;
        q2 = binary_search(temp[q1], temp, p2, r2);
        q3 = p3 + (q1 - p1) + (q2 - p2);
        vals_out[q3] = temp[q1];
        pmerge_args pma_l;
        pma_l.temp = temp;
        pma_l.p1 = p1;
        pma_l.r1 = q1-1;
        pma_l.p2 = p2;
        pma_l.r2 = q2-1;
        pma_l.vals_out = vals_out;
        pma_l.p3 = p3;
        if (n1 > MERGE_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge, &pma_l);
        } else {
            p_merge(&pma_l);
        }
        pmerge_args pma_r;
        pma_r.temp = temp;
        pma_r.p1 = q1+1;
        pma_r.r1 = r1;
        pma_r.p2 = q2;
        pma_r.r2 = r2;
        pma_r.vals_out = vals_out;
        pma_r.p3 = q3+1;
        p_merge(&pma_r);
        if (n1 > MERGE_THRESHOLD) {
            pthread_join(worker, NULL);
        }
    }
    return NULL;
}
int binary_search(int val, int *temp, int p, int r) {
    int low = p;
    int mid;
    int high = (p > r+1) ? p : r+1;
    while (low < high) {
        mid = (low + high) / 2;
        if (val <= temp[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return high;
}