pthread_cond_signal causing deadlock

Problem description

I have a program that deadlocks when one of the threads calls pthread_cond_signal (or broadcast). The problem is 100% reproducible in the main program. I could not figure out what is wrong with it, so I extracted the piece of code in which wait and signal are called. However, the deadlock cannot be reproduced with the extracted code.

Running valgrind on the main program does not report any invalid reads/writes or memory leaks.

What are the possible reasons for a deadlock when calling pthread_cond_signal?

The extracted code is below.

#include <pthread.h>
#include <sys/syscall.h>  // SYS_gettid
#include <unistd.h>       // syscall(), sleep()
#include <assert.h>
#include <stdlib.h>
#include <iostream>

using namespace std;

void Task() {
    cerr << syscall(SYS_gettid) << " In Task, sleeping..." << endl;
    sleep(5);
}

pthread_mutex_t lock;
pthread_cond_t cond;
bool doingTheTask= false;

void* func(void* ) { 
    pthread_mutex_lock(&lock);
    if (doingTheTask) {
        cerr << syscall(SYS_gettid) << " wait... " << endl;
        while (doingTheTask) {  // loop, because pthread_cond_wait can wake up spuriously
            cerr << syscall(SYS_gettid) << " waiting..." << endl ;
            pthread_cond_wait(&cond, &lock);
            cerr << syscall(SYS_gettid) << " woke up!!!" << endl ;
        }
    }
    else {
        cerr << syscall(SYS_gettid) << " My Turn to do the task..." << endl;
        assert( ! doingTheTask );
        doingTheTask= true;
        pthread_mutex_unlock(&lock);
        Task();
        cerr << syscall(SYS_gettid) << " Before trying to acquire lock" << endl;
        pthread_mutex_lock(&lock);
        cerr << syscall(SYS_gettid) << " After acquiring lock" << endl ;
        assert( doingTheTask );
        doingTheTask = false;
        cerr << syscall(SYS_gettid) << " Before broadcast" << endl;
        pthread_cond_broadcast(&cond);
        cerr << syscall(SYS_gettid) << " After broadcast" << endl;
    }
    pthread_mutex_unlock(&lock);
    return NULL;
}


int main() {
    pthread_mutex_init(&lock,NULL);
    pthread_cond_init(&cond,NULL);
    pthread_t thread[2];

    for ( int i = 0 ;  i < 2 ; i ++ ) {
        if (0 != pthread_create(&thread[i], NULL, func, NULL) ) {
            cerr << syscall(SYS_gettid) << " Error creating thread" << endl;
            exit(1);
        }
    } 

    // Join every thread before destroying the mutex and condvar they use.
    for ( int i = 0 ;  i < 2 ; i ++ ) {
        pthread_join(thread[i],NULL);
    }
    pthread_mutex_destroy(&lock);
    pthread_cond_destroy(&cond);

    return 0;
}

The only important part is the func function. The other parts are only there so that the snippet compiles.

As I said, the problem is not reproducible in this program. The differences between this snippet and the main program are:

  • In the main program, the mutex and condvar are member fields and the function is a member method.
  • Task() does real work instead of sleeping.
  • Multiple threads may wait, so we should broadcast rather than signal. However, the deadlock is 100% reproducible even when I use signal with a single waiting thread.

The problem I am trying to solve with this code is a mechanism that does the task once whenever at least one of the threads needs it done. No two threads may do the task in parallel, and once one of them has done it, the others do not need to. Clients of this method assume that it blocks until the task is done (so I cannot return immediately after seeing that someone else is doing the task).

The backtraces of the deadlocked threads are:

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007ffff73e291c in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:259

#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:136
#1  0x00007ffff73e30b1 in pthread_cond_signal@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_signal.S:142

"pthread_cond_signal deadlocks" is a similar question, but it seems the person asking it had memory corruption. I do not have memory corruption (according to valgrind).

The problem is 100% reproducible on both machines I tested it on (latest Arch Linux and Ubuntu 10.04.3).

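A sample output of the main program follows. It again shows that the threads get stuck in the calls to pthread_cond_wait and pthread_cond_signal/broadcast: thread 3973 never prints "woke up!!!" and thread 3967 never prints "After broadcast". (The first column shows the thread IDs.)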

3967    In Task, sleeping...
3967    My Turn to do the task...
3967    In Task, sleeping...
3973    wait...
3973    waiting...
3976    <output from some other thread>
3967    Before trying to acquire lock
3967    After acquiring lock
3967    Before broadcast

The main program is in C++, but I am only using the C parts of the language, so I avoided the C++ tag.

Recommended answer

Stupid error: I was destroying the mutex and condvar before the threads executed signal and wait. To reproduce, just move the destroy calls above the loop that joins the threads in the main function.

It is still surprising that on both of my machines, this produces 100% consistent (and wrong) behavior.
