Why does this code deadlock?

Views: 126  Published: 2020/4/25 11:26:49  Tags: linux-kernel, kernel, deadlock, watchdog, spinlock

Problem description

I created two Linux kernel threads in my loadable module and bound them to separate CPU cores on a dual-core Android device. After running this a few times, I noticed that the device reboots with a hardware watchdog timer reset. I hit the issue consistently. What could be causing the deadlock?

Basically, I need to make sure both threads run do_something() at the same time on different cores without anything stealing CPU cycles (i.e. with interrupts disabled). I am using a spinlock and a volatile variable for this. I also have a semaphore so the parent thread can wait for the child thread.

#define CPU_COUNT 2

/* Globals */
spinlock_t lock;
struct semaphore sem;
volatile unsigned long count;

/* Thread util function for binding the thread to CPU*/
struct task_struct* thread_init(kthread_fn fn, void* data, int cpu)
{
    struct task_struct *ts;

    ts=kthread_create(fn, data, "per_cpu_thread");
    kthread_bind(ts, cpu);
    if (!IS_ERR(ts)) {
        wake_up_process(ts);
    }
    else {
        ERR("Failed to bind thread to CPU %d\n", cpu);
    }
    return ts;
}

/* Sync both threads */
void thread_sync()
{   
    spin_lock(&lock);
    ++count;
    spin_unlock(&lock); 

    while (count != CPU_COUNT);
}

void do_something()
{
}

/* Child thread */
int per_cpu_thread_fn(void* data)
{
    int i = 0;
    unsigned long flags = 0;
    int cpu = smp_processor_id();

    DBG("per_cpu_thread entering (cpu:%d)...\n", cpu);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* sync threads */
    thread_sync();

    /* Do something */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Notify parent about exit */
    up(&sem);
    DBG("per_cpu_thread exiting (cpu:%d)...\n", cpu);
    return 0;
}

/* Main thread */
int main_thread()
{
    int cpuB;
    int cpu = smp_processor_id();
    unsigned long flags = 0;

    DBG("main thread running (cpu:%d)...\n", cpu);

    /* Init globals*/
    sema_init(&sem, 0);
    spin_lock_init(&lock);
    count = 0;

    /* Launch child thread and bind to the other CPU core */
    if (cpu == 0) cpuB = 1; else cpuB = 0;        
    thread_init(per_cpu_thread_fn, NULL, cpuB);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* thread sync */
    thread_sync();

    /* Do something here */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Wait for child to join */
    DBG("main thread waiting for all child threads to finish ...\n");
    down_interruptible(&sem);
}

Recommended answer

I'm not sure this is the real cause, but your code contains some serious errors.

First, in while (count != CPU_COUNT);. You must not read a shared variable without holding a lock unless the read is atomic. With count, that isn't guaranteed.

You must protect the read of count with the lock. You can replace your while loop with the following:

unsigned long local_count;
do {
    spin_lock(&lock);
    local_count = count;
    spin_unlock(&lock);
} while (local_count != CPU_COUNT);

Alternatively, you could use atomic types. Note the absence of locking:

atomic_t count = ATOMIC_INIT(0);

...

void thread_sync() {
    atomic_inc(&count);
    while (atomic_read(&count) != CPU_COUNT);
}
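The same counter-and-spin rendezvous can be tried out in user space. The following is only a minimal sketch, assuming C11 atomics and POSIX threads in place of atomic_t and kernel threads; THREAD_COUNT, rendezvous(), worker() and run_rendezvous() are illustrative names, not part of the question's module.

```c
#include <pthread.h>
#include <stdatomic.h>

#define THREAD_COUNT 2              /* plays the role of CPU_COUNT */

static atomic_int arrived;          /* threads that reached the rendezvous */
static atomic_int passed;           /* threads that got past it */

/* Same shape as thread_sync(): announce arrival, then spin until all arrive. */
static void rendezvous(void)
{
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) != THREAD_COUNT)
        ;                           /* busy-wait, no lock needed */
}

static void *worker(void *arg)
{
    (void)arg;
    rendezvous();
    atomic_fetch_add(&passed, 1);
    return NULL;
}

/*
 * Starts THREAD_COUNT workers and returns how many got past the rendezvous,
 * or -1 if a thread could not be created. Error handling is kept minimal:
 * after a failed pthread_create(), workers already started keep spinning.
 */
static int run_rendezvous(void)
{
    pthread_t tid[THREAD_COUNT];
    int i;

    atomic_store(&arrived, 0);
    atomic_store(&passed, 0);
    for (i = 0; i < THREAD_COUNT; i++)
        if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
            return -1;
    for (i = 0; i < THREAD_COUNT; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&passed);
}
```

Calling run_rendezvous() from a test program (built with -pthread) should return THREAD_COUNT once both workers have met. Unlike the kernel case, the user-space scheduler can preempt a spinning thread, so this sketch cannot reproduce the watchdog reset itself.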

Second, a problem with interrupts. I think you don't understand what you are doing.

local_irq_save() saves the current interrupt state and disables interrupts. Then you disable interrupts again with local_irq_disable(). After some work, you restore the previous state with local_irq_restore() and then enable interrupts with local_irq_enable(). That last enable is totally wrong: it turns interrupts on regardless of their previous state.
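To see why an unconditional enable is harmful, here is a small user-space model of the save/restore semantics. It is a sketch only: irqs_on and the model_irq_* macros are hypothetical stand-ins for one CPU's interrupt-enable bit and the kernel macros, not the real implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for one CPU's interrupt-enable bit. */
static bool irqs_on = true;

/* Like the kernel macros, these take the flags variable by name, not by pointer. */
#define model_irq_save(flags)    do { (flags) = irqs_on; irqs_on = false; } while (0)
#define model_irq_restore(flags) do { irqs_on = ((flags) != 0); } while (0)
#define model_irq_enable()       do { irqs_on = true; } while (0)

/* Critical section that restores the caller's state: safe to nest. */
static void section_with_restore(void)
{
    unsigned long flags;

    model_irq_save(flags);
    /* ... work with interrupts off ... */
    model_irq_restore(flags);
}

/* Critical section that ends with an unconditional enable: unsafe to nest. */
static void section_with_enable(void)
{
    unsigned long flags;

    model_irq_save(flags);
    /* ... work with interrupts off ... */
    model_irq_restore(flags);
    model_irq_enable();             /* ignores the saved state */
}

/* Calls fn inside an outer interrupts-off region and reports whether
 * interrupts are still off when fn returns. */
static bool still_off_after(void (*fn)(void))
{
    unsigned long outer;
    bool off;

    model_irq_save(outer);
    fn();
    off = !irqs_on;
    model_irq_restore(outer);
    return off;
}
```

In this model, still_off_after(section_with_restore) is true, while still_off_after(section_with_enable) is false: the extra enable switches interrupts back on inside a caller that expected them to stay off.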

Third problem. If the main thread isn't bound to a CPU, you should not use smp_processor_id() unless you are sure the kernel will not reschedule right after you get the CPU number. It's better to use get_cpu(), which disables kernel preemption and then returns the CPU id. When done, call put_cpu().

But between get_cpu() and put_cpu() preemption is disabled, so creating and running other threads there is a bug (kthread_create() can sleep). That's why you should set the affinity of the main thread instead.
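The constraint can be pictured with a toy user-space model. This is a sketch under stated assumptions: preempt_depth, current_cpu and the model_* functions are hypothetical stand-ins for the kernel's preemption count, the CPU the task runs on, and get_cpu()/put_cpu(); they only mimic the rule that code which may sleep must not run while preemption is disabled.

```c
#include <stdbool.h>

static int preempt_depth;           /* hypothetical stand-in for the preempt count */
static int current_cpu;             /* hypothetical: CPU the task happens to be on */

/* Models get_cpu(): disable preemption, then read the CPU id. */
static int model_get_cpu(void)
{
    preempt_depth++;
    return current_cpu;
}

/* Models put_cpu(): re-enable preemption. */
static void model_put_cpu(void)
{
    preempt_depth--;
}

/* Models the check for a call that may sleep, such as creating a thread. */
static bool model_may_sleep_ok(void)
{
    return preempt_depth == 0;
}

/* Picks the other core of a dual-core system, as the question's main thread does. */
static int other_cpu(int cpu)
{
    return cpu == 0 ? 1 : 0;
}

/* Thread creation attempted between get_cpu() and put_cpu(): not allowed. */
static bool launch_inside_get_cpu(void)
{
    int target = other_cpu(model_get_cpu());
    bool ok = model_may_sleep_ok();     /* creation would happen here */

    (void)target;
    model_put_cpu();
    return ok;
}

/* Thread creation after put_cpu(): allowed, but the CPU id read earlier is
 * only still valid if the task has been pinned (affinity set). */
static bool launch_after_put_cpu(void)
{
    int target = other_cpu(model_get_cpu());

    model_put_cpu();
    (void)target;
    return model_may_sleep_ok();        /* creation would happen here */
}
```

In the model, launch_inside_get_cpu() reports false and launch_after_put_cpu() reports true, which is why pinning the main thread first and creating the child outside any preemption-disabled region is the safer ordering.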

Fourth. local_irq_save() and local_irq_restore() are macros that take a variable, not a pointer to unsigned long. (I got an error and some warnings when passing pointers; I wonder how you compiled your code.) Remove the address-of operator (&) from those calls.

The final code is available here: http://pastebin.com/Ven6wqWf
