How are user-level threads scheduled/created, and how are kernel level threads created?


Problem Description


Apologies if this question is stupid. I tried to find an answer online for quite some time, but couldn't, and hence I'm asking here. I am learning threads, and I've been going through this link and this Linux Plumbers Conference 2013 video about kernel level and user level threads, and as far as I understood, using pthreads creates threads in userspace, and the kernel is not aware of this and views it as a single process only, unaware of how many threads are inside. In such a case,

  • who decides the scheduling of these user threads during the timeslice the process gets, as the kernel sees it as a single process and is unaware of the threads, and how is the scheduling done?
  • If pthreads create user level threads, how are kernel level or OS threads created from user space programs, if required?
  • According to the above link, it says Operating Systems kernel provides system call to create and manage threads. So does a clone() system call create a kernel level thread or a user level thread?
    • If it creates a kernel level thread, then strace of a simple pthreads program also shows using clone() while executing, but then why would it be considered user level thread?
    • If it doesn't create a kernel level thread, then how are kernel threads created from userspace programs?
  • According to the link, it says "It require a full thread control block (TCB) for each thread to maintain information about threads. As a result there is significant overhead and increased in kernel complexity.", so in kernel level threads, only the heap is shared, and the rest all are individual to the thread?

Edit:

I was asking about the user-level thread creation, and it's scheduling because here, there is a reference to Many to One Model where many user level threads are mapped to one Kernel-level thread, and Thread management is done in user space by the thread library. I've been only seeing references to using pthreads, but unsure if it creates user-level or kernel-level threads.

Solution

This is prefaced by the top comments.

The documentation you're reading is generic [not linux specific] and a bit outdated. And, more to the point, it is using different terminology. That is, I believe, the source of the confusion. So, read on ...

What it calls a "user-level" thread is what I'm calling an [outdated] LWP thread. What it calls a "kernel-level" thread is what is called a native thread in linux. Under linux, what is called a "kernel" thread is something else altogether [See below].

using pthreads create threads in the userspace, and the kernel is not aware about this and view it as a single process only, unaware of how many threads are inside.

This was how userspace threads were done prior to the NPTL (native posix threads library). This is also what SunOS/Solaris called an LWP lightweight process.

There was one process that multiplexed itself and created threads. IIRC, it was called the thread master process [or some such]. The kernel was not aware of this. The kernel didn't yet understand or provide support for threads.

But, because these "lightweight" threads were switched by code in the userspace based thread master (aka "lightweight process scheduler") [just a special user program/process], they were very slow to switch context.

Also, before the advent of "native" threads, you might have 10 processes. Each process gets 10% of the CPU. If one of the processes was an LWP that had 10 threads, these threads had to share that 10% and, thus, got only 1% of the CPU each.

All this was replaced by the "native" threads that the kernel's scheduler is aware of. This changeover was done 10-15 years ago.

Now, with the above example, we have 20 threads/processes that each get 5% of the CPU. And, the context switch is much faster.

It is still possible to have an LWP system under a native thread, but, now, that is a design choice, rather than a necessity.

Further, LWP works great if each thread "cooperates". That is, each thread loop periodically makes an explicit call to a "context switch" function. It is voluntarily relinquishing the process slot so another LWP can run.

However, the pre-NPTL implementation in glibc also had to [forcibly] preempt LWP threads (i.e. implement timeslicing). I can't remember the exact mechanism used, but, here's an example. The thread master had to set an alarm, go to sleep, wake up and then send the active thread a signal. The signal handler would effect the context switch. This was messy, ugly, and somewhat unreliable.
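
To make that concrete, here is a minimal, hedged sketch of the timeslicing idea only -- this is not the actual pre-NPTL glibc code. An interval timer delivers SIGALRM, and the handler forces a switch through a hypothetical LWP_yield function, in the spirit of the LWP_switch routine sketched further down in this answer:

#include <signal.h>
#include <string.h>
#include <sys/time.h>

// hypothetical: pick the next runnable LWP and switch to it, in the
// spirit of the LWP_switch sketch further down
extern void LWP_yield(void);

static void
timeslice_handler(int sig)
{
    (void) sig;

    // switching stacks from inside a signal handler is exactly the
    // "messy, ugly, and somewhat unreliable" part mentioned above
    LWP_yield();
}

static void
LWP_start_timeslicing(void)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa,0,sizeof(sa));
    sa.sa_handler = timeslice_handler;
    sigaction(SIGALRM,&sa,NULL);

    // deliver SIGALRM every 10 ms
    memset(&it,0,sizeof(it));
    it.it_interval.tv_usec = 10 * 1000;
    it.it_value.tv_usec = 10 * 1000;
    setitimer(ITIMER_REAL,&it,NULL);
}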

Joachim mentioned pthread_create function creates a kernel thread

That is [technically] incorrect to call it a kernel thread. pthread_create creates a native thread. This is run in userspace and vies for timeslices on an equal footing with processes. Once created there is little difference between a thread and a process.

The primary difference is that a process has its own unique address space. A thread, however, is a process that shares its address space with other process/threads that are part of the same thread group.
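
A tiny, self-contained illustration of that sharing (plain pthreads; the variable and function names are just for this example): a write done by one thread is directly visible to the other, because both are "processes" sharing one address space.

#include <pthread.h>
#include <stdio.h>

static int shared = 0;                  // one copy, in the one shared address space

static void *
worker(void *arg)
{
    (void) arg;
    shared = 42;                        // visible to the main thread, no IPC needed
    return NULL;
}

int
main(void)
{
    pthread_t tid;

    pthread_create(&tid,NULL,worker,NULL);
    pthread_join(tid,NULL);

    printf("shared = %d\n",shared);     // prints 42
    return 0;
}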

If it doesn't create a kernel level thread, then how are kernel threads created from userspace programs?

Kernel threads are not userspace threads, NPTL, native, or otherwise. They are created by the kernel via the kernel_thread function. They run as part of the kernel and are not associated with any userspace program/process/thread. They have full access to the machine. Devices, MMU, etc. Kernel threads run in the highest privilege level: ring 0. They also run in the kernel's address space and not the address space of any user process/thread.

A userspace program/process may not create a kernel thread. Remember, it creates a native thread using pthread_create, which invokes the clone syscall to do so.

Threads are useful to do things, even for the kernel. So, it runs some of its code in various threads. You can see these threads by doing ps ax. Look and you'll see kthreadd, ksoftirqd, kworker, rcu_sched, rcu_bh, watchdog, migration, etc. These are kernel threads and not programs/processes.
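
For completeness, here is a minimal sketch -- a toy kernel module, not anything from the original answer -- of how code running inside the kernel starts such a thread. kthread_run is the modern convenience wrapper around the lower-level kernel_thread interface, and the resulting thread shows up in ps ax just like the ones named above.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int
worker_fn(void *data)
{
    // runs as part of the kernel, in the kernel's address space
    while (!kthread_should_stop()) {
        pr_info("hello from a kernel thread\n");
        msleep(1000);
    }

    return 0;
}

static int __init
demo_init(void)
{
    worker = kthread_run(worker_fn,NULL,"demo_kthread");

    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit
demo_exit(void)
{
    kthread_stop(worker);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");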


UPDATE:

You mentioned that the kernel doesn't know about user threads.

Remember that, as mentioned above, there are two "eras".

(1) Before the kernel got thread support (circa 2004?). This used the thread master (which, here, I'll call the LWP scheduler). The kernel just had the fork syscall.

(2) All kernels after that which do understand threads. There is no thread master, but, we have pthreads and the clone syscall. Now, fork is implemented as clone. clone is similar to fork but takes some arguments. Notably, a flags argument and a child_stack argument.

More on this below ...

then, how is it possible for user level threads to have individual stacks?

There is nothing "magic" about a processor stack. I'll confine discussion [mostly] to x86, but this would be applicable to any architecture, even those that don't have a stack register (e.g. 1970's era IBM mainframes, such as the IBM System 370).

Under x86, the stack pointer is %rsp. The x86 has push and pop instructions. We use these to save and restore things: push %rcx and [later] pop %rcx.

But, suppose the x86 did not have %rsp or push/pop instructions? Could we still have a stack? Sure, by convention. We [as programmers] agree that (e.g.) %rbx is the stack pointer.

In that case, a "push" of %rcx would be [using AT&T assembler]:

subq    $8,%rbx
movq    %rcx,0(%rbx)

And, a "pop" of %rcx would be:

movq    0(%rbx),%rcx
addq    $8,%rbx

To make it easier, I'm going to switch to C "pseudo code". Here are the above push/pop in pseudo code:

// push %rcx
    %rbx -= 8;
    0(%rbx) = %rcx;

// pop %rcx
    %rcx = 0(%rbx);
    %rbx += 8;


To create a thread, the LWP scheduler had to create a stack area using malloc. It then had to save this pointer in a per-thread struct, and then kick off the child LWP. The actual code is a bit tricky, assume we have an (e.g.) LWP_create function that is similar to pthread_create:

typedef void * (*LWP_func)(void *);

// per-thread control
typedef struct tsk tsk_t;
struct tsk {
    tsk_t *tsk_next;                    //
    tsk_t *tsk_prev;                    //
    void *tsk_stack;                    // stack base
    u64 tsk_regsave[16];
};

// list of tasks
typedef struct tsklist tsklist_t;
struct tsklist {
    tsk_t *tsk_next;                    //
    tsk_t *tsk_prev;                    //
};

tsklist_t tsklist;                      // list of tasks

tsk_t *tskcur;                          // current thread

// LWP_switch -- switch from one task to another
void
LWP_switch(tsk_t *to)
{

    // NOTE: we use (i.e. burn) register values as we do our work. in a real
    // implementation, we'd have to push/pop these in a special way. so, just
    // pretend that we do that ...

    // save all registers into tskcur->tsk_regsave
    tskcur->tsk_regsave[RAX] = %rax;
    // ...

    tskcur = to;

    // restore most registers from tskcur->tsk_regsave
    %rax = tskcur->tsk_regsave[RAX];
    // ...

    // set stack pointer to new task's stack
    %rsp = tskcur->tsk_regsave[RSP];

    // set resume address for task
    push(%rsp,tskcur->tsk_regsave[RIP]);

    // issue "ret" instruction
    ret();
}

// LWP_create -- start a new LWP
tsk_t *
LWP_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000);
    // the stack grows downward, so point RSP at the top of the area
    tsknew->tsk_regsave[RSP] = tsknew->tsk_stack + 0x100000;

    // give task its argument and its starting address
    tsknew->tsk_regsave[RDI] = arg;
    tsknew->tsk_regsave[RIP] = start_routine;

    // switch to new task
    LWP_switch(tsknew);

    return tsknew;
}

// LWP_destroy -- destroy an LWP
void
LWP_destroy(tsk_t *tsk)
{

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
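
As an aside, modern POSIX systems already package up this register save/restore and stack switch in the ucontext functions (getcontext / makecontext / swapcontext). A minimal sketch of the same "run on a malloc'd stack" trick, assuming 64 KiB is enough stack for the demo:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define STKSIZE (64 * 1024)

static ucontext_t main_ctx;
static ucontext_t lwp_ctx;

static void
lwp_body(void)
{
    printf("running on the malloc'd stack\n");
    // returning resumes uc_link, i.e. main_ctx
}

int
main(void)
{
    char *stack = malloc(STKSIZE);

    getcontext(&lwp_ctx);               // initialize the context
    lwp_ctx.uc_stack.ss_sp = stack;     // the stack we allocated
    lwp_ctx.uc_stack.ss_size = STKSIZE;
    lwp_ctx.uc_link = &main_ctx;        // where to resume when lwp_body returns
    makecontext(&lwp_ctx,lwp_body,0);

    swapcontext(&main_ctx,&lwp_ctx);    // the whole LWP_switch in one call
    printf("back on the original stack\n");

    free(stack);
    return 0;
}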


With a kernel that understands threads, we use pthread_create and clone, but we still have to create the new thread's stack. The kernel does not create/assign a stack for a new thread. The clone syscall accepts a child_stack argument. Thus, pthread_create must allocate a stack for the new thread and pass that to clone:

// pthread_create -- start a new native thread
tsk_t *
pthread_create(LWP_func start_routine,void *arg)
{
    tsk_t *tsknew;

    // get per-thread struct for new task
    tsknew = calloc(1,sizeof(tsk_t));
    append_to_tsklist(tsknew);

    // get new task's stack
    tsknew->tsk_stack = malloc(0x100000);

    // start up thread -- clone wants the top of the stack, and the real call
    // passes more flags (CLONE_VM | CLONE_SIGHAND | CLONE_THREAD | ...)
    clone(start_routine,tsknew->tsk_stack + 0x100000,CLONE_THREAD,arg);

    return tsknew;
}

// pthread_join -- wait for a native thread, then clean up
void
pthread_join(tsk_t *tsk)
{

    // wait for thread to die ...

    // free the task's stack
    free(tsk->tsk_stack);

    remove_from_tsklist(tsk);

    // free per-thread struct for dead task
    free(tsk);
}
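
To show the same point with the real syscall rather than pseudo code: a hedged, minimal example of calling the glibc clone() wrapper directly with a caller-allocated stack. The flags are kept deliberately simple here; an actual pthread_create passes CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD and several more.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE 0x100000

static int
child_fn(void *arg)
{
    printf("child sees arg: %s\n",(char *) arg);
    return 0;
}

int
main(void)
{
    // the caller, not the kernel, allocates the child's stack
    char *stack = malloc(STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return 1;

    // clone wants the *top* of the area -- the stack grows downward
    pid = clone(child_fn,stack + STACK_SIZE,CLONE_VM | SIGCHLD,"hello");
    if (pid < 0)
        return 1;

    waitpid(pid,NULL,0);                // reap the child
    free(stack);

    return 0;
}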


Only a process or main thread is assigned its initial stack by the kernel, usually at a high memory address. So, if the process does not use threads, normally, it just uses that pre-assigned stack.

But, if a thread is created, either an LWP or a native one, the starting process/thread must pre-allocate the area for the proposed thread with malloc. Side note: Using malloc is the normal way, but the thread creator could just have a large pool of global memory: char stack_area[MAXTASK][0x100000]; if it wished to do it that way.
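
With NPTL you normally let pthread_create pick and allocate the stack for you, but the API also lets the caller supply one, which mirrors the LWP sketch above. A minimal example (the names and the 1 MiB size are just for illustration; pthread_attr_setstack requires suitable alignment and at least PTHREAD_STACK_MIN bytes):

#include <pthread.h>
#include <stdlib.h>

static void *
worker(void *arg)
{
    return arg;                         // trivial thread body
}

int
main(void)
{
    size_t size = 0x100000;
    void *stack = malloc(size);         // caller-supplied stack, as in the LWP sketch

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstack(&attr,stack,size);

    pthread_t tid;
    pthread_create(&tid,&attr,worker,NULL);
    pthread_join(tid,NULL);

    pthread_attr_destroy(&attr);
    free(stack);
    return 0;
}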

Even an ordinary program that does not use threads [of any type] may wish to "override" the default stack it has been given.

That process could decide to use malloc and the above assembler trickery to create a much larger stack if it were doing a hugely recursive function.

See my answer here: What is the difference between user defined stack and built in stack in use of memory?
