Why does Linux's scheduler put two threads onto the same physical core on processors with HyperThreading?


Problem description


I've read in multiple places that Linux's default scheduler is hyperthreading-aware on multi-core machines, meaning that if you have a machine with 2 real cores (4 HT), it won't schedule two busy threads onto logical cores in such a way that they both run on the same physical core (which would lead to a 2x performance cost in many cases).

But when I run stress -c 2 (which spawns two threads running at 100% CPU) on my Intel i5-2520M, it often schedules (and keeps) the two threads on HT cores 1 and 2, which map to the same physical core. This happens even if the system is otherwise idle.
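
If you want to watch where the workers land yourself, one quick check (a sketch, not from the original question; it assumes stress and the procps ps tool are installed) is to start the workers and ask ps which logical CPU each one last ran on:

$ stress -c 2 &
$ sleep 1
$ ps -o pid,psr,comm -C stress    # psr = logical CPU each task last ran on (parent plus two busy workers)
$ pkill stress                    # stop the workers when done

If the two busy workers keep reporting sibling CPU numbers (per the /proc/cpuinfo mapping below), you are seeing the behaviour described here.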

This also happens with real programs (I'm using stress here because it makes it easy to reproduce), and when that happens, my program understandably takes twice as long to run. Setting affinity manually with taskset fixes that for my program, but I'd expect an HT-aware scheduler to do that correctly by itself.

You can find the HT->physical core assignment with egrep "processor|physical id|core id" /proc/cpuinfo | sed 's/^processor/\nprocessor/g'.
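
An alternative view of the same mapping (not in the original question, but these sysfs files are present on any recent Linux) is the per-CPU sibling list; logical CPUs that report the same list share one physical core:

$ grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list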

So my question is: Why does the scheduler put my threads onto the same physical core here?


Notes:

  • This question is very similar to this other question, the answers to which say that Linux has quite a sophisticated thread scheduler which is HT aware. As described above, I cannot observe this fact (check for yourself with stress -c), and would like to know why.
  • I know that I can set processor affinity manually for my programs, e.g. with the taskset tool or with the sched_setaffinity function (see the sketch after this list). This is not what I'm looking for; I would expect the scheduler to know by itself that mapping two busy threads to one physical core and leaving the other physical core completely empty is not a good idea.
  • I'm aware that there are some situations in which you would prefer threads to be scheduled onto the same physical core and leave the other core free, but it seems nonsensical that the scheduler would do that in roughly 1/4 of the cases. It seems to me that the HT cores it picks are completely random, or maybe the HT cores that had the least activity at the time of scheduling, but that wouldn't be very hyperthreading-aware, given how clearly programs with the characteristics of stress benefit from running on separate physical cores.
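
For completeness, the manual workaround referred to above looks roughly like this (a sketch; the CPU numbers 0 and 2 are only an example and must be replaced with logical CPUs that belong to different physical cores on your machine, per the /proc/cpuinfo output above):

$ taskset -c 0,2 stress -c 2    # restrict the workers to two logical CPUs on different physical cores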

Solution

I think it's time to summarize some knowledge from the comments.

The Linux scheduler is aware of HyperThreading -- information about it should be read from the ACPI SRAT/SLIT tables, which are provided by the BIOS/UEFI -- and from that Linux builds its scheduler domains.

Domains form a hierarchy -- e.g. on a 2-CPU server you will get three layers of domains: all-CPUs, per-CPU-package, and per-core. You can check this in /proc/schedstat:

$ awk '/^domain/ { print $1, $2; } /^cpu/ { print $1; }' /proc/schedstat
cpu0
domain0 0000,00001001     <-- all cpus from core 0
domain1 0000,00555555     <-- all cpus from package 0
domain2 0000,00ffffff     <-- all cpus in the system
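
If you want to see which level is which, kernels built with CONFIG_SCHED_DEBUG also expose a name per domain (SMT for the hyperthreading level, MC for the package level). The path below is an assumption that depends on kernel version -- newer kernels move this information under /sys/kernel/debug/sched/ -- so verify it on your system:

$ grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name 2>/dev/null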

Part of the CFS scheduler is the load balancer -- the beast that should steal tasks from your busy core and move them to another core. Here is its description from the kernel documentation:

While doing that, it checks to see if the current domain has exhausted its rebalance interval. If so, it runs load_balance() on that domain. It then checks the parent sched_domain (if it exists), and the parent of the parent and so forth.

Initially, load_balance() finds the busiest group in the current sched domain. If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in that group. If it manages to find such a runqueue, it locks both our initial CPU's runqueue and the newly found busiest one and starts moving tasks from it to our runqueue. The exact number of tasks amounts to an imbalance previously computed while iterating over this sched domain's groups.

From: https://www.kernel.org/doc/Documentation/scheduler/sched-domains.txt

You can monitor the load balancer's activity by comparing the numbers in /proc/schedstat. I wrote a script for doing that: schedstat.py

The alb_pushed counter shows that the load balancer successfully moved a task out:

Sun Apr 12 14:15:52 2015              cpu0    cpu1    ...    cpu6    cpu7    cpu8    cpu9    cpu10   ...
.domain1.alb_count                                    ...      1       1                       1  
.domain1.alb_pushed                                   ...      1       1                       1  
.domain2.alb_count                              1     ...                                         
.domain2.alb_pushed                             1     ...
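
If you don't want a separate script, a bare-bones way to get the same kind of information (a sketch, not the schedstat.py mentioned above) is to snapshot the counters twice and diff them while your workload runs:

$ grep -E '^(cpu|domain)' /proc/schedstat > /tmp/schedstat.before
$ sleep 10                                                        # let stress -c 2 run meanwhile
$ grep -E '^(cpu|domain)' /proc/schedstat > /tmp/schedstat.after
$ diff /tmp/schedstat.before /tmp/schedstat.after                 # lines that changed show runqueue/balancer activity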

However, the load balancer's logic is complex, so it is hard to determine what can stop it from doing its work well and how that relates to the schedstat counters. Neither I nor @thatotherguy could reproduce your issue.

I see two possibilities for that behavior:

  • You have some aggressive power-saving policy that tries to keep one core idle in order to reduce the CPU's power consumption (a few things worth checking are sketched after this list).
  • You really encountered a bug in the scheduling subsystem, in which case you should go to the LKML and carefully share your findings (including mpstat and schedstat data).
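
For the first possibility, a couple of places worth checking (these paths are assumptions that depend on kernel version and configuration, and may not exist on your system) are the cpufreq governor and, on older kernels, the SMT power-saving knob:

$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor        # "powersave" vs "performance"/"ondemand"
$ cat /sys/devices/system/cpu/sched_smt_power_savings 2>/dev/null  # present only on older kernels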
