How do the C++ STL (ExecutionPolicy) algorithms determine how many parallel threads to use?

Question

C++17 upgraded 69 STL algorithms to support parallelism through an optional ExecutionPolicy parameter (passed as the first argument), e.g.:

std::sort(std::execution::par, begin(v), end(v));

I suspect the C++17 standard deliberately says nothing about how to implement the multi-threaded algorithms, leaving it up to the library writers to decide what is best (and allowing them to change their minds later). Still, I'm keen to understand at a high level what issues are being considered in the implementation of the parallel STL algorithms.

Some questions on my mind include (but are not limited to!):

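  • What is the relationship between the maximum number of threads (used by a C++ application) and the number of CPU and/or GPU cores on the machine?
  • What differences are there in the number of threads each algorithm uses? (Will each algorithm always use the same number of threads in every situation?)
  • Is any consideration given to other parallel STL calls on other threads (within the same application)? (E.g., if a thread calls std::for_each(par, ...), will it use more, fewer, or the same number of threads depending on whether a std::sort(par, ...) is already running on some other thread(s)? Is there perhaps a thread pool?)
  • Is any consideration given to cores being busy due to external factors? (E.g., if one core is very busy, say analysing SETI signals, will the C++ application reduce the number of threads it uses?)
  • Do some algorithms use only CPU cores? Or only GPU cores?
  • I suspect implementations will vary from library to library (compiler to compiler?); even details about that would be interesting.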

I realise the point of these parallel algorithms is to shield the programmer from having to worry about these details. However, any info that gives me a high-level mental picture of what's going on inside the library calls would be appreciated.

Answer

Most of these questions cannot be answered by the standard as of today. However, your question, as I understand it, mixes two concepts:

C1. Constraints on parallel algorithms

C2. Execution of algorithms

The C++17 parallel STL is entirely about C1: it sets constraints on how instructions and/or threads may be interleaved or transformed in a parallel computation. C2, on the other hand, is in the process of being standardized; the keyword there is executor (more on this later).

For C1, there are 3 standard policies (std::execution::seq, std::execution::par, and std::execution::par_unseq) corresponding to combinations of task and instruction parallelism: seq allows neither, par allows task parallelism, and par_unseq allows both. For example, when performing an integer accumulation, par_unseq could be used, since the order of the additions is not important. However, for floating-point arithmetic, where addition is not associative, seq is a better fit if you want, at the very least, a deterministic result. In short: policies set constraints on parallel computation, and these constraints could potentially be exploited by a smart compiler.

On the other hand, once you have a parallel algorithm and its constraints (and possibly after some optimization or transformation), an executor finds a way to execute it. There are default executors (for the CPU, for example), or you can write your own, in which case all the configuration regarding the number of threads, workload, processing unit, etc. can be set.

As of today, C1 is in the standard but C2 is not, so if you use C1 with a compliant compiler, you cannot specify which execution profile you want; the library implementation decides for you (possibly through extensions).

So, to address your questions:

(Regarding your first 5 questions) By definition, the C++17 parallel STL does not define any computation, only data dependencies, in order to allow for possible data-flow transformations. All these questions will (hopefully) be answered by executors; see the current executors proposal. It will look something like:

// Proposed (not standardized) syntax; this is not valid C++17
auto executor = get_executor();
std::sort(std::execution::par.on(executor), vec.begin(), vec.end());

Some of your questions are already defined in that proposal.

(For the 6th) There are a number of libraries that already implement similar concepts (the C++ executor design was indeed inspired by some of them), AFAIK: HPX, Thrust, and Boost.Compute. I do not know how the last two are actually implemented, but HPX uses lightweight threads, and you can configure the execution profile. Also, the expected (not yet standardized) syntax of the code above is essentially the same as in HPX (and was heavily inspired by it).
