Strange behaviour of TParallel.For default ThreadPool


Problem description

I am trying out the Parallel Programming features of Delphi XE7 Update 1.

I created a simple TParallel.For loop that basically does some bogus operations to pass the time.

I launched the program on a 36-vCPU AWS instance (c4.8xlarge) to see what the gain from Parallel Programming could be.

When I first launch the program and execute the TParallel.For loop, I see a significant gain (although admittedly a lot less than I anticipated with 36 vCPUs):

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

If I do not close the program and run the pass again on the 36 vCPU machine shortly after (for example, immediately or some 10-20 seconds later), the Parallel pass worsens a lot:

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

If I don't close the program and wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I again get the results I got when first launching the program (a 10x improvement in response time).

The very first pass right after launching the program is always faster on the 36-vCPU machine, so it seems that this effect only happens the second time a TParallel.For is called in the program.

This is the sample code I'm running:

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr, 120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool, but I do not see the point of using Parallel Programming if I have to wait minutes from request to request so that the request is served faster.
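For reference, the alternative to letting the library decide is to pass an explicit pool to TParallel.For. The following is only a minimal sketch of that overload: the program name, the worker-thread limits and the trivial loop body are illustrative assumptions, and nothing in this thread establishes that a private pool avoids the slowdown described above.

```delphi
program ExplicitPoolSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes,
  System.Threading;

var
  Pool: TThreadPool;
  Total: Integer;
begin
  // A private pool used instead of TThreadPool.Default.
  // The limits below are assumptions chosen for illustration only.
  Pool := TThreadPool.Create;
  try
    Pool.SetMinWorkerThreads(TThread.ProcessorCount);
    Pool.SetMaxWorkerThreads(TThread.ProcessorCount * 2);

    Total := 0;
    // Same TParallel.For call shape as in the question, with the pool
    // passed as the last argument.
    TParallel.For(1, 5000,
      procedure(Value: Integer)
      begin
        // Placeholder work: accumulate atomically across worker threads.
        AtomicIncrement(Total, Value mod 26);
      end,
      Pool);

    Writeln('Total: ', Total);
  finally
    Pool.Free;
  end;
end.
```

Timing repeated passes with and without the explicit pool would show whether the pool configuration has any bearing on the behaviour.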

Am I missing anything on how a TParallel.For loop is supposed to be used?

Please note that I cannot reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). On that instance, I always get a slight improvement, and I do not get a worse result in subsequent calls of TParallel.For shortly after.

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

So it seems that this effect occurs when there are many cores available (36), which is a pity because the whole point of Parallel Programming is to benefit from many cores. I wonder if this is a library bug because of the high count of cores or the fact that the core count is not a power of 2 in this case.
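To compare instances, it may help to print what the runtime itself reports on each machine. This is a small diagnostic sketch, assuming the default pool exposes its limits through the MinWorkerThreads and MaxWorkerThreads properties; it only reports values and does not change any behaviour.

```delphi
program PoolInfoSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes,
  System.Threading;

begin
  // Logical processors as seen by the RTL (the vCPU count on an AWS instance).
  Writeln('Processor count:    ', TThread.ProcessorCount);
  // Worker-thread limits of the pool that TParallel.For uses by default.
  Writeln('Min worker threads: ', TThreadPool.Default.MinWorkerThreads);
  Writeln('Max worker threads: ', TThreadPool.Default.MaxWorkerThreads);
end.
```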

UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:

  • 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
  • 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel call (it makes it unusable for production)
  • 16 vCPUs (c3.4xlarge). You have to wait a fraction of a second between calls. It could be usable if load is low but response time is still important
  • 8 vCPUs (c3.2xlarge). It seems to work normally
  • 4 vCPUs (c3.xlarge). It seems to work normally
  • 2 vCPUs (m3.large). It seems to work normally

Solution

I created two test programs, based on yours, to compare System.Threading and OTL. I built with XE7 update 1, and OTL r1397. The OTL source that I used corresponds to release 3.04. I built with the 32 bit Windows compiler, using release build options.

My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad core processors. That's 8 processors in total, but the system says there are 16 due to hyper-threading. Experience tells me that hyper-threading is just marketing guff and I've never seen scaling beyond a factor of 8 on this machine.

Now for the two programs, which are almost identical.

System.Threading

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

And now the output.

System.Threading output

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL output

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial.
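The speed-up factor is simply the serial time divided by the parallel time of the same pass. As a sketch, the helper below (Report is a hypothetical name, not part of either test program) applies that to a few pairs of timings copied from the output listings above.

```delphi
program SpeedupSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Prints the ratio of serial to parallel elapsed time for one pass.
procedure Report(const Name: string; SerialMs, ParallelMs: Integer);
begin
  Writeln(Format('%s: %.1fx', [Name, SerialMs / ParallelMs]));
end;

begin
  // Timings taken from the listings above.
  Report('System.Threading, pass 1', 2423, 374);
  Report('System.Threading, pass 4', 2430, 2412);
  Report('OTL, pass 1', 2475, 422);
  Report('OTL, pass 2', 2438, 335);
end.
```

A ratio near 1 for the later System.Threading passes means the parallel loop is no faster than the serial one.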

Conclusion

The code is astonishingly simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.

There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helper, earlier versions of System.IOUtils, FireMonkey. The list goes on.

It seems clear that quality is a big problem with Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library where bugs can lie dormant and only be exposed in specific hardware/software configurations. The experience from TMonitor leads me to believe that Embarcadero do not have sufficient expertise to produce high quality, correct, threading code.

My advice is that you should not use System.Threading in its current form. Until such a time as it can be seen to have sufficient quality and correctness, it should be shunned. I suggest that you use OTL.


EDIT: The original OTL version of the program had a live memory leak, caused by an ugly implementation detail. Parallel.For creates tasks with the .Unobserved modifier, so those tasks are only destroyed when an internal message window receives a 'task has terminated' message. This window is created in the same thread as the Parallel.For caller, i.e. in the main thread in this case. As the main thread was not processing messages, tasks were never destroyed and memory consumption (plus other resources) just piled up. It is possible that the program hung after some time because of that.
