Strange behaviour of TParallel.For default ThreadPool


Problem description

I am trying out the Parallel Programming features of Delphi XE7 Update 1.

I created a simple TParallel.For loop that basically does some bogus operations to pass the time.

I launched the program on a 36-vCPU AWS instance (c4.8xlarge) to see what the gain from parallel programming could be.

When I first launch the program and execute the TParallel.For loop, I see a significant gain (although admittedly a lot less than I anticipated with 36 vCPUs):

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

If I do not close the program and run the pass again on the 36 vCPU machine shortly after (for example, immediately or some 10-20 seconds later), the Parallel pass worsens a lot:

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

If I don't close the program and wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I once more get the results I got when first launching the program (a 10x improvement in response time).

The very first pass right after launching the program is always faster on the 36-vCPU machine, so it seems that this effect only happens the second time TParallel.For is called in the program.

This is the sample code I'm running:

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr, 120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool, but I do not see the point of using Parallel Programming if I have to wait minutes from request to request so that the request is served faster.

Am I missing anything on how a TParallel.For loop is supposed to be used?
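
For reference, the alternative to letting the library choose is to hand TParallel.For a pool of your own. The following is only a minimal sketch: the program name, the `Pool` and `Total` variables, and the decision to pin both limits to the logical CPU count are assumptions made for illustration, and nothing in this thread establishes that doing so avoids the slowdown described above.

```pascal
program DedicatedPoolSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes,
  System.Threading;

var
  Pool: TThreadPool;  // hypothetical dedicated pool, not part of the original program
  Total: Integer;
begin
  Pool := TThreadPool.Create;
  try
    // Assumption: keeping min and max equal stops the pool from resizing
    // between passes. Both setters return False if they reject the value.
    if not Pool.SetMaxWorkerThreads(TThread.ProcessorCount) then
      Writeln('Max worker threads left unchanged');
    if not Pool.SetMinWorkerThreads(TThread.ProcessorCount) then
      Writeln('Min worker threads left unchanged');

    Total := 0;
    // Same loop shape as the test program, but run on the dedicated pool
    // (last argument) instead of TThreadPool.Default.
    TParallel.For(1, 5000,
      procedure(Value: Integer)
      begin
        AtomicIncrement(Total, Value mod 26);
      end,
      Pool);
    Writeln('Total: ', Total);
  finally
    Pool.Free;
  end;
end.
```

Timing repeated passes with such a pool, on the machine that shows the problem, would be the way to find out whether it makes any difference.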

Please note that I cannot reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). On that instance, I always get a slight improvement, and I do not get a worse result in subsequent calls of TParallel.For shortly after.

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

So it seems that this effect occurs when there are many cores available (36), which is a pity because the whole point of Parallel Programming is to benefit from many cores. I wonder if this is a library bug because of the high count of cores or the fact that the core count is not a power of 2 in this case.

UPDATE: After testing it with various instances of different vCPU counts in AWS, this seems to be the behaviour:

  • 36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel.For (which makes it unusable for production)
  • 32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel.For (which makes it unusable for production)
  • 16 vCPUs (c3.4xlarge). You have to wait for less than a second between calls. It could be usable if load is low but response time is still important
  • 8 vCPUs (c3.2xlarge). It seems to work normally
  • 4 vCPUs (c3.xlarge). It seems to work normally
  • 2 vCPUs (m3.large). It seems to work normally

Solution

I created two test programs, based on yours, to compare System.Threading and OTL. I built with XE7 update 1, and OTL r1397. The OTL source that I used corresponds to release 3.04. I built with the 32 bit Windows compiler, using release build options.

My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad core processors. That's 8 processors in total, but the system says there are 16 due to hyper-threading. Experience tells me that hyper-threading is just marketing guff and I've never seen scaling beyond a factor of 8 on this machine.
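
To check what the RTL itself reports on a given machine, a one-line console program is enough. This is a small sketch (the program name is made up); `TThread.ProcessorCount` counts logical processors, so on a hyper-threaded machine it is expected to report the larger figure rather than the number of physical cores.

```pascal
program CpuCountSketch;

{$APPTYPE CONSOLE}

uses
  System.Classes;

begin
  // Logical processors as seen by the RTL (hyper-threads included).
  Writeln('Logical processors: ', TThread.ProcessorCount);
end.
```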

Now for the two programs, which are almost identical.

System.Threading

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

And now the output.

System.Threading output

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL output

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial.
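
The speedup figure is simply serial time divided by parallel time for the same pass. As a rough sketch, the program below does that arithmetic for the first three pairs of the OTL output above; the program name and array names are illustrative, and the efficiency column assumes the 8 physical cores of the test machine.

```pascal
program SpeedupSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

const
  // First three parallel/serial timings (ms) copied from the OTL output.
  ParallelMs: array[0..2] of Integer = (422, 335, 395);
  SerialMs: array[0..2] of Integer = (2475, 2438, 2461);
  PhysicalCores = 8; // assumption: two quad-core CPUs, hyper-threads ignored

var
  i: Integer;
  Speedup: Double;
begin
  for i := Low(ParallelMs) to High(ParallelMs) do
  begin
    // Speedup = serial time / parallel time for the same pass.
    Speedup := SerialMs[i] / ParallelMs[i];
    Writeln(Format('Pass %d: speedup %.2f, efficiency %.0f%%',
      [i + 1, Speedup, 100 * Speedup / PhysicalCores]));
  end;
end.
```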

Conclusion

The code is astonishingly simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.

There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero have a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helper, earlier versions of System.IOUtils, FireMonkey. The list goes on.

It seems clear that quality is a big problem with Embarcadero. Code is released that quite clearly has not been tested adequately, if at all. This is especially troublesome for a threading library where bugs can lie dormant and only be exposed in specific hardware/software configurations. The experience from TMonitor leads me to believe that Embarcadero do not have sufficient expertise to produce high quality, correct, threading code.

My advice is that you should not use System.Threading in its current form. Until such a time as it can be seen to have sufficient quality and correctness, it should be shunned. I suggest that you use OTL.


EDIT: The original OTL version of the program had a live memory leak caused by an ugly implementation detail. Parallel.For creates tasks with the .Unobserved modifier. As a result, those tasks are only destroyed when an internal message window receives a 'task has terminated' message. This window is created in the same thread as the Parallel.For caller, i.e. in the main thread in this case. As the main thread was not processing messages, tasks were never destroyed and memory consumption (plus other resources) just piled up. It is possible that this is why the program hung after some time.
