/CLR floating point performance, inter-assembly function call performance


Problem description


I have run an experiment to try to learn some things about floating point
performance in managed C++. I am using Visual Studio
2003. I was hoping to get a feel for whether or not it would make sense to
punch out from managed code to native code (I was using
IJW) in order to do some amount of floating point work and, if so, what that
certain amount of floating point work was
approximately.
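
To make the "punching out" concrete, the sketch below shows the mechanism being weighed here, under my own assumptions (the function names are made up and are not from the test program): with Managed Extensions for C++, a /clr translation unit can switch code generation per function with #pragma unmanaged / #pragma managed, and an IJW call from IL code into the native function is just an ordinary call that happens to incur a managed-to-native transition.

#pragma unmanaged
// Compiled to native x86 even though the file is built with /clr.
void scaleNative(double *dst, const double *src, int n)   // hypothetical worker
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0;   // stand-in for the real floating point work
}
#pragma managed

// Compiled to IL; the call below is the IJW managed-to-native transition.
void scaleFromManaged(double *dst, const double *src, int n)
{
    scaleNative(dst, src, n);
}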

To attempt to do this I made a program that applies a 3x3 matrix to an array
of 3D points (all doubles here folks). The program
contains a function that applies 10 different matrices to the same test data
set of 5,000,000 3D points. It does this by invoking
another workhorse function that does the actual floating point operations.
That function takes an input array of 3D points, an
output array of 3D points, a point count, and the matrix to use. There are
no __gc types in this program. It's just pointers and
structs and native arrays. The outer test function looks like this:

void test_applyMatrixToDPoints(TestData *tdP, int ptsPerMultiply)
{
    int jIterations = tdP->pointCnt / ptsPerMultiply;
    for (int i = 0; i < tdP->matrixCnt; ++i)
    {
        for (int j = 0; j < jIterations; ++j)
        {
            // managed-to-native transitions happen here in V2
            DMatrix3d_multiplyDPoint3dArray(tdP->matrices + i,
                                            &tdP->outPts[j*ptsPerMultiply],
                                            &tdP->inPts[j*ptsPerMultiply],
                                            ptsPerMultiply);
        }
    }
}

The program calls the above routine 8 times and records the time elapsed
during each call. On the first call the above function
calls the workhorse function only once for each of the 10 matrices. In
other words, it applies a matrix to all of the 5,000,000
points in the test data set with a single call to the other workhorse
function. In the next call to the above function it passes
only 50,000 points per-call to the other routine, then 5,000, then 500, et
cetera, until we get all of the way down to 5, and then
finally 1 where there is a function call to
DMatrix3d_multiplyDPoint3dArray() for each and every of the 5,000,000 3D
points in the
test data set.
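
A driver for those eight timed passes might look roughly like the sketch below. This is a reconstruction rather than the poster's code: the exact granularity sequence and the use of QueryPerformanceCounter are assumptions, and TestData is the struct already referenced by test_applyMatrixToDPoints above.

#include <windows.h>
#include <stdio.h>

void timeAllGranularities(TestData *tdP)
{
    // Points handed to the workhorse per call, coarsest to finest (assumed sequence).
    const int grain[8] = { 5000000, 500000, 50000, 5000, 500, 50, 5, 1 };
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    for (int run = 0; run < 8; ++run)
    {
        QueryPerformanceCounter(&t0);
        test_applyMatrixToDPoints(tdP, grain[run]);
        QueryPerformanceCounter(&t1);
        printf("%7d pts/call: %.3f s\n", grain[run],
               (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    }
}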

I was hoping someone could help interpret the results. At first I made 3
versions of this program. In all 3 of these versions
the DMatrix3d_multiplyDPoint3dArray function was in a geometry.dll and the
rest of the code was in my test.exe. The 3 versions
were merely different combinations of native versus IL for the two
executables:

test.exe geometry.dll (contains workhorse function)
-------- ----------------
v1) native native
v2) managed native
v3) managed managed
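
For reference, the native-versus-IL split comes down to whether each image is compiled with /clr. Assumed VS2003 compile lines (compile step only, linking omitted, all other switches left out for brevity) would be along these lines:

cl /clr /c test.cpp        (test.exe as IL, versions 2 and 3; drop /clr for version 1)
cl /clr /c geometry.cpp    (geometry.dll as IL, version 3; drop /clr for versions 1 and 2)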

Here are the results. All numbers are elapsed time in seconds for calls to
the outer function described.

Native->Native:
0.953
0.968
0.968
0.953
0.968
0.952
1.093
1.39
Final run is 146% of first run.
Final run is 127% of previous run

Managed->Native
0.968
0.968
0.968
0.969
0.968
0.968
1.124
1.952
Final run is 202% of first run.
Final run is 174% of previous run

Managed->Managed
0.984
1.016
0.985
1
1
1.032
1.516
4.469
Final run is 454% of first run.
Final run is 295% of previous run

This surprised me in two ways. First, I thought that for version 2 the
penalty imposed by managed->native transitions would be
worse. It's there; you can see performance drop off more as the call
granularity becomes very fine toward the end, but it isn't
as much as I might have guessed it would be. More surprising was that the
managed->managed version, which didn't have any
managed->native transitions slowing it down at all, dropped off far worse!
The early calls to the test function compare very
closely between versions 2 and 3, suggesting that the raw floating point
performance of the managed versus native workhorse
function is quite similar. So this seemed to point the finger at function
call overhead. For some reason function call overhead
is just higher for managed code than for native? On a hunch I decided to
make a fourth version of the program that was also
managed->managed but which eliminated the inter-assembly call. Instead I
just linked everything from geometry.dll right into
test.exe. It made a big difference. The results are below. Is there some
security/stack-walking stuff going on in the inter-DLL
case maybe? Or does it really make sense that managed, inter-assembly calls
are that much slower than the equivalent
intra-assembly call? Explanations welcomed. The inter-assembly version
takes 217% of the time that the intra-assembly version
takes on the final call when the call granularity is fine. That seems
awfully harsh.
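
One way to test that suspicion, sketched here purely as an assumption (none of these names are in the test program, the linkage details for the cross-assembly function are glossed over, and an optimizer may inline or discard empty calls unless the bodies are given a side effect), is to time an empty function defined in geometry.dll against an identical empty function linked into test.exe, so that any difference is pure call and transition overhead rather than floating point work:

#include <windows.h>
#include <stdio.h>

void nopInOtherAssembly();     // assumed to be defined in geometry.dll
void nopInSameAssembly() {}    // identical body, linked into test.exe

void compareCallCost(int n)
{
    LARGE_INTEGER f, a, b, c;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&a);
    for (int i = 0; i < n; ++i) nopInOtherAssembly();   // inter-assembly calls
    QueryPerformanceCounter(&b);
    for (int i = 0; i < n; ++i) nopInSameAssembly();    // intra-assembly calls
    QueryPerformanceCounter(&c);
    printf("inter: %.3f s   intra: %.3f s\n",
           (double)(b.QuadPart - a.QuadPart) / (double)f.QuadPart,
           (double)(c.QuadPart - b.QuadPart) / (double)f.QuadPart);
}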

Managed->Managed (one big test.exe)
1
0.999
0.984
1.015
0.984
1.015
1.093
2.061
Final run is 206% of first run.
Final run is 189% of previous run.

Even with the improvement yielded by eliminating the inter-assembly calls,
the relative performance between the version that has
to make managed->native transitions and the all managed version is difficult
for me to comprehend. What is it with
managed->managed function call overhead that seems worse even than
managed->native function call overhead?

I tried to make sure that page faults weren't affecting my test runs and the
results I got were very consistent from run to run.

Bern McCarty
Bentley Systems, Inc.

P.S. For the curious, here is what DMatrix3d_multiplyDPoint3dArray looks
like. There are no function calls made and it is all compiled into IL.

void DMatrix3d_multiplyDPoint3dArray
(
    const DMatrix3d *pMatrix,
    DPoint3d *pResult,
    const DPoint3d *pPoint,
    int numPoint
)
{
    int i;
    double x, y, z;
    DPoint3d *pResultPoint;

    for (i = 0, pResultPoint = pResult;
         i < numPoint;
         i++, pResultPoint++
        )
    {
        x = pPoint[i].x;
        y = pPoint[i].y;
        z = pPoint[i].z;

        pResultPoint->x = pMatrix->column[0].x * x
                        + pMatrix->column[1].x * y
                        + pMatrix->column[2].x * z;

        pResultPoint->y = pMatrix->column[0].y * x
                        + pMatrix->column[1].y * y
                        + pMatrix->column[2].y * z;

        pResultPoint->z = pMatrix->column[0].z * x
                        + pMatrix->column[1].z * y
                        + pMatrix->column[2].z * z;
    }
}

Recommended answer


Hello Bern,

Generally speaking, the v1 JIT does not currently perform all the
FP-specific optimizations that the VC++ backend does, making floating point
operations more expensive for now. That may be why managed->managed is more
expensive than managed->unmanaged in your test.

So for areas which make heavy use of floating point arithmetic, please use
profilers to pick the fragments where the overhead is costing you most, and
keep the whole fragment in unmanaged space.

Also, work to minimize the number of transitions you make. If you have some
unmanaged code or an interop call sitting in a loop, make the entire loop
unmanaged. That way you'll only pay the transition cost twice, rather than
for each iteration of the loop.
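
In terms of the code from the question, that advice might look like the sketch below; the wrapper names are invented, and only DMatrix3d_multiplyDPoint3dArray and its types come from the original post.

// Worst case: a managed loop that crosses into native code once per point.
void applyPerPoint(const DMatrix3d *pMatrix, DPoint3d *pOut, const DPoint3d *pIn, int n)
{
    for (int i = 0; i < n; ++i)
        DMatrix3d_multiplyDPoint3dArray(pMatrix, &pOut[i], &pIn[i], 1);   // transition every iteration
}

// Better: keep the loop on the native side so the managed caller pays the
// transition cost only on entering and leaving this function.
#pragma unmanaged
void applyBatchNative(const DMatrix3d *pMatrix, DPoint3d *pOut, const DPoint3d *pIn, int n)
{
    DMatrix3d_multiplyDPoint3dArray(pMatrix, pOut, pIn, n);   // one call covers all n points
}
#pragma managed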

By looking into the IL code, we can see that there are some extra IL
instructions involved when interoperating. So minimizing the number of
transitions can save many IL instructions and improve performance.

For some more information, you can refer to this chapter online:
"Chapter 7 ?a Improving Interop Performance"
http://msdn.microsoft.com/library/en...pt07.asp?frame
=true#scalenetchapt07 _topic12

Hope that helps.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.


从阅读各种各样的东西,我已经认识到你所说的东西,b $ b状态为目前的传统智慧。我很难发布我的

结果,希望能得到一些反馈,说明为什么我的

结果会违背传统智慧。请考虑:


1)托管代码的浮点性能。至少在这个小小的b $ b测试场景中托管代码的浮点性能似乎根本不是问题。在测试运行中第一次调用8时,要求将
DMatrix3d_multiplyDPoint3dArray函数应用于每次调用高达5,000,000个3D点的

。所以它只是坐在那里在5,000,000迭代循环中执行
浮点运算,并且在该循环中根本没有

函数调用。在这种情况下,托管版本的版本仅比原始版本长3%。然后排除

浮点性能作为罪魁祸首,因为在后来的呼叫中,事情很快就会发生变化,其中呼叫粒度为

DMatrix3d_multiplyDPoint3dArray变得非常好。更有意义的是,

分配在函数调用上的细粒度调用情况中观察到的减速

开销,而不是浮点性能。

2)过渡费用。我究竟做错了什么?我的

测试程序的版本涉及来自

test_applyMatrixToDPoints-> DMatrix3d_multiplyDPoint3dArray的调用转换实际上比所有管理的更快
version(对于组件内部和

组装间调用情况都是如此)。此外,更细粒度的调用

越多,本机 - >托管版本的性能就越好于托管管理的
版本。既然我们已经确定DMatrix3d_multiplyDPoint3dArray函数内部循环的原始浮点性能非常可靠,那么在托管版本和本机版本以及传统版本之间相当于

智慧是原生的>管理过渡是昂贵和坏的,然后是什么

应该归咎于托管的不良相对性能>管理

版本?托管>托管版本完全被版本

击败,为每次调用进行转换。似乎有一些严重的惩罚与定期管理 - >管理

函数调用相关 - 而不是托管 - >本机调用。什么可能负责

它是否是我可以控制的东西?


3)组装间和成本之间惊人的成本差异/>
组件内托管 - >托管呼叫。有人可以解释这个差异

除了制作我的程序之外还有什么可以做的吗

一个巨大的可执行文件?


4)如何在发布可执行文件的

调试器中以汇编语言逐步执行JIT编译代码,以便我可以看到发生了什么?我希望JIT能够生成非调试。 x86指令,但我想通过它们步骤

来看看它们做了什么。提示赞赏。我可以用

VS.NET调试器吗? WinDbg的?怎么样?


Yan-Hong Huang [MSFT]" < YH ***** @ online.microsoft.com>在消息中写道

news:kG ************** @ cpmsftngxa10.phx.gbl ...
From reading various things I had already recognized the things that you
state as the current conventional wisdom. I went to the trouble to post my
results in the hopes of getting some feedback on why it might be that my
results run very much against that conventional wisdom. Please consider:

1) Floating point performance of managed code. At least in this little
test scenario, floating point performance of managed code doesn't seem to be
a problem at all. In the first call out of the 8 in a test run the
DMatrix3d_multiplyDPoint3dArray function is asked to apply the matrix to a
whopping 5,000,000 3D points per call. So it is just sitting there doing
floating point operations in a 5,000,000 iteration loop and there are no
function calls in that loop at all. The managed version took only 3% longer
in that case than the all native version. It seems logical then to rule out
floating point performance as the culprit when things quickly change for the
worse in the later calls where the call granularity to
DMatrix3d_multiplyDPoint3dArray becomes very fine. It makes more sense to
attribute the slowdown observed in the fine-grained call cases to function call
overhead, not to floating point performance.

2) The expense of transitions. What am I doing wrong? The version of my
test program that involves a transition in the call from
test_applyMatrixToDPoints->DMatrix3d_multiplyDPoint3dArray is actually
FASTER than the all managed version (true for both the intra-assembly and
inter-assembly call cases). Furthermore, the more finely-grained the calls
are the more the native->managed version outperforms the managed-managed
versions. Since we already established that raw floating point performance
of the loop inside of the DMatrix3d_multiplyDPoint3dArray function is very
equivalent between the managed and native versions, and the conventional
wisdom is that native->managed transitions are expensive and bad, then what
is to blame for the poor relative performance of the managed->managed
versions? The managed->managed version is flat-out beaten by the version
that does a transition for each and every call. It would seem that there is
some serious penalty associated with making regular managed->managed
function calls - not managed->native calls. What might be responsible for
it and is it something I have any control over?

3) The surprising difference in cost between inter-assembly and
intra-assembly managed->managed calls. Can someone explain this difference
and is there anything that can be done about it besides making my program
one enormous executable?

4) How can I step through JIT compiled code in assembly language in a
debugger for a release executable so that I can see what is going on? I
want the JIT to produce "non debug" x86 instructions and yet I want to step
through them to see what they do. Tips appreciated. Can I do this with the
VS.NET debugger? Windbg? How?

"Yan-Hong Huang[MSFT]" <yh*****@online.microsoft.com> wrote in message
news:kG**************@cpmsftngxa10.phx.gbl...



嗨伯尔尼,


使用ildasm .exe,您可以查看程序集的IL代码,看看

组装间和组装内管理的差异 - >管理

调用。


与此同时,我已将您的问题转发给我们的产品团队,以获得他们对此的意见。我会尽快回到这里。


谢谢。


祝你好运,

Yanhong Huang

微软社区支持


安全! - www.microsoft.com/security

发布是按原样提供的。没有保证,也没有授予任何权利。

Hi Bern,

By using ildasm.exe, you can look into the IL code of the assembly to see
the difference between inter-assembly and intra-assembly managed->managed
calls.
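
For example (assumed invocations; /out= just writes the disassembly to a text file instead of opening the GUI):

ildasm test.exe /out=test.il
ildasm geometry.dll /out=geometry.il

Diffing the call sites of DMatrix3d_multiplyDPoint3dArray between the intra-assembly and inter-assembly builds should show how each call is encoded.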

At the same time, I have forwarded your questions to our product team for
their opinions on it. I will return here as soon as possible.

Thanks.

Best regards,
Yanhong Huang
Microsoft Community Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.


这篇关于/ CLR浮点性能,组装函数调用性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆