C#: How does basic operation time vary with the size of the numbers?

Published: 2020/5/21 21:35:47. Tags: c#, performance, optimization, multiplication, micro-optimization


Question

The context is a function that needs to run roughly once per frame, so it is very performance-critical. The function contains a loop with operations inside it.

private int MyFunction(int number)
{
    // Code
    for (int i = 0; i <= 10000; i++)
    {
        var value = i * number;
        var valuePow2 = value * value;

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

Now, because of mathematical properties, we know that (a * b)² is equal to a² * b².

So it would be possible to rewrite my function like this:

private int MyFunction(int number)
{
    // Code
    var numberPow2 = number * number;
    for (int i = 0; i <= 10000; i++)
    {
        var iPow2 = i * i;
        var valuePow2 = numberPow2 * iPow2;

        // Some code which uses valuePow2 several times
    }
    return 0; // Not actual line
}

Intuitively, this seems like it should be faster, since number² does not vary and is now calculated only once, outside the loop. At the very least, this would be much faster for a human to do, because the x² operation inside the loop is done on a much smaller number.

What I am wondering is: in C#, when you use types like int, is multiplication actually faster with smaller numbers?

For example, will 5 * 5 execute faster than 5000 * 5000?

If so, then the second version is better for that reason, even if only by a small margin.

But if the time is constant for a given data type, then the first version of the function is better: both versions do the same number of multiplications inside the loop, but the second version does one extra multiplication before the loop starts.

I am aware that, for all intents and purposes, the performance difference is negligible. The second version was suggested to me in a Code Review because the function is critical, and I can't find any documentation to support either view.

Recommended answer

For example, will 5 * 5 execute faster than 5000 * 5000?

For compile-time constants, 5 * x is cheaper than 5000 * x because the former can be done with a single lea eax, [rdi + rdi*4].
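To make the lea point concrete, here is a minimal C# sketch. The class and method names are made up for illustration. It only demonstrates the arithmetic identity x * 5 == x + x * 4 that lets a compiler replace the multiply by a shift-and-add; whether a given .NET JIT actually emits lea for this is an assumption you would need to confirm by looking at the generated asm.

```csharp
using System;

// Hypothetical helper names, for illustration only.
public static class ConstantMultiply
{
    // Multiplication by the compile-time constant 5.
    public static int TimesFive(int x) => unchecked(x * 5);

    // The same value written as x + x*4, the shape that
    // lea eax, [rdi + rdi*4] computes in one instruction.
    public static int TimesFiveShiftAdd(int x) => unchecked(x + (x << 2));

    public static void Main()
    {
        foreach (int x in new[] { 0, 1, 7, -3, 123456, int.MaxValue })
        {
            // Both forms agree, including when the result wraps around.
            Console.WriteLine($"{x}: {TimesFive(x)} == {TimesFiveShiftAdd(x)}");
        }
    }
}
```

A constant like 5000 has no such single-instruction form, which is why the constant case differs from the runtime-variable case discussed next.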

But for runtime variables, the only integer instruction with data-dependent performance is division. This applies on any mainstream CPU: pipelining is so important that even if some cases could run with lower latency, they typically don't, because that would make scheduling harder. (You can't have the same execution unit produce 2 results in the same cycle; instead, the CPU just wants to know that putting inputs in on one cycle will definitely result in the answer coming out 3 cycles later.)
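If you want to check this on your own machine, a rough timing sketch along the following lines may help. Everything here (the class name, the mask trick used to keep operands small or large, the iteration count) is an assumption made for illustration, and Stopwatch timing is crude: JIT warm-up, frequency scaling, and background load all add noise, so a dedicated harness such as BenchmarkDotNet is preferable for real measurements.

```csharp
using System;
using System.Diagnostics;

// Hypothetical micro-benchmark sketch, not a rigorous measurement.
public static class MultiplyTiming
{
    // A dependent chain of multiplies: each iteration needs the previous
    // result, so the loop is bound by multiply/add/and latency.
    // The mask keeps the running value small (e.g. 0xF) or large (e.g. 0x7FFFFFFF).
    public static int MulChain(int seed, int factor, int mask, int iterations)
    {
        int acc = seed;
        for (int i = 0; i < iterations; i++)
        {
            acc = unchecked(acc * factor + 1) & mask;
        }
        return acc;
    }

    // Times one run of the chain and reports the result so the work is not dead code.
    public static double TimeMs(int factor, int mask, int iterations, out int result)
    {
        var sw = Stopwatch.StartNew();
        result = MulChain(1, factor, mask, iterations);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Main()
    {
        const int iterations = 100_000_000;

        // Warm-up so the JIT has compiled MulChain before timing.
        MulChain(1, 5, 0xF, 1_000_000);

        double small = TimeMs(5, 0xF, iterations, out int r1);
        double large = TimeMs(5001, 0x7FFFFFFF, iterations, out int r2);

        Console.WriteLine($"small operands: {small:F1} ms (result {r1})");
        Console.WriteLine($"large operands: {large:F1} ms (result {r2})");
    }
}
```

If the claim above holds on your hardware, the two timings should be close to each other, with any difference down to measurement noise rather than operand size.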

(For FP, again only division and sqrt have data-dependent performance on normal CPUs.)

Code using integers or FP that has any data-dependent branching can be much slower if the branches go a different way. (For example, branch prediction is "trained" on one sequence of jumps for a binary search; searching with another key will be slower because it will mispredict at least once.)

And for the record, suggestions to use Math.Pow instead of integer * are insane. Simply converting an integer to double and back is slower than multiplying it by itself with an integer multiply.
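For comparison, the two ways of squaring look roughly like this in C#. The class and method names are hypothetical. The sketch only makes the extra int-to-double and double-to-int conversions visible; it does not measure anything.

```csharp
using System;

// Hypothetical names, for illustration only.
public static class SquareDemo
{
    // One integer multiply.
    public static int SquareWithMultiply(int v) => v * v;

    // int -> double conversion, a call to Math.Pow, then double -> int conversion.
    public static int SquareWithPow(int v) => (int)Math.Pow(v, 2);

    public static void Main()
    {
        foreach (int v in new[] { 0, 5, 12, 5000 })
        {
            Console.WriteLine($"{v}: {SquareWithMultiply(v)} vs {SquareWithPow(v)}");
        }
    }
}
```

For inputs this small both versions return the same value, so the integer multiply gives up nothing while avoiding the floating-point round trip.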

Adam's answer links a benchmark that loops over a big array, where auto-vectorization is possible. SSE/AVX2 only provide packed integer multiply for elements up to 32 bits, and 64-bit elements also take more memory bandwidth. That's also why it shows speedups for 16-bit and 8-bit integers. So it finds c=a*b running at half speed on a Haswell CPU, but that does not apply to your loop case.

In scalar code, imul r64, r64 has identical performance to imul r32, r32 on Intel mainstream CPUs (since at least Nehalem) and on Ryzen (https://agner.org/optimize/). Both are 1 uop, with 3-cycle latency and 1/clock throughput.

It's only on AMD Bulldozer-family, and on Intel Atom and Silvermont, that 64-bit scalar multiply is slower. (Assuming 64-bit mode, of course! In 32-bit mode, working with 64-bit integers is slower.)

For a fixed value of number, instead of recalculating i*number, compilers can and will optimize this to inum += number. This is called a strength-reduction optimization, because addition is a "weaker" (slightly cheaper) operation than multiplication.

for(...) {
    var value = i * number;
    var valuePow2 = value * value;
}

can be compiled into asm that does something like

var value = 0;
for(...) {
    var valuePow2 = value * value;

    ...

    value += number;
}

You might try writing it by hand that way, in case the compiler isn't doing it for you.
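A runnable version of that hand-written transformation might look like the sketch below. The method names are invented, and summing valuePow2 is only a stand-in (an assumption) for the "code which uses valuePow2" in the question. The two methods return the same result even when the int arithmetic wraps, because adding number i times is the same as i * number modulo 2^32.

```csharp
using System;

// Hypothetical names; the summation stands in for the real loop body.
public static class StrengthReduction
{
    // Original shape: one multiply for value, one for valuePow2.
    public static long SumOfSquaresMultiply(int number, int count)
    {
        long sum = 0;
        for (int i = 0; i <= count; i++)
        {
            int value = unchecked(i * number);
            int valuePow2 = unchecked(value * value);
            sum += valuePow2;
        }
        return sum;
    }

    // Strength-reduced shape: value is carried across iterations
    // and updated with an add instead of being recomputed.
    public static long SumOfSquaresAdd(int number, int count)
    {
        long sum = 0;
        int value = 0;
        for (int i = 0; i <= count; i++)
        {
            int valuePow2 = unchecked(value * value);
            sum += valuePow2;
            value = unchecked(value + number);
        }
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumOfSquaresMultiply(3, 10000));
        Console.WriteLine(SumOfSquaresAdd(3, 10000));
    }
}
```

Note that the hand-written form creates a loop-carried dependency on value, so it is worth measuring both versions rather than assuming the second one wins.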

But integer multiplication is very cheap and fully pipelined, especially on modern CPUs. It has slightly higher latency than add, and can run on fewer ports (usually a throughput of only 1 per clock instead of 4 for add), but you say you're doing significant work with valuePow2. That should let out-of-order execution hide the latency.

If you check the asm and the compiler is using a separate loop counter incrementing by 1, you could also try to hand-hold your compiler into optimizing the loop to use value as the loop counter.


var maxval = number * 10000;
for (var value = 0; value <= maxval; value += number) {
    var valuePow2 = value * value;

    ...
}

Be careful if number*10000 can overflow and you need the loop to behave correctly under wrapping. In that case this loop would run far fewer iterations. (Unless number is so big that value += number also wraps...)
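The overflow hazard can be seen in a small sketch like this one. The names are hypothetical, and the guard against number <= 0 is an added assumption: with number == 0 the value-based loop would never terminate, and a negative number would make it run zero times. The second method sidesteps the wrapped bound by using long, which is one possible workaround rather than the only one.

```csharp
using System;

// Hypothetical names, for illustration only.
public static class LoopBounds
{
    // Value-based loop with an int bound. The bound wraps when
    // number * 10000 exceeds int.MaxValue. This sketch does not handle
    // value += number itself wrapping, so only call it with inputs
    // like the ones in Main.
    public static int CountWithIntBound(int number)
    {
        if (number <= 0) throw new ArgumentOutOfRangeException(nameof(number));
        int maxval = unchecked(number * 10000);
        int count = 0;
        for (int value = 0; value <= maxval; value = unchecked(value + number))
        {
            count++;
        }
        return count;
    }

    // Same loop with a 64-bit bound and counter, so nothing wraps for int inputs.
    public static int CountWithLongBound(int number)
    {
        if (number <= 0) throw new ArgumentOutOfRangeException(nameof(number));
        long maxval = (long)number * 10000;
        int count = 0;
        for (long value = 0; value <= maxval; value += number)
        {
            count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountWithIntBound(3));       // bound fits in int
        Console.WriteLine(CountWithIntBound(300000));  // bound wraps to a negative value
        Console.WriteLine(CountWithLongBound(300000)); // bound computed in 64 bits
    }
}
```

With number = 300000 the int bound wraps to a negative value, so the first loop body never runs, while the long version still performs the same 10001 iterations as the original i-based loop.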
