Should I expect GMP's mpf_class to be much slower than the primitive data type double?


Question

I am writing a C++ program to generate Mandelbrot set zooms. All of my complex numbers were originally two doubles (one for the real part, one for the imaginary part). This was working pretty fast; 15 seconds per frame for the type of image I was generating.

Because of the zooming effect, I wanted to increase precision for the more zoomed-in frames, since these frames have such a small difference between the min_x and max_x. I looked to GMP to help me out with this.

Now, it is much much slower; 15:38 minutes per frame. The settings for the image are the same as before and the algorithm is the same. The only thing that has changed is that I am using mpf_class for the decimals which need to be precise (i.e. just the complex numbers). To compare performance, I used the same precision as double: mpf_set_default_prec(64);

Does GMP change the precision of mpf_class to meet the needs of the expression? In other words, if I have two 64 bit mpf_class objects and I do a calculation with them and store the result in another mpf_class, is the precision potentially increased? This would ruin performance over time I would think, but I am not sure that this is what is causing my issue.

My Questions: Is this performance drop just the nature of GMP and other arbitrary precision libraries? What advice would you give?

Edit 1
I am (i.e. have always been) using the -O3 flag for optimizing. I have also run a test to verify that GMP is not automatically increasing the precision of the mpf_class objects. So the question still remains as to the reason for the drastic performance decrease.

Edit 2
As a demonstrative example, I compiled the following code as g++ main.cpp -lgmp -lgmpxx once as shown below, and once with every double replaced with mpf_class. With double it ran in 12.75 seconds and with mpf_class it ran in 24:54 minutes. Why is this the case when they are the same precision?

#include <gmpxx.h>

double linear_map(double d, double a1, double b1, double a2, double b2) {
  double a = (d-a1)/(b1-a1);
  return (a*(b2-a2)) + (a2);
}

int iterate(double x0, double y0) {
  double x, y;
  x = 0;
  y = 0;
  int i;
  for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
    double xtemp = x*x - y*y + x0;
    y = 2*x*y + y0;
    x = xtemp;
  }
  return i;
}

int main() {
  mpf_set_default_prec(64);
  for (int j = 0; j < 3200; j++) {
    for (int i = 0; i < 3200; i++) {
      double x = linear_map(i, 0, 3200, -2, 1);
      double y = linear_map(j, 0, 3200, -1.5, 1.5);
      iterate(x, y);
    }
  }
  return 0;
}

Answer

As explained in the comments, this kind of slowdown is entirely expected from a library such as GMP.

Builtin double multiplications are one of the areas where current-day CPUs and compilers are most heavily optimized; CPUs have multiple execution units that can perform several floating point operations in parallel, often aided by compilers that try to auto-vectorize loops (although this isn't particularly applicable to your case, as your innermost loop has a strong dependency on the previous iteration).

On the other hand, in dynamic arbitrary-precision libraries such as GMP, each operation amounts to a lot of work - there are multiple branches to examine even just to check whether both operands have the same/right amount of precision, and the calculation algorithms are generic and tailored towards the "higher precision" end, which means that they aren't particularly optimized for your current use case (using them with the same precision as double); also, GMP values can and do allocate memory when they are created, another costly operation.

I took your program and modified it slightly to make it parametric over the type to use (with a #define), reducing the side of the sampled square (from 3200 to 800, to make tests faster) and adding an accumulator of the return value of iterate to print at the end, both to check that everything works the same way between the various versions and to make sure the optimizer doesn't drop the loop completely.

The double version on my machine takes roughly 0.16 seconds and, run under a profiler, exhibits a completely flat profile in the flamegraph; everything happens in iterate.

The GMP version instead, as expected, takes 45 seconds (300x; you talked about a 60x slowdown, but you were comparing against an unoptimized base case) and its profile is way more varied.

As before, iterate is taking pretty much the whole time (so, as far as optimization is concerned, we can ignore linear_map completely). All those "towers" are calls into GMP code; the __gmp_expr<...> stuff isn't particularly relevant - it is just template boilerplate to make complex expressions evaluate without too many temporaries, and it gets completely inlined. The bulk of the time is spent at the top of those towers, where the actual calculations are performed.

Indeed, ultimately most of the time is spent in GMP primitives and in memory allocation.

Given that we cannot touch the GMP internals, the only thing we can do is to be more careful in using it, as each GMP operation is indeed costly.

Indeed, it's important to keep in mind that, while the compiler can avoid calculating the same expression multiple times for double values, it cannot do the same for GMP values, as they both have side effects (memory allocation, external function calls) and are too complicated for it to examine anyway. In your inner loop we have:

  T x, y;
  x = 0;
  y = 0;
  int i;
  for (i = 0; i < 1000 && x*x + y*y <= 65536; i++) {
    T xtemp = x*x - y*y + x0;

(T is the generic type I'm using, defined as either double or mpf_class)

Here you are calculating x*x and y*y twice at each iteration. We can optimize it as:

T x = 0, y = 0, xsq, ysq;
for(i = 0; i < 1000; i++) {
  xsq = x*x;
  ysq = y*y;
  if(xsq+ysq > 65536) break;
  T xtemp = xsq - ysq + x0;

Re-running the GMP version with this modification, we get down to 38 seconds, an 18% improvement.

Notice that we kept xsq and ysq out of the loop to avoid re-creating them at each iteration; this is because, unlike double values (which ultimately are just register space or, at worst, stack space, both of which are free and handled statically by the compiler), mpf_class objects aren't free to re-create every time, as hinted by the prominence of memory allocation functions in the profiler trace above; I'm not entirely aware of the inner workings of the GMP C++ wrapper, but I suspect that it enjoys optimizations similar to std::vector - on assignment, an already allocated value can recycle its space instead of allocating again.

Hence, we can hoist even the xtemp definition out of the loop:

int iterate(T x0, T y0) {
  T x = 0, y = 0, xsq, ysq, xtemp;
  int i;
  for (i = 0; i < 1000; i++) {
    xsq = x*x;
    ysq = y*y;
    if(xsq+ysq > 65536) break;
    xtemp = xsq - ysq + x0;
    y = 2*x*y + y0;
    x = xtemp;
  }
  return i;
}

which brings the runtime down to 33 seconds, which is 27% less than the original time.

The flamegraph is similar to the one before, but it seems more compact - we have shaved off some of the "interstitial" waste of time, leaving just the core of the calculations. Most importantly, looking at the top hotspots, we can indeed see that multiplication switched places with subtraction, and malloc/free lost several positions.

I don't think that this can be optimized much more from a purely black-box perspective. If those are the calculations to do, I fear there's no easy way to perform them faster using GMP's mpf_class. At this point you should either:

  • drop GMP and use some other library, with better performance for the fixed-size case; I suspect there is something to gain here - even just completely avoiding allocations and inlining the calculation is a big win;
  • start applying some algorithmic optimizations; these will provide significantly shorter run time whatever data type you'll ultimately decide to use.

Notes

  • the full code (in its various iterations) can be found at https://bitbucket.org/mitalia/mandelbrot_gmp_test/
  • all tests done with g++ 7.3 with optimization level -O3 on 64 bit Linux running on my i7-6700
  • profiling performed using perf record from linux-tools 4.15 with call stack capture; graphs & tables generated by KDAB Hotspot 1.1.0

