Efficient treatment of tuples as fixed-size vectors


Question


In Chapel, homogeneous tuples can be used as if they were small "vectors"
( e.g., a = b + c * 3.0 + 5.0; ).


However, because various math functions are not provided for tuples, I have tried writing a function for norm() in several ways and compared their performance. My code is something like this:

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );


I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11 with 4 cores, installed via Homebrew). Then norm_3tuple(), norm_loop(), and norm_loop_param() gave almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). I checked the output of the top command: norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are...

  • Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?
  • Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?
  • In norm_loop_param(), I have also tried using the param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?


I'm sorry for asking so many questions at once, and I would appreciate any advice/suggestions for efficient treatment of small tuples. Thanks very much!

Answer


Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?


I believe you are correct that this is what's going on. Reductions are executed in parallel, and Chapel currently doesn't attempt to do any intelligent throttling to squash this parallelism when the work may not warrant it (as in this case), so I think you're suffering from too much task overhead to do almost no work other than coordinating with the other tasks (though I am surprised that the magnitude of the difference is so large... but I also find I have little intuition for such things). In the future, we'd hope that the compiler would serialize such small reductions in order to avoid these overheads.
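As a hedged illustration of squashing that parallelism by hand, Chapel's serial statement suppresses task creation within its dynamic scope, so a variant along these lines should run the reduction on a single task (norm_reduce_serial is a made-up name, and whether this recovers all of the lost time is untested here):

proc norm_reduce_serial( x ): real
{
    var tmp: real;
    // 'serial' executes the enclosed block without spawning extra tasks
    serial
    {
        tmp = ( + reduce x**2 );
    }
    return sqrt( tmp );
}

For a 3-tuple this ends up doing essentially the same work as the serial loop versions above, so simply calling norm_loop() or norm_loop_param() is the simpler fix.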


Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by the --fast option)?


The Chapel compiler doesn't unroll the explicit for loop in norm_loop() (and you can verify this by inspecting the code generated with the --savec flag), but it could be that the back-end compiler is. Or that the for-loop really doesn't cost that much compared to the unrolled loop of norm_loop_param(). I suspect you'd need to inspect the generated assembly to determine which is the case. But I also expect that back-end C compilers would do decently with the code we generate -- e.g., it's easy for it to see that it's a 3-iteration loop.
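For reference, a for param loop is expanded at compile time, so norm_loop_param() should be roughly equivalent to the hand-unrolled sketch below (norm_unrolled is a made-up name); the generated C code mentioned above can be kept for inspection by compiling with something like chpl --fast --savec <dir> test.chpl:

// Roughly what the compile-time 'for param' loop in norm_loop_param()
// expands to for a 3*real argument:
proc norm_unrolled( x: 3*real ): real
{
    var tmp = 0.0;
    tmp += x[1]**2;   // param iteration i == 1
    tmp += x[2]**2;   // param iteration i == 2
    tmp += x[3]**2;   // param iteration i == 3
    return sqrt( tmp );
}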

In norm_loop_param(), I have also tried using the param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?


This is hard to give a definitive answer to since I think it's mostly a question about how good the back-end C compiler is.
