Failed performance improvement in the nested for loop in OpenMP


Problem description


I am implementing the Sieve of Eratosthenes algorithm to find the prime numbers below n. The idea is to mark off all the multiples of each prime. However, it does not achieve a performance increase as the number of threads scales up.

It takes 0.009 seconds using 100 threads and 0.006 seconds using 1 thread to find the prime numbers below 100000.

#pragma omp parallel num_threads(t)
{
    /* list_of_nums[i] holds the value i + 2; a -1 entry marks a composite */
    #pragma omp for
    for (int index = 2; index <= N; index++) {
        list_of_nums[index - 2] = index;
    }

    #pragma omp for
    for (int loop_index = 2; loop_index <= num_to_stop; loop_index++) {
        if (list_of_nums[loop_index - 2] != -1) {
            for (int mark_index = loop_index - 2; mark_index < N - 1; mark_index += loop_index) {
                if (mark_index != loop_index - 2) {
                    if (list_of_nums[mark_index] % loop_index == 0) {
                        list_of_nums[mark_index] = -1;
                    }
                }
            }
        }
    }
}

Solution

First of all, everything else notwithstanding, parallelization is not guaranteed to improve the speed of your program. Managing multiple threads adds overhead to the computation, and that can overwhelm the speedup obtained by performing multiple computations concurrently.

Second, the scope for speedup is bounded by the amount of concurrency that can be achieved. In particular, for computations that have no blocking operations, you cannot expect to see improvement from adding more threads than you have independent execution engines (roughly, cores).

But third, and here I will focus the rest of this answer, the Sieve of Eratosthenes has data dependencies that make it poorly suited for parallelization. That you even get correct results from your parallel version arises from particular idiosyncrasies of your implementation. The issue is focused here:

        if (list_of_nums[loop_index - 2] != -1) {

That's checking whether loop_index has already been determined to be composite, so as to skip redundantly sieving out its multiples. The keyword there is "already". If loop_index is composite, and different threads than the current one were assigned to test its prime factors, then you cannot be confident that loop_index has already been marked composite.

You would be in trouble if you were choosing primes at that point and storing them in a separate list, as is common with SofE implementations. Your particular implementation, on the other hand, is merely likely to do a lot of unnecessary work sieving out multiples of composites. Thus, not only do you have the overhead arising from managing multiple threads, but you are likely to be doing more total work. It's not really the Sieve of Eratosthenes in that sense.
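To get a rough feel for how much extra work that can be, here is a hypothetical serial comparison (the function names, array layout, and the limit of 1000 are my own, not from the question): it counts inner-loop marking iterations for a true sieve, where composite loop indices are reliably skipped, versus the worst case where the "already marked" check never helps and every index sieves its multiples.

```c
#include <string.h>

#define M 1000  /* sieve limit for this comparison (arbitrary choice) */

/* Inner-loop iterations of a proper serial Sieve of Eratosthenes,
   where the "already marked composite" check is always reliable. */
static long sieve_work(void) {
    char comp[M + 1];
    memset(comp, 0, sizeof comp);
    long work = 0;
    for (int p = 2; p <= M; p++) {
        if (comp[p])
            continue;                 /* p is composite: skip sieving it */
        for (int m = 2 * p; m <= M; m += p) {
            comp[m] = 1;
            work++;
        }
    }
    return work;
}

/* Worst case when the check never helps (e.g. racing threads have not
   marked anything yet): every loop index sieves its multiples. */
static long worst_case_work(void) {
    long work = 0;
    for (int p = 2; p <= M; p++)
        for (int m = 2 * p; m <= M; m += p)
            work++;
    return work;
}
```

The first count grows roughly like N log log N, the worst case roughly like N log N, so the redundant sieving of composites adds real extra work on top of the threading overhead.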

It is possible to write a parallel version of the SofE that correctly respects its data dependencies, though I'm uncertain whether OpenMP is rich enough to describe one. I have done it in another language, and it did exhibit some speedup. But properly respecting the dependencies greatly limits the amount of concurrency available (and adds more overhead), so the speedup was pretty lackluster.

Update: possible alternatives

You know by measurement that your parallel approach has worse performance than the underlying serial one. You can try to tweak parameters, such as the precise number of threads used, but you're probably better off going in a different direction. Among the promising alternatives are:

  • just use the serial version of your algorithm. According to your measurement, that cuts run time by 33%, which is not shabby at all.

  • pre-compute your sieve / prime list instead of computing it on the fly. Then performance of the computation is not a factor in the performance of your larger service.

  • pre-seed your sieve by marking multiples of several small primes, in parallel and accepting the redundancies involved, then run standard, serial SofE from there. You would probably want to tune the numbers of known primes and of threads to use in the pre-seeding process by making appropriate performance measurements for different choices.
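A minimal sketch of that pre-seeding idea, under assumptions of my own (a plain `is_composite` array rather than the question's `list_of_nums` layout, and a hard-coded list of small primes): phase 1 marks multiples of a few known primes in parallel, where double-marking a cell is harmless so no synchronization is needed, and phase 2 finishes with the ordinary serial sieve.

```c
#include <string.h>

#define N 100000                     /* the limit from the question */

static char is_composite[N + 1];     /* nonzero => not prime */

static int count_primes_preseeded(void) {
    memset(is_composite, 0, sizeof is_composite);

    /* Phase 1: pre-seed in parallel.  Marking a cell twice is harmless,
       so the threads need no synchronization; the redundancy is the
       accepted cost. */
    static const int small_primes[] = {2, 3, 5, 7, 11, 13};
    #pragma omp parallel for
    for (int i = 0; i < 6; i++) {    /* 6 == number of pre-seed primes */
        int p = small_primes[i];
        for (long m = (long)p * p; m <= N; m += p)
            is_composite[m] = 1;
    }

    /* Phase 2: standard serial Sieve of Eratosthenes from there.
       Correctness does not depend on phase 1, which only marks
       true composites; phase 1 just saves it some of the marking. */
    for (long p = 2; p * p <= N; p++)
        if (!is_composite[p])
            for (long m = p * p; m <= N; m += p)
                is_composite[m] = 1;

    int count = 0;
    for (int v = 2; v <= N; v++)
        if (!is_composite[v])
            count++;
    return count;
}
```

Without `-fopenmp` the pragma is simply ignored and the function still produces correct serial results, which makes it easy to A/B the thread-count and prime-count tuning the bullet describes.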

Additionally, there are some micro-optimizations you could implement to (probably) eke out a little speedup even from the serial version. That's tangential to the question, so I will not go into specifics here, but you should easily find examples around the Net.
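For example (these are common textbook micro-optimizations, not anything taken from the question's code): store odd numbers only, start marking each prime at p*p, and stop the outer loop at the square root of the limit.

```c
#include <string.h>

#define LIMIT 100000

/* Serial sieve with common micro-optimizations: odd numbers only,
   marking starts at p*p, outer loop stops at sqrt(LIMIT). */
static int count_primes_fast(void) {
    static char comp[(LIMIT - 1) / 2];   /* comp[i] <=> 2*i + 3 is composite */
    memset(comp, 0, sizeof comp);
    for (long p = 3; p * p <= LIMIT; p += 2)
        if (!comp[(p - 3) / 2])
            for (long m = p * p; m <= LIMIT; m += 2 * p)  /* odd multiples only */
                comp[(m - 3) / 2] = 1;
    int count = 1;                       /* 2, the only even prime */
    for (long i = 0; 2 * i + 3 <= LIMIT; i++)
        if (!comp[i])
            count++;
    return count;
}
```

Halving the array also halves the memory traffic, which tends to matter more than the arithmetic for sieves of this size.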
