Calculating the sum of a large vector in parallel


Problem description

Problem background

I have a program that currently takes way too long to sum up large std::vectors of ~100 million elements using std::accumulate, and this is a bottleneck.

I want it to be faster and I want it to be an asynchronous calculation so the GUI/Server doesn't block. The calculation should also be using multithreading so I can reduce the time it takes to sum up a vector.

I want to split up the summation so that each thread sums a part of the vector and then when all partial sums are calculated, each thread's partial sum should be added together to get the total summation.

Boost.Asio?

I was wondering how I could go about this in Boost.Asio. My program ideally needs to reuse the threads (like a thread group), but I'm not sure how to store and retrieve the partial sums and finally retrieve the sum of the partial sums.

I was thinking of creating a thread group whose threads call boost::asio::io_service::run, passing a handler to compute the partial sums, but I'm not sure how to pass the partial sums to another handler and add all the partial sums together.

It would be great if someone showed some skeleton code of how I can go about this.

Solution

Is Boost.Asio suitable for this problem?

The main purpose of Boost.Asio is to provide an asynchronous model for network and I/O programming, and the problem you describe does not seem to have much to do with networking and I/O.

I think that the simplest solution is to use the threading primitives provided by either Boost or the C++ standard library.
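
As a minimal sketch of how standard library primitives can cover the "don't block the GUI/server" requirement, the following starts a std::accumulate on another thread with std::async and lets the caller poll the returned future. The helper names start_sum and is_ready are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: starts summing `v` on another thread and returns at once.
// The caller must keep `v` alive and unmodified until the future is ready.
inline std::future<long long> start_sum(const std::vector<int>& v) {
    return std::async(std::launch::async, [&v] {
        return std::accumulate(v.begin(), v.end(), 0LL);
    });
}

// Hypothetical helper: non-blocking readiness check, e.g. called from a GUI timer.
inline bool is_ready(const std::future<long long>& f) {
    assert(f.valid());
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}
```

A caller would store the future, keep servicing events, and call get() once is_ready returns true. Note that a future returned by std::async blocks in its destructor until the task has finished, so it should be kept alive rather than discarded.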

A parallel algorithm

Here's an example of a parallel version of accumulate created by only using the standard library.

#include <cstddef>
#include <future>
#include <iterator>
#include <numeric>
#include <thread>
#include <vector>

/* Minimum number of elements for multithreaded algorithm.
   Less than this and the algorithm is executed on single thread. */
static const std::size_t MT_MIN_SIZE = 10000;

// Note: the range is traversed more than once, so InputIt must be at least a forward iterator.
template <typename InputIt, typename T>
auto parallel_accumulate(InputIt first, InputIt last, T init) {
    // Determine total size.
    const auto size = static_cast<std::size_t>(std::distance(first, last));
    // hardware_concurrency() may return 0 if the value cannot be determined.
    const std::size_t hw = std::thread::hardware_concurrency();
    // Determine how many parts the work shall be split into.
    const std::size_t parts = (size < MT_MIN_SIZE || hw == 0) ? 1 : hw;

    std::vector<std::future<T>> futures;
    futures.reserve(parts);

    // For each part, calculate size and run accumulate on a separate thread.
    for (std::size_t i = 0; i != parts; ++i) {
        const auto part_size = (size * i + size) / parts - (size * i) / parts;
        futures.emplace_back(std::async(std::launch::async,
            [=] { return std::accumulate(first, std::next(first, part_size), T{}); }));
        std::advance(first, part_size);
    }

    // Wait for all threads to finish execution and accumulate results.
    return std::accumulate(std::begin(futures), std::end(futures), init,
        [] (const T prev, auto& future) { return prev + future.get(); });
}
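
To see why the part_size expression in the loop above covers every element exactly once even when the size is not divisible by the number of parts, here is a small sketch that evaluates the same expression on its own. The function chunk_sizes is a hypothetical helper introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper mirroring the part_size expression:
// chunk i gets size*(i+1)/parts - size*i/parts elements.
inline std::vector<std::size_t> chunk_sizes(std::size_t size, std::size_t parts) {
    assert(parts > 0);
    std::vector<std::size_t> sizes;
    sizes.reserve(parts);
    for (std::size_t i = 0; i != parts; ++i) {
        sizes.push_back((size * i + size) / parts - (size * i) / parts);
    }
    return sizes;
}
```

For example, chunk_sizes(10, 3) yields 3, 3, 4. The terms telescope, so the chunk sizes always add up to size, and any two chunks differ by at most one element.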

Live example (Parallel version performs about the same as sequential on Coliru, probably only 1 core available)

Timings

On my machine (using 8 threads) the parallel version gave, on average, a ~120 % boost in performance.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
5000000050000000

However, the absolute gain for 100,000,000 elements is small (25 ms), although it might be greater when accumulating an element type other than int.
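
The answer does not show how the timings were taken. One possible way to reproduce such measurements is sketched below with std::chrono::steady_clock; the helper time_ms is hypothetical and is not the author's harness.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs `f` once and returns {result, elapsed milliseconds}.
template <typename F>
auto time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    auto result = std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    // steady_clock is monotonic, so the elapsed time cannot be negative.
    assert(ms >= 0);
    return std::make_pair(std::move(result), ms);
}
```

It could be called as time_ms([&] { return parallel_accumulate(v.begin(), v.end(), 0LL); }), where v is the vector being summed. A single run is noisy, so averaging several runs in an optimized build gives more meaningful numbers.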

OpenMP

As @sehe noted in the comments, OpenMP might provide a simple solution to this problem, e.g.

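#include <cstddef>
#include <vector>

// Compile with OpenMP enabled (e.g. -fopenmp for GCC/Clang);
// otherwise the pragma is ignored and the loop runs sequentially.
template <typename T, typename U>
auto omp_accumulate(const std::vector<T>& v, U init) {
    U sum = init;

    // Each thread sums into a private copy of sum; the copies are combined with + at the end.
    #pragma omp parallel for reduction(+:sum)
    for(std::size_t i = 0; i < v.size(); i++) {
        sum += v[i];
    }

    return sum;
}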
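
One detail worth spelling out: in both parallel_accumulate and omp_accumulate the type of the init argument decides the type the sum is carried in. The reported total, 5000000050000000, does not fit in a 32-bit int, so a wider init such as 0LL is presumably what was passed. The sketch below shows the effect with plain std::accumulate; it uses unsigned values so that the narrow case wraps in a well-defined way, and it assumes a platform where unsigned int is 32 bits wide. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Sums `v` in a 32-bit accumulator: the init value fixes the accumulator type.
inline std::uint32_t sum_narrow(const std::vector<std::uint32_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint32_t{0});
}

// Sums `v` in a 64-bit accumulator, selected only by the type of the init value.
inline std::uint64_t sum_wide(const std::vector<std::uint32_t>& v) {
    return std::accumulate(v.begin(), v.end(), std::uint64_t{0});
}

// Hypothetical demonstration of the difference.
inline void demo_accumulator_width() {
    const std::vector<std::uint32_t> v{4000000000u, 4000000000u};
    assert(sum_wide(v) == 8000000000ull);                   // exact total
    assert(sum_narrow(v) == 8000000000ull % (1ull << 32));  // wrapped modulo 2^32
}
```

With a signed int accumulator the overflow would be undefined behavior rather than a wrap, which is one more reason to pass a sufficiently wide init.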

On my machine this method performed the same as the parallel method using standard thread primitives.

Sequential sum:
Time taken: 46 ms
5000000050000000
--------------------------------
Parallel sum:
Time taken: 21 ms
Sum: 5000000050000000
--------------------------------
OpenMP sum:
Time taken: 21 ms
Sum: 5000000050000000
