Comparing performance of two copying techniques?

Problem description

To copy a huge array of doubles to another array, I have the following two options:

Option 1

copy(arr1, arr1+N, arr2);

Option 2

#pragma omp parallel for
for(int i = 0; i < N; i++)
    arr2[i] = arr1[i];

For a large value of N, which of these options will be faster, and under what conditions?

System configuration:
Memory: 15.6 GiB
Processor: Intel® Core™ i5-4590 CPU @ 3.30GHz × 4
OS-Type: 64-bit
Compiler: gcc (Ubuntu 4.9.2-0ubuntu1~12.04) 4.9.2

Solution

Practically, if performance matters, measure it.
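One way to measure it is sketched below. This is only an illustration: the helper names best_seconds, copy_std, and copy_omp_loop are hypothetical and not part of the original question or answer. The sketch assumes a C++11 compiler and wraps the two options from the question so they can be timed with std::chrono::steady_clock.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: calls copy_fn `repeats` times and returns the
// shortest wall-clock time in seconds. Taking the minimum reduces the
// influence of one-off effects such as page faults on the first run.
template <typename CopyFn>
double best_seconds(CopyFn copy_fn, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        copy_fn();
        const auto stop = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}

// Option 1 from the question: the library copy.
void copy_std(const double* src, double* dst, std::size_t n) {
    std::copy(src, src + n, dst);
}

// Option 2 from the question: a raw loop split across OpenMP threads.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
void copy_omp_loop(const double* src, double* dst, std::size_t n) {
    const long long count = static_cast<long long>(n);
    #pragma omp parallel for
    for (long long i = 0; i < count; ++i)
        dst[i] = src[i];
}
```

To compare the options, call something like best_seconds([&] { copy_std(arr1, arr2, N); }, 10) and the same for copy_omp_loop, with optimization enabled (for gcc, for example, -O2 -fopenmp). Read the destination array afterwards so the compiler cannot discard the copy, and use the N and the hardware you actually care about.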

std::copy and memcpy are usually highly optimized, using sophisticated performance tricks. Your compiler may or may not be clever enough, or have the right configuration options, to get that performance out of a raw loop.

That said, parallelizing the copy can in theory provide a benefit. On modern systems you must use multiple threads to fully utilize both your memory and cache bandwidth. Take a look at these benchmark results, where the first two rows compare parallel versus single-threaded cache bandwidth, and the last two rows compare parallel versus single-threaded main memory bandwidth. On a desktop system like yours, the gap is not very large. On a system built for high performance, especially one with multiple sockets, more threads are very important to exploit the available bandwidth.

For an optimal solution, you have to consider things like not writing to the same cache line from multiple threads. Also, if your compiler doesn't produce perfect code from the raw loop, you may have to actually run std::copy on multiple threads, each handling its own chunk. In my tests, the raw loop performed much worse because it doesn't use AVX. Only the Intel compiler managed to actually replace parts of the OpenMP loop with an avx_rep_memcpy; interestingly, it did not perform this optimization on a non-OpenMP loop. The optimal number of threads for memory bandwidth is also usually not the number of cores, but fewer.
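Running std::copy on multiple chunks could look roughly like the sketch below. It is not the code used in the tests mentioned above; the function name chunked_copy and the chunk_elems parameter are made up for illustration. It assumes that giving each thread whole chunks, with a chunk size that is a multiple of 8 doubles (64 bytes, a common cache-line size), keeps threads from writing to the same cache line; that only holds if the destination array is itself cache-line aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: splits [0, n) into chunks of chunk_elems doubles and
// lets each OpenMP thread copy whole chunks with std::copy, so the library's
// optimized copy routine is used inside every thread.
void chunked_copy(const double* src, double* dst, std::size_t n,
                  std::size_t chunk_elems) {
    assert(chunk_elems > 0);
    // Number of chunks, rounding up so a partial tail chunk is included.
    const long long chunks =
        static_cast<long long>((n + chunk_elems - 1) / chunk_elems);
    // Without OpenMP enabled the pragma is ignored and chunks run serially.
    #pragma omp parallel for schedule(static)
    for (long long c = 0; c < chunks; ++c) {
        const std::size_t begin = static_cast<std::size_t>(c) * chunk_elems;
        const std::size_t end = std::min(begin + chunk_elems, n);
        std::copy(src + begin, src + end, dst + begin);
    }
}
```

A call such as chunked_copy(arr1, arr2, N, 1 << 16) would copy in chunks of 65536 doubles; that value is only a starting point to tune by measurement. The number of threads can be varied with the standard OMP_NUM_THREADS environment variable to check whether fewer threads than cores gives better bandwidth on a given machine.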

The general recommendation is: start with a simple implementation, in this case the idiomatic std::copy, and later analyze your application to understand where the bottleneck actually is. Do not invest in complex, hard-to-maintain, system-specific optimizations that may only affect a tiny fraction of your code's overall runtime. If it turns out this is a bottleneck for your application, and your hardware resources are not utilized well, then you need to understand the performance characteristics of your underlying hardware (local/shared caches, NUMA, prefetchers) and tune your code accordingly.
