How much does parallelization help performance if the program is memory-bound?

Question

I parallelized a Java program. On a Mac with 4 cores, these are the times for different numbers of threads.

threads   1           2           4           8           16
time      2597192200  1915988600  2086557400  2043377000  1931178200

On a Linux server with two sockets, each with 4 cores, these are the measured times.

threads   1           2           4           8           16
time      4204436859  2760602109  1850708620  2370905549  2422668438

As you can see, the speedup is far from linear. There is almost no parallelization overhead in this case, such as synchronization or I/O dependencies.

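I have two questions:

  1. Do these data indicate that this Java program is memory-bound?
  2. If so, is there any way to further improve performance without changing the hardware?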

Recommended answer

Answering the Title Question

Amdahl's Law explains that the speed-up obtained by parallelizing a program depends on how much of the program is parallelizable.

We must also account for the overhead of coordinating the parallelism.

So we consider what fraction of the program is parallelizable, and what overhead (synchronization, communication, false sharing, etc.) is incurred.
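As a rough illustration of Amdahl's Law, the sketch below plugs the Linux server timings from the question into the formula S(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n is the number of threads. The class and method names are made up for this example, the times are assumed to be nanoseconds, and coordination overhead is ignored, so the implied fraction is only an estimate.

```java
// Sketch: Amdahl's Law applied to the timings from the question.
// Assumptions: times are in nanoseconds; coordination overhead is ignored.
class AmdahlSketch {

    // Predicted speedup on n threads when a fraction p of the work is parallelizable.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    // Inverse of the formula: the parallel fraction implied by a measured speedup on n threads.
    static double parallelFraction(double measuredSpeedup, int n) {
        return (1.0 - 1.0 / measuredSpeedup) / (1.0 - 1.0 / n);
    }

    public static void main(String[] args) {
        // Linux server timings from the question: 1 thread vs. 4 threads.
        double t1 = 4204436859.0;
        double t4 = 1850708620.0;

        double measured = t1 / t4;
        double p = parallelFraction(measured, 4);

        System.out.printf("measured speedup on 4 threads: %.2f%n", measured);
        System.out.printf("implied parallel fraction:     %.2f%n", p);
        System.out.printf("limit as threads grow:         %.2f%n", 1.0 / (1.0 - p));
    }
}
```

Under those assumptions the implied parallel fraction comes out near 0.75, which would cap the speedup at roughly 4x no matter how many threads are added. Overhead and contention push the real numbers lower still.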

You can read from two different hard disk drives at the same time without a slowdown.

But parallelism usually does not speed up reading from a single hard drive.

Hard disk drives (i.e. drives with a spinning disk) are optimized for sequential reads, and jumping around between locations on the disk slows down the overall transfer.

Solid state drives are actually quite good at random access, jumping here and there in storage, so with solid state drives keeping the read/write queue full is a good idea.

Understanding the idea of a cache line will help you avoid false sharing.

This type of memory operation can be parallelized effectively, such as iterating over an array by dividing it into four partitions.
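A minimal sketch of the partitioning idea, with hypothetical names: each thread sums one contiguous slice of the array, accumulates into a local variable, and writes to the shared result array only once at the end. Keeping the running total local is one common way to keep threads from repeatedly writing to neighboring slots that may sit on the same cache line.

```java
// Sketch: sum an array using one contiguous partition per thread.
class PartitionedSum {

    static long sum(int[] data, int threads) {
        long[] partial = new long[threads];                 // one result slot per thread
        Thread[] workers = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;  // ceiling division

        for (int t = 0; t < threads; t++) {
            final int id = t;
            final int from = Math.min(data.length, id * chunk);
            final int to = Math.min(data.length, from + chunk);
            workers[t] = new Thread(() -> {
                long local = 0;                  // accumulate locally, not in shared memory
                for (int i = from; i < to; i++) {
                    local += data[i];
                }
                partial[id] = local;             // single write to the shared array
            });
            workers[t].start();
        }

        long total = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();               // join makes the worker's write visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            total += partial[t];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i % 10;
        }
        System.out.println(sum(data, 4));
    }
}
```

If the loop body were instead `partial[id] += data[i]`, every thread would write to adjacent array elements on each iteration, which is the kind of pattern that tends to cause false sharing.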

I'm assuming that your times are in nanoseconds, so on computer 1 the program took about 2.6 seconds with one thread and then leveled off at about 2 seconds, with a best time of about 1.9 seconds.

I hope that you had minimal background programs running at the same time, and that you ran these tests several times to rule out irregularities.

Also, irregularities in timing can come from the Just In Time (JIT) compilation done by the Java virtual machine, so to time accurately you want to run the code in a loop a few times and keep the time of the last iteration (or pre-compile to native code).

Also, the first time the program runs, much of the data it reads from the hard drive is brought into the cache, so later executions should be faster. (So either use the timing of the last run after looping, to ensure the data is in the cache, or use the first timing but power the computer off and on between timings.)
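The loop-and-keep-the-last-timing approach described above could look roughly like this. `workload()` is a hypothetical stand-in for the real program, and the number of rounds is an arbitrary choice for illustration.

```java
// Sketch: time a workload after the JIT and caches have warmed up.
class WarmupTiming {

    // Runs the task `rounds` times and returns the duration of the last run in nanoseconds.
    static long timeLastRun(Runnable task, int rounds) {
        long elapsed = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed = System.nanoTime() - start;  // overwritten each round; only the last is kept
        }
        return elapsed;
    }

    static volatile long sink;  // receives the result so the loop below is not optimized away

    // Hypothetical stand-in for the program being measured.
    static void workload() {
        long acc = 0;
        for (int i = 0; i < 5_000_000; i++) {
            acc += i ^ (i >>> 3);
        }
        sink = acc;
    }

    public static void main(String[] args) {
        long ns = timeLastRun(WarmupTiming::workload, 10);
        System.out.printf("last of 10 runs: %.3f ms%n", ns / 1e6);
    }
}
```

This is only a hand-rolled approximation. A dedicated benchmarking harness such as JMH handles warm-up and run-to-run variance more rigorously.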

As for whether the program is memory-bound: based only on your timings, this is hard to say.

The first computer took about 2.6 seconds, then ran about 26% faster with 2 threads (about 1.9 seconds), but then stayed at about 2.0 seconds.

By itself, this speedup could just be the result of the JIT and of the cache having been filled during the 1-thread timing. After that, any differences in run time might just be noise.

The second computer took 4.2 seconds, then 2.8, then 1.9, then went back up to about 2.4 seconds.

This one does seem to show some speed-up from parallelism, but some type of contention occurs (memory, cache lines, synchronization, etc.), as shown by the increase in time from 4 threads to 8 threads.

To improve performance without changing the hardware, start by using a profiler on your code to determine which parts take up the most time.

(You can simulate a profiler by running your code in a debugger, pausing it, and seeing where the program is. Repeat that 10 times to see whether it stops in one part disproportionately more often than in others.)

Use better algorithms, or arrange the data in memory (data structures) in a way that suits the problem better.

Exploit more parallelism in the problem.

Try to make hard drive reads sequential. Maybe have just one thread that reads from the hard drive and puts the data in a concurrent queue to be operated on by the other threads.
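One possible shape for the single-reader idea is sketched below: the calling thread reads lines sequentially and hands them to worker threads through a bounded `BlockingQueue`. The class name, the queue capacity, and the per-line work (summing line lengths) are placeholders chosen for illustration, not anything taken from the original program.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: one sequential reader feeding several worker threads through a queue.
class SingleReaderPipeline {

    // End-of-input marker; compared by identity, so it cannot collide with a real line.
    private static final String POISON = new String("<end>");

    static long totalLineLength(BufferedReader in, int workers) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        AtomicLong total = new AtomicLong();
        Thread[] pool = new Thread[workers];

        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                try {
                    long local = 0;
                    for (String line = queue.take(); line != POISON; line = queue.take()) {
                        local += line.length();   // placeholder for the real per-item work
                    }
                    total.addAndGet(local);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[w].start();
        }

        try {
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    queue.put(line);              // blocks when the workers fall behind
                }
            } finally {
                for (int w = 0; w < workers; w++) {
                    queue.put(POISON);            // one marker per worker so each one stops
                }
            }
            for (Thread t : pool) {
                t.join();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return total.get();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0], if given, names a text file; otherwise a small demo string is used.
        try (BufferedReader in = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new BufferedReader(new StringReader("alpha\nbeta\ngamma\n"))) {
            System.out.println(totalLineLength(in, 4));
        }
    }
}
```

The bounded queue keeps the reader from running far ahead of the workers, so memory use stays limited while the disk is still read in one sequential pass.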
