Multithreading slowdown, CPU bound app, probably hardware related


Problem description

The exact details of my multi-threaded application probably don't matter. It's completely CPU bound (runs for hours or days at a time), does no I/O, there is ample memory for all the threads, and the threads barely communicate with each other (for all practical purposes they don't communicate with each other). There are many sub-problems to be solved by the computation, each of which runs for several hours, and the only purpose of multi-threading is to keep all the processor cores busy all the time. The threads do share some large blocks of memory, but the shared memory is only read and is never written into. Except for the sharing of the large block of memory which is only read, the application could just as easily be a separate "job" or "task" for each processor core.

The problem is that the CPU time to solve each sub-problem goes up with the multi-threading. For example, suppose there is a small test problem that requires 250 CPU seconds to run as a single thread (and also 250 wall clock seconds). Running two such sub-problems at the same time as two threads might take 270 CPU seconds for each thread, for a total of 540 CPU seconds and 270 wall clock seconds. In a perfect world where everything "just works", the test of two sub-problems at the same time would require 250 CPU seconds for each thread, for a total of 500 CPU seconds and 250 wall clock seconds.

Increasing the number of simultaneous sub-problems and threads up to the number of processor cores exacerbates the apparent slowdown. It's always faster overall to add threads up to the number of processor cores, but using n processor cores does not produce an n-times speedup. For example, on a machine with 16 processor cores, using 16 threads only produces about 8 or 9 times as much computation.

Obviously there seems to be some sort of hardware cause, such as contention on the memory bus or contention between the multiple processor cores on the same processor chip, etc. Is this a known problem? Is there anything I can do about it?

What I have tried:

One possible problem is that I run some of the speed tests on a high end laptop with a single quad-core processor. Running all the processor cores at 100% generates heat, which causes laptop power management to intervene and slow down the processor clock speeds, because slower clock speeds generate less heat. I have tweaked power management settings to no avail. I also have an ancient server with 64GB memory and four quad-core processors, for a total of 16 cores. Running speed tests on this machine produces results that are essentially equivalent to the results on the laptop. So laptop heat and power management may be an issue, but if so it does not seem to be the main issue. Depending on the exact size of the speed test that I run, it can be slightly faster to run 16 separate jobs with one thread each, or it can be slightly faster to run 1 big job with sixteen threads. The contention therefore does not seem to be in my application. Rather, it seems to be in the hardware. I run under Windows 10 on the laptop and under Wine under Linux on the server.

Recommended answer

Many thanks for the responses. I think I need to add a little extra information.

I am perhaps naive, but I really did hope for nearly a 16x speedup on my ancient server (Dell R900) with 16 processor cores. It's a Linux machine and my Windows program is running under Wine. The "top" command usually says that my program is getting 1580% to 1590% of the CPU out of a total of 1600%. There are certainly other processes running but they are taking very few cpu cycles. My program does no I/O, my threads do not interact (no queues, no locking, no waiting, etc.), and there is ample memory. (When I run on Windows, there is more "Windows junk" running in the background, but I can still usually get 350% or so of the machine for my program out of the 400% possible with four processor cores.)

I really do not believe my program has a multi-threading design problem (but of course it's certainly possible it does). Therefore, I decided to try to create a complete dummy program which illustrates the problem and which is small enough to post. Here it is.

// timing_test.cpp : dummy program to test the timing of multiple CPU bound threads.
//

#include "stdafx.h"
#include <stdio.h>
#include <vector>
#include <algorithm>
#include <boost/timer/timer.hpp>
#include <boost/thread.hpp>

std::vector<int> numbers;	// vector to contain 10,000,000 numbers to be sorted as dummy work.
void timing_test_2(void);	// function prototype for thread worker

class Comp_modulus	// functor for comparing integers mod k
{
	public:

	int k;	// k value to calculate modulus

	bool operator()(const std::vector<int>::iterator &x, const std::vector<int>::iterator &y)
	{
		return ( ( (*x) % k ) < ( (*y) % k )  );	// compare the integers by their modulus rather than by their
													// .... values to sort in strange ways. It is iterators that
													// .... are being sorted, not the integers themselves.
	}
};

int main(void)
{
	int N = 16;					// number of threads to create
	numbers.resize(10000000);	// storage for 10,000,000 numbers

	int i = 0;
	for (std::vector<int>::iterator it = numbers.begin(); it != numbers.end(); it++)
		(*it) = i++; // fill in the numbers 0 to 9,999,999 which will be sorted in a variety of strange ways.

	std::cout << "beginning timing test, number of threads = " << std::setw(2) << (int)N << std::endl;	

	boost::timer::auto_cpu_timer auto_t;	// Quick and dirty overall timer. It prints automatically
											// .... when it is destructed at the end of the program.

	boost::thread_group tg;									// Define a thread group in preparation ....
															// .... for creating the threads	

	for (int i =  0; i < N; i++)
	{
		tg.create_thread( boost::bind( timing_test_2) );	// Create the N threads
	}

	tg.join_all();											// Wait until all the threads are done.

	std::cout << "completed timing test" << std::endl;

	return 0;
}

void timing_test_2(void)		// this is the thread that creates the artificial load to simulate work.
{
	std::vector<std::vector<int>::iterator> numbers_p;	// storage for iterators (pointers) which point to the array of integers
	numbers_p.resize( numbers.size() );					// allocate the correct amount of storage, same number of iterators
														// ... as integers to which they point.

	std::vector<int>::iterator it = numbers.begin();	// point to the beginning of the integer array.

	for (std::vector<std::vector<int>::iterator>::iterator it_it = numbers_p.begin(); it_it != numbers_p.end(); it_it++)
		(*it_it) = it++;	// populate the iterator vector with iterators pointing to the integers

	Comp_modulus comp_modulus;	// instantiate the functor which does strange compares

	// create the dummy work

	comp_modulus.k =  2; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  3; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  5; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k =  7; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 11; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 13; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 17; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 19; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 23; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );
	comp_modulus.k = 29; std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus );

}





As you can see, it's C++. I compile on Windows with Visual Studio 2010. Hence, I use boost for some timing and multi-threading functions. If you don't have boost installed and if you have a new enough C++ compiler, it would be straightforward to accomplish the timing and multi-threading with C++ standard library functions.
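
For anyone without boost but with a C++11 compiler, here is a minimal sketch of the same thread creation and wall-clock timing using only the standard library. This is only to show the shape, not the code I actually ran, and note that std::chrono reports wall time only; the user/system CPU breakdown printed by boost::timer::auto_cpu_timer would still need clock() or a platform-specific call.

// Minimal sketch: the same thread creation and wall-clock timing using only
// the C++11 standard library (<thread>, <chrono>) instead of boost.
// It reuses the timing_test_2() worker defined in the listing above.
#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

void timing_test_2(void);	// worker function from the listing above

int main(void)
{
	int N = 16;	// number of threads to create

	auto start = std::chrono::steady_clock::now();

	std::vector<std::thread> threads;
	for (int i = 0; i < N; i++)
		threads.emplace_back(timing_test_2);	// create the N threads

	for (std::thread &t : threads)
		t.join();	// wait until all the threads are done

	auto stop = std::chrono::steady_clock::now();
	std::cout << std::chrono::duration<double>(stop - start).count()
		<< "s wall" << std::endl;	// wall time only; no user/system CPU split here
	return 0;
}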

What my test program does looks pretty strange, but it actually mirrors quite well what my real program does. The test program first creates an array (std::vector) containing 10,000,000 integers in the range of 0 to 9,999,999. It then creates an artificial workload for the purposes of timing by sorting the integers by various prime moduli. For example, sorting by a modulus of 2 sorts all the even numbers in front of all the odd numbers. Using prime numbers for the moduli makes sure the numbers get scrambled pretty thoroughly and that std::sort has plenty of work to do.

The numbers being sorted are stored only a single time and conceptually are sorted simultaneously by each thread. However, there is no locking between threads and the numbers aren't really moved around by the sort at all. Rather, this magic is achieved by each thread creating its own private set of pointers to the integers (iterators in C++ standard library terms). Then it's the pointers that are sorted rather than the integers.

In the timing test program, the integers are 32 bit and the pointers are 64 bit, so each thread could just as well have its own copy of the integers and sort them directly instead of having pointers and sorting the pointers. In my real program, the integers become 32 byte permutations, and it's much more memory efficient for the threads to work with 64 bit pointers (8 bytes) than to work with 32 byte permutations. It's also much faster to sort items that are 8 bytes long than to sort items that are 32 bytes long.
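
To make the pointer-versus-record point concrete, here is a rough sketch of the idea; the Perm32 type and its comparison are made-up stand-ins for my real 32-byte permutation representation, not the actual code:

// Rough sketch: the 32-byte records stay put in one shared, read-only vector;
// each thread sorts its own private vector of 8-byte indices into it.
// "Perm32" and the comparison below are placeholders, not the real representation.
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

struct Perm32 { std::array<std::uint8_t, 32> v; };	// 32-byte record (placeholder)

void sort_by_index(const std::vector<Perm32> &records)	// shared records, only read
{
	std::vector<std::uint64_t> idx(records.size());	// private 8-byte handles, one per record
	for (std::uint64_t i = 0; i < idx.size(); ++i)
		idx[i] = i;

	std::sort(idx.begin(), idx.end(),
		[&records](std::uint64_t a, std::uint64_t b)
		{ return records[a].v < records[b].v; });	// only the 8-byte indices move
}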

The application is computational group theory, and the problem is to attempt to enumerate an optimal solution for every single one of the possible positions of Rubik's Cube. Finding an optimal solution for a position is much more computationally intense than simply finding a solution for a position. There are about 4.3 x 10^19 Rubik's cube positions, so it's an extremely non-trivial problem. Its ultimate solution will undoubtedly require many hundreds if not thousands or hundreds of thousands of machines working on the problem in a piece-wise fashion.

Here are the results of running the dummy timing program on my ancient server. The test program is so simple minded that I had to run it 16 different times, changing the number of threads each time. You can see how the wall clock and cpu times do not scale up as you might hope. For example, with one thread the wall clock time and the cpu time are both about 24 seconds. So one would hope that with two threads the wall clock time would be about 24 seconds and that the cpu time would be about 48 seconds. But the real figures are about 26.5 seconds and about 53 seconds, respectively. The problem only gets worse with more threads. The same thing happens on my Windows machine. And the same thing happens if you get rid of the threads and just run multiple instances of a one-thread program.
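
As a side note, if anyone wants to reproduce this without recompiling sixteen times, the main() above could be changed along the following lines so that the thread count comes from the command line. This is a sketch only, not what I actually ran; the rest of main() stays exactly as in the listing.

// Sketch: take the thread count from the command line, e.g. "timing_test 4",
// so one binary covers all sixteen runs.
#include <cstdlib>

int main(int argc, char *argv[])
{
	int N = 16;	// default number of threads, as before
	if (argc > 1)
		N = std::atoi(argv[1]);	// override from the command line
	if (N < 1 || N > 16)
		N = 16;	// keep the count in the range used for these tests

	// ... fill "numbers", start the timer, create the N threads with
	// .... tg.create_thread(...), and join them, exactly as in the listing above.
	return 0;
}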

Jerry Bryan

beginning timing test, number of threads =  1
completed timing test
 24.109956s wall, 23.990000s user + 0.080000s system = 24.070000s CPU (99.8%)

beginning timing test, number of threads =  2
completed timing test
 26.558628s wall, 52.820000s user + 0.210000s system = 53.030000s CPU (199.7%)

beginning timing test, number of threads =  3
completed timing test
 27.632885s wall, 82.410000s user + 0.310000s system = 82.720000s CPU (299.4%)

beginning timing test, number of threads =  4
completed timing test
 33.766357s wall, 120.150000s user + 0.390000s system = 120.540000s CPU (357.0%)

beginning timing test, number of threads =  5
completed timing test
 34.820234s wall, 152.480000s user + 0.510000s system = 152.990000s CPU (439.4%)

beginning timing test, number of threads =  6
completed timing test
 33.279489s wall, 183.450000s user + 0.680000s system = 184.130000s CPU (553.3%)

beginning timing test, number of threads =  7
completed timing test
 35.896250s wall, 235.820000s user + 0.890000s system = 236.710000s CPU (659.4%)

beginning timing test, number of threads =  8
completed timing test
 36.156149s wall, 275.370000s user + 1.170000s system = 276.540000s CPU (764.8%)

beginning timing test, number of threads =  9
completed timing test
 38.529383s wall, 319.610000s user + 1.540000s system = 321.150000s CPU (833.5%)

beginning timing test, number of threads = 10
completed timing test
 40.542099s wall, 367.840000s user + 1.970000s system = 369.810000s CPU (912.2%)

beginning timing test, number of threads = 11
completed timing test
 45.129656s wall, 421.800000s user + 2.340000s system = 424.140000s CPU (939.8%)

beginning timing test, number of threads = 12
completed timing test
 44.265782s wall, 485.820000s user + 2.650000s system = 488.470000s CPU (1103.5%)

beginning timing test, number of threads = 13
completed timing test
 45.339959s wall, 545.310000s user + 3.190000s system = 548.500000s CPU (1209.7%)

beginning timing test, number of threads = 14
completed timing test
 45.510356s wall, 587.020000s user + 3.660000s system = 590.680000s CPU (1297.9%)

beginning timing test, number of threads = 15
completed timing test
 47.840413s wall, 651.430000s user + 4.310000s system = 655.740000s CPU (1370.7%)

beginning timing test, number of threads = 16
completed timing test
 49.448533s wall, 731.640000s user + 4.400000s system = 736.040000s CPU (1488.5%)
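
Working the arithmetic from these figures: one thread completes one copy of the workload in about 24.1 wall-clock seconds, while sixteen threads complete sixteen copies in about 49.4 wall-clock seconds. The throughput speedup is therefore roughly 16 x 24.1 / 49.4, or about 7.8, which is consistent with the 8 or 9 times figure I mentioned in the original question.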


Quote:

I am perhaps naive, but I really did hope for nearly a 16x speedup on my ancient server (Dell R900) with 16 processor cores.



The speedup can only apply when the task is split between the threads, each doing only a part of the job in parallel.
As far as I can see, in your code, each thread is doing the full job.
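
As an illustration only of what I mean (reusing your "numbers" vector and Comp_modulus functor, with a hypothetical timing_test_split() worker), one way to actually split the work would be to give each thread only some of the ten sorts:

// Illustration only: split the ten modulus sorts across the threads instead of
// having every thread repeat all ten. Reuses "numbers" and "Comp_modulus" from
// the listing above; each thread still builds its own private iterator vector.
#include <algorithm>
#include <cstddef>
#include <vector>

static const int moduli[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

void timing_test_split(int thread_index, int num_threads)
{
	std::vector<std::vector<int>::iterator> numbers_p(numbers.size());
	std::vector<int>::iterator it = numbers.begin();
	for (std::size_t i = 0; i < numbers_p.size(); ++i)
		numbers_p[i] = it++;	// private iterators, as in timing_test_2()

	Comp_modulus comp_modulus;

	for (int j = thread_index; j < 10; j += num_threads)	// round-robin share of the sorts
	{
		comp_modulus.k = moduli[j];
		std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus);
	}
}

The threads would then be created with something like tg.create_thread(boost::bind(timing_test_split, i, N)).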

Quote:

I really do not believe my program has a multi-threading design problem (but of course it's certainly possible it does). Therefore, I decided to try to create a complete dummy program which illustrates the problem and which is small enough to post. Here it is.



In your program, each thread is sorting a 10,000,000-element array 10 times. This kind of activity is thrashing the cpu cache. So the runtime is dominated by continuous cache misses, which prevents the cpu from taking full advantage of the cores.

By the way, the threads do not share memory: each thread works on its own array of 10,000,000 entries and sorts it 10 times.
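
If you want to see how much of this is cache pressure, one possible experiment (a sketch only, with sizes chosen purely for illustration) is to shrink the working set so it fits in cache and repeat the sorts so the total amount of work stays roughly comparable:

// Sketch of a cache-pressure experiment; sizes are illustrative, not tuned.
// With numbers.resize(100000) in main(), the shared data is ~400 KB and each
// thread's private iterator vector is ~800 KB, versus ~40 MB and ~80 MB at
// 10,000,000 entries. Repeating the sorts keeps the total work comparable.
#include <algorithm>
#include <cstddef>
#include <vector>

void timing_test_small(void)	// reuses "numbers" and "Comp_modulus" from the listing above
{
	std::vector<std::vector<int>::iterator> numbers_p(numbers.size());
	std::vector<int>::iterator it = numbers.begin();
	for (std::size_t i = 0; i < numbers_p.size(); ++i)
		numbers_p[i] = it++;

	Comp_modulus comp_modulus;
	static const int moduli[10] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

	for (int rep = 0; rep < 100; rep++)	// 100 repetitions of the ten sorts
		for (int j = 0; j < 10; j++)
		{
			comp_modulus.k = moduli[j];
			std::sort(numbers_p.begin(), numbers_p.end(), comp_modulus);
		}
}

If the scaling with 16 threads is noticeably better at the small size, cache and memory bandwidth are the likely culprits.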

