High CPU, possibly due to context switching?


Problem Description

One of our servers is experiencing a very high CPU load with our application. We've looked at various stats and are having issues finding the source of the problem.

One of the current theories is that there are too many threads involved and that we should try to reduce the number of concurrently executing threads. There's just one main thread pool, with 3000 threads, and a WorkManager working with it (this is Java EE - Glassfish). At any given moment, there are about 620 separate network IO operations that need to be conducted in parallel (use of java.NIO is not an option either). Moreover, there are roughly 100 operations that have no IO involved and are also executed in parallel.
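For illustration only: the application's threads are actually managed by a Glassfish WorkManager rather than hand-built executors, and the class and pool names below are hypothetical. This is just a minimal java.util.concurrent sketch, compatible with Java 1.5, of the kind of bounded split between the blocking IO work and the CPU-bound work described above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one small pool for the ~100 CPU-bound operations,
// one larger (but still bounded) pool for the ~620 blocking network IO operations.
public class PoolSizingSketch {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors(); // 2 on the VM described above

        ExecutorService cpuPool = Executors.newFixedThreadPool(cpus);
        ExecutorService ioPool  = Executors.newFixedThreadPool(620);

        // cpuPool.execute(someCpuBoundTask);  // CPU-bound work stays at ~1 thread per core
        // ioPool.execute(someBlockingIoTask); // blocking IO gets its own bounded pool

        cpuPool.shutdown();
        ioPool.shutdown();
    }
}
```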

This structure is not efficient and we want to see if it is actually causing damage, or is simply bad practice. Reason being that any change is quite expensive in this system (in terms of man hours) so we need some proof of an issue.

So now we're wondering if context switching of threads is the cause, given there are far more threads than the required concurrent operations. Looking at the logs, we see that on average there are 14 different threads executed in a given second. If we take into account the existence of two CPUs (see below), then it is 7 threads per CPU. This doesn't sound like too much, but we wanted to verify this.

So - can we rule out context switching or too-many-threads as the problem?

General Details:

  1. Java 1.5 (yes, it's old), running on CentOS 5, 64-bit, Linux kernel 2.6.18-128.el5
  2. There is only one single Java process on the machine, nothing else.
  3. Two CPUs, under VMware.
  4. 8GB RAM
  5. We don't have the option of running a profiler on the machine.
  6. We don't have the option of upgrading the Java, nor the OS.

UPDATE As advised below, we've conducted captures of load average (using uptime) and CPU (using vmstat 1 120) on our test server with various loads. We've waited 15 minutes between each load change and its measurements to ensure that the system stabilized around the new load and that the load average numbers are updated:

50% of the production server's workload: http://pastebin.com/GE2kGLkk

34% of the production server's workload: http://pastebin.com/V2PWq8CG

25% of the production server's workload: http://pastebin.com/0pxxK0Fu

CPU usage appears to be reduced as the load reduces, but not on a very drastic level (change from 50% to 25% is not really a 50% reduction in CPU usage). Load average seems uncorrelated with the amount of workload.

There's also a question: given our test server is also a VM, could its CPU measurements be impacted by other VMs running on the same host (making the above measurements useless)?

UPDATE 2 Attaching the snapshot of the threads in three parts (pastebin limitations)

Part 1: http://pastebin.com/DvNzkB5z

Part 2: http://pastebin.com/72sC00rc

Part 3: http://pastebin.com/YTG9hgF5

Solution

Seems to me the problem is the 100 CPU-bound threads more than anything else. The 3000-thread pool is basically a red herring, as idle threads don't consume much of anything. The I/O threads are likely sleeping "most" of the time, since I/O is measured on a geologic time scale in terms of computer operations.

You don't mention what the 100 CPU-bound threads are doing, or how long they last, but if you want to slow down a computer, dedicating 100 threads of "run until the time slice says stop" will most certainly do it. Because you have 100 threads that are always ready to run, the machine will context switch as fast as the scheduler allows, and there will be pretty much zero idle time. Context switching has an impact because you're doing it so often. Since the CPU-bound threads are (likely) consuming most of the CPU time, your I/O-bound threads will spend longer waiting in the run queue than they spend waiting for I/O. So even more processes end up waiting (the I/O processes simply bail out more often, since they quickly hit an I/O wait that idles them out in favor of the next process).
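As a toy illustration of that point (and not the application in question), the hypothetical sketch below starts N threads that never block. Run it on a small machine while watching vmstat 1: the "cs" (context switches) column climbs and the "id" (idle) column drops toward zero.

```java
// Toy demo (hypothetical): start N busy-spin threads that are always runnable,
// then sleep so the effect can be observed externally with "vmstat 1".
public class BusyThreadsDemo {
    public static void main(String[] args) throws InterruptedException {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 100;
        for (int i = 0; i < n; i++) {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    long x = 0;
                    // Never blocks: this thread stays "ready to run" until the process exits.
                    while (x >= 0) {
                        x += System.nanoTime() & 1;
                    }
                }
            });
            t.setDaemon(true); // let the JVM exit when main finishes
            t.start();
        }
        Thread.sleep(60L * 1000L); // observe for one minute, then exit
    }
}
```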

No doubt there are tweaks here and there to improve efficiency, but 100 CPU threads are 100 CPU threads. Not much you can do there.
