When ColdFusion is maxing out the CPU, how do I find out what it's chewing/choking on?

Problem Description

I'm running CF 9.0.1 on Ubuntu on a "Medium" Amazon EC2 instance. CF has been seizing up intermittently (several times per day...but notably not isolated to hours of peak usage). At such times, running top gets me this (or something similar):

PID     USER    PR  NI  VIRT    RES     SHR S   %CPU    %MEM    TIME+COMMAND
15855   wwwrun  20  0   1762m   730m    20m S   99.3    19.4    13:22.96 coldfusion9

So, it's obviously consuming most of the server resources. The following error has been showing up in my cfserver.log in the lead-up to each seize-up:

java.lang.RuntimeException: Request timed out waiting for an available thread to run. You may want to consider increasing the number of active threads in the thread pool.

If I run /opt/coldfusion9/bin/coldfusion status, I get:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes 
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  150   25    0     0      -1352560      0      0

In the administrator, under Server Settings > Request Tuning, the setting for Maximum number of simultaneous Template requests is 25. So this makes sense so far. I could just increase the thread pool to cover these sorts of load spikes. I could make it 200. (Which I did just now as a test.)

However, there's also this file /opt/coldfusion9/runtime/servers/coldfusion/SERVER-INF/jrun.xml. And some of the settings in there appear to conflict. For example, it reads:

<service class="jrunx.scheduler.SchedulerService" name="SchedulerService">
  <attribute name="bindToJNDI">true</attribute>
  <attribute name="activeHandlerThreads">25</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="minHandlerThreads">20</attribute>
  <attribute name="threadWaitTimeout">180</attribute>
  <attribute name="timeout">600</attribute>
</service>

Which a) has fewer active threads (what does that mean?), and b) has a max thread count that exceeds the simultaneous request limit set in the admin. So, I'm not sure. Are these independent configs that need to be matched up manually? Or is the jrun.xml file supposed to be rewritten by the CF Administrator when changes are made there? Hmm. But maybe this is different because presumably the CF Scheduler should only use a subset of all available threads, right?...so we'd always have some threads left over for real live users? We also have this in there:

<service class="jrun.servlet.http.WebService" name="WebService">
  <attribute name="port">8500</attribute>
  <attribute name="interface">*</attribute>
  <attribute name="deactivated">true</attribute>
  <attribute name="activeHandlerThreads">200</attribute>
  <attribute name="minHandlerThreads">1</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="mapCheck">0</attribute>
  <attribute name="threadWaitTimeout">300</attribute>
  <attribute name="backlog">500</attribute>
  <attribute name="timeout">300</attribute>
</service>

This appears to have changed when I changed the CF Admin setting...maybe...but it's the activeHandlerThreads that matches my new maximum simultaneous requests setting...rather than the maxHandlerThreads, which again exceeds it. Finally, we have this:

<service class="jrun.servlet.jrpp.JRunProxyService" name="ProxyService">
  <attribute name="activeHandlerThreads">200</attribute>
  <attribute name="minHandlerThreads">1</attribute>
  <attribute name="maxHandlerThreads">1000</attribute>
  <attribute name="mapCheck">0</attribute>
  <attribute name="threadWaitTimeout">300</attribute>
  <attribute name="backlog">500</attribute>
  <attribute name="deactivated">false</attribute>
  <attribute name="interface">*</attribute>
  <attribute name="port">51800</attribute>
  <attribute name="timeout">300</attribute>
  <attribute name="cacheRealPath">true</attribute>
</service>

So, I'm not certain which (if any) of these I should change and what exactly the relationship is between maximum requests and maximum threads. Also, since several of these list the maxHandlerThreads as 1000, I'm wondering if I should just set the maximum simultaneous requests to 1000. There must be some upper limit that depends on available server resources...but I'm not sure what it is and I don't really want to play around with it since it's a production environment.

I'm not sure if it pertains to this issue at all, but when I run a ps aux | grep coldfusion I get the following:

wwwrun   15853  0.0  0.0   8704    760    pts/1     S   20:22   0:00 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -autorestart -start coldfusion
wwwrun   15855  5.4 18.2   1678552 701932 pts/1     Sl  20:22   1:38 /opt/coldfusion9/runtime/bin/coldfusion9 -jar jrun.jar -start coldfusion

There are always these two and never more than these two processes. So there does not appear to be a one-to-one relationship between processes and threads. I recall from an MX 6.1 install I maintained for many years that additional CF processes were visible in the process list. It seemed to me at the time like I had a process for each thread...so either I was wrong or something is quite different in version 9 since it's reporting 25 running requests and only showing these two processes. If a single process can have multiple threads in the background, then I'm given to wonder why I have two processes instead of one?...just curious.
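
For what it's worth, the threads inside that single JVM process can be listed directly on Linux, which makes it easier to see which thread is actually burning CPU. A minimal sketch, assuming 15855 is the ColdFusion JVM PID from the ps output above:

# Show per-thread CPU usage inside the CF JVM (each row is one thread/LWP)
top -H -p 15855

# Or the same via ps: thread id (LWP), CPU%, elapsed time, and command name
ps -Lp 15855 -o lwp,pcpu,etime,comm

# A hot thread's LWP, converted to hex, matches the nid=0x... field of that
# thread in a jstack dump (see the answer below); 12345 is a hypothetical LWP
printf '%x\n' 12345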

So, anyway, I've been experimenting while composing this post. As noted above I adjusted the maximum simultaneous requests up to 200. I was hoping this would solve my problem, but CF just crashed again (rather it slogged down and requests started timing out...so effectively "crashed"). This time, top looked similar (still consuming more than 99% of the CPU), but CF status looked different:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  0     150   0     0      0      0      0      0

Obviously, since I'd increased the maximum simultaneous requests, it was allowing more requests to run simultaneously...but it was still maxing out the server resources.

Further experiments (after restarting CF) showed me that the server became unusably slogged after about 30-35 "Reqs Run'g", with all additional requests headed for an inevitable timeout:

Pg/Sec  DB/Sec  CP/Sec  Reqs  Reqs  Reqs  AvgQ   AvgReq AvgDB  Bytes  Bytes
Now Hi  Now Hi  Now Hi  Q'ed  Run'g TO'ed Time   Time   Time   In/Sec Out/Sec
0   0   0   0   -1  -1  0     33    0     0      -492   0      0      0

So, it's clear that increasing the maximum simultaneous requests has not helped. I guess what it comes down to is this: What is it having such a hard time with? Where are these spikes coming from? Bursts of traffic? On what pages? What requests are running at any given time? I guess I simply need more information to continue troubleshooting. If there are long-running requests or other issues, I'm not seeing them in the logs (although I do have that option checked in the admin). I need to know exactly which requests are responsible for these spikes. Any help would be much appreciated. Thanks.

~Day

Recommended Answer

I've had a number of 'high-CPU in production' type bugs and the way I've always dealt with them is this:

  1. Use jstack PID >> stack.log to dump 5 stack traces, 5 seconds apart. The number of traces and the timing aren't critical.
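
A minimal sketch of that capture step, assuming 15855 is the ColdFusion JVM PID from the ps output in the question and that the JDK's jstack is on the PATH (otherwise use the full path to it):

# Take 5 thread dumps of the CF JVM, 5 seconds apart, appended to one file;
# run as the user that owns the CF process (wwwrun here) or jstack may fail to attach
for i in 1 2 3 4 5; do
    jstack 15855 >> stack.log
    sleep 5
done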

Samurai中打开日志.每次进行转储时,您都会看到线程的视图.处理您的代码的线程启动 web-(用于使用内置服务器的请求)和 jrpp- 用于通过 Apache/IIS 进入的请求.

Open the log in Samurai. You get a view of the threads at each time you did a dump. Threads processing your code start web- (for requests using the built-in server) and jrpp- for requests coming in through Apache/IIS.

Read the history of each thread. You're looking for the stack being very similar in each dump. If a thread looks like it's handling the same request the whole time, the bits that vary near the top will point to where an infinite loop is happening.
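
If you'd rather grep than use Samurai, you can pull one suspect thread's history straight out of the combined file; a rough sketch (the thread name "jrpp-123" is only a placeholder, use whichever thread shows up busy in your dumps):

# Print the first 25 lines of each dump of one suspect thread, across all 5 dumps
grep -A 25 '"jrpp-123"' stack.log

# Count how many jrpp- (Apache/IIS) request threads appear across the dumps
grep -c '^"jrpp-' stack.log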

Feel free to dump a stack trace somewhere online and point us to it.

The other technique I've used to understand what's going on is to modify Apache's httpd.conf to log the time taken (%D) and the session id (%{jsessionid}), which lets you trace individual users in the run-up to hangs and do some nice stats/graphs with the data (I use LogParser to crunch the numbers and output to CSV, then Excel to graph the data):

LogFormat "%h %l %u %t \"%r\" %>s %b %D %{jsessionid}" customAnalysis
CustomLog logs/analysis_log customAnalysis
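
(Depending on the Apache version, the session id may need cookie syntax, e.g. %{jsessionid}C, to be logged.) Once that log is being written, even without LogParser you can get a quick list of the slowest requests from the shell; a rough sketch, assuming the field layout above and request URLs with no embedded spaces:

# %D (microseconds) is the second-to-last field and the request path is field 7;
# list the 20 slowest requests recorded so far
awk '{ print $(NF-1), $7 }' logs/analysis_log | sort -rn | head -20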

One other technique I've just remembered is to enable CF Metrics, which will give you some measure of what the server was up to in the run-up to a hang. I set this to log every 10 seconds and change the format to CSV, so I can grep the metrics out of the event log and then run them through Excel to graph server load in the run-up to crashes.
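
From memory, metrics are switched on in the same jrun.xml discussed above, under the LoggerService entry; the attribute names below are recalled rather than verified, so treat this as a sketch and check them against what already exists in your copy of the file:

<service class="jrunx.logger.LoggerService" name="LoggerService">
  <!-- ...existing logger attributes stay as they are... -->
  <!-- attribute names from memory; verify against your jrun.xml before editing -->
  <attribute name="metricsEnabled">true</attribute>
  <attribute name="metricsLogFrequency">10</attribute>
  <attribute name="metricsFormat">{jrpp.busyTh},{jrpp.totalTh},{freeMemory},{totalMemory},{sessions}</attribute>
</service>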

Barney
