在IIS 7.5中使用传出异步Web请求时可伸缩性问题 [英] Scalability issue when using outgoing asynchronous web requests on IIS 7.5

查看：435 发布时间：2016/6/5 19:29:09 asp.net iis asynchronous queue

本文介绍了在IIS 7.5中使用传出异步Web请求时可伸缩性问题的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

的以下长描述了一点，但它是一个相当棘手的问题。我曾试图掩盖我们所知道的有关问题，以缩小搜索范围。现在的问题是超过一个基于单一问题一正在进行的调查，但我认为它可以帮助其他人。但请在评论中添加信息或纠正我，如果你觉得我说错了下面一些假设。的

更新19/2，2013：：我们已经清除了一些这方面的问号，我有存在的主要问题是，我将在下面更新的理论。不准备写一个解决回应它尚未虽然。的

UPDATE 19/2, 2013: We have cleared some question marks in this and I have a theory of what the main problem is which I'll update below. Not ready to write a "solved" response to it yet though.

更新24/4 2013年：事情已经在生产中稳定（虽然我相信这是暂时的），而现在，我认为这是由于两方面的原因。 1）端口的增加，和2）降低发送（转发）请求的数目。我将继续此更新进一步的倒在正确的上下文。的

UPDATE 24/4, 2013: Things have been stable in production (though I believe it is temporary) for a while now and I think it is due to two reasons. 1) port increase, and 2) reduced number of outgoing (forwarded) requests. I'll continue this update futher down in the correct context.

我们正在做我们的生产环境中的确定为什么当太多传出异步Web服务请求正在做我们的IIS Web服务器不结垢（一个传入的请求可能触发多个传出请求调查）。

We are currently doing an investigation in our production environment to determine why our IIS web server does not scale when too many outgoing asynchronous web service requests are being done (one incoming request may trigger multiple outgoing requests).

CPU只在20％左右，但我们收到的HTTP传入的请求，并多次传出Web请求503错误出现以下情况例外：SocketException：由于系统缺乏足够的缓冲区无法执行套接字上的操作空间或者因为队列已满的显然是有可扩展性瓶颈的地方，并需要我们找出它是什么，如果它是可以通过配置来解决它。

CPU is only at 20%, but we receive HTTP 503 errors on incoming requests and many outgoing web requests get the following exception: "SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full" Clearly there is a scalability bottleneck somewhere and we need to find out what it is and if it is possible to solve it by configuration.

应用程序上下文：

我们正在运行IIS 7.5版在Windows 2008 R2 64位操作系统使用.NET 4.5集成的托管管道。我们只使用1在IIS工作进程。硬件略有不同，但用于检查错误的机器是英特尔至强8核（16线程超）。

We are running IIS v7.5 integrated managed pipeline using .NET 4.5 on Windows 2008 R2 64 bit operating system. We use only 1 worker process in IIS. Hardware varies slightly but the machine used for examining the error is an Intel Xeon 8 core (16 hyper threaded).

我们使用同步和异步Web请求。那些异步正在使用新的.NET异步支持，使每个传入的请求使应用到其他服务器上保留的TCP连接（保持活动）多个HTTP请求。同步请求执行时间为0-32低MS（发生因线程上下文切换时间较长）。对于异步请求，执行时间可请求被中止长达120毫秒之前。

We use both asynchronous and synchronous web requests. Those that are asynchronous are using the new .NET async support to make each incoming request make multiple HTTP requests in the application to other servers on persisted TCP connections (keep-alive). Synchronous request execution time is low 0-32 ms (longer times occur due to thread context switching). For the asynchronous requests, execution time can be up to 120 ms before the requests are aborted.

通常，每个服务器最多可服务〜1000传入的请求。即将离任的请求〜300请求/ sec到〜600请求/秒时，问题就开始出现。只有传出异步时出现问题。请求在服务器上启用，我们去上面的传出请求（〜600 req./s）的一定水平。

Normally each server serves up to ~1000 incoming requests. Outgoing requests are ~300 requests/sec up to ~600 requests/sec when problem starts to arise. Problems only occurs when outgoing async. requests are enabled on the server and we go above a certain level of outgoing requests (~600 req./s).

的问题可能的解决方案：

在这个问题上搜索互联网发现可能的解决方案的候选人过多。虽然，他们都非常依赖于.NET，IIS版本和操作系统，因此需要时间来找到我们的上下文（安诺2013年）的东西。

Searching the Internet on this problem reveals a plethora of possible solutions candidates. Though, they are very much dependent upon versions of .NET, IIS and operating system so it takes time to find something in our context (anno 2013).

下面是解决方案的候选人和我们来到迄今与问候我们的配置方面的结论的列表。我已经在以下主要类别分类所检测的问题领域，到目前为止：

Below is a list of solution candidates and the conclusions we have come to so far with regards to our configuration context. I have categorised the detected problem areas, so far in the following main categories:

某些队列（S）填写

使用TCP连接和端口问题的（更新19/2，2013：这是问题）的

资源太慢分配

内存问题（更新19/2，2013：这是最有可能的另一个问题）的

Some queue(s) fill up
Problems with TCP connections and ports (UPDATE 19/2, 2013: This is the problem)
Too slow allocation of resources
Memory problems (UPDATE 19/2, 2013: This is most likely another problem)

即将离任的异步请求异常消息确实表明，缓冲区队列一些已经填满。但它没有说哪个队列/缓冲。通过<一个href=\"http://forums.iis.net/t/1194078.aspx/1?Scalability%20problem%20using%20web%20requests%20initiating%20asynch%20web%20service%20requests\">IIS论坛（和博客文章中引用那里）我已经能够分辨可能6（含）以上不同类型的请求管道队列的4标记低于A-F。

1) Some queue(s) fill up

The outgoing asynchronous request exception message does indicate that some queue of buffer has been filled up. But it does not say which queue/buffer. Via the IIS forum (and blog post referenced there) I have been able to distinguish 4 of possibly 6 (or more) different types of queues in the request pipeline labeled A-F below.

但应指出，所有下面定义的队列，我们看到了肯定，1.B）线程池的性能计数器排队请求得到问题的加载过程中非常充分。 所以，它很可能是问题的原因是在.NET水平不低于本（C-F）。

Though it should be stated that of all the below defined queues, we see for certain that the 1.B) ThreadPool performance counter Requests Queued gets very full during the problematic load. So it is likely that the cause of the problem is in .NET level and not below this (C-F).

我们使用发放，而不是HttpClient的，我们经历过同样的问题，但具有低得多REQ / s的阈值时，异步调用（异步支持）的.NET Framework类的WebClient。我们不知道是否与.NET Framework实现隐藏任何内部队列（S）或不超过线程池。我们不认为是这样。

We use the .NET framework class WebClient for issuing the asynchronous call (async support) as opposed to the HttpClient that we experienced had the same issue but with far lower req/s threshold. We do not know if the .NET Framework implementation hides any internal queue(s) or not above the Thread pool. We don’t think this is the case.

线程池充当自.NET线程（默认）调度从线程池拾取线程将要执行一个自然队列。

The Thread pool acts as a natural queue since the .NET Thread (default) Scheduler is picking threads from the thread pool to be executed.

性能计数器：[ASP.NET v4.0.30319] [请排队。

Performance counter: [ASP.NET v4.0.30319].[Requests Queued].

配置的可能性：

的（ApplicationPool）maxConcurrentRequestsPerCPU 的应该是（而不是previous 12）5000。所以在我们的情况下，它应该是5000 * 16 = 80.000请求/秒这应该是在我们的场景足以。

的（中processModel）AUTOCONFIG 的=真/假，允许的一些的线程池相关的配置根据机器配置进行设置。 我们用真正的这是一个潜在的错误人选，因为这些值可能会错误地为我们的（高）需要进行设置。

(ApplicationPool) maxConcurrentRequestsPerCPU should be 5000 (instead of previous 12). So in our case it should be 5000*16=80.000 requests/sec which should be sufficient enough in our scenario.
(processModel) autoConfig = true/false which allows some threadPool related configuration to be set according to machine configuration. We use true which is a potential error candidate since these values may be set erroneously for our (high) need.

如果线程池满了，请求开始堆积在此本机（未管理）队列中。

1.C) Global, process wide, native queue (IIS integrated mode only)

If the Thread Pool is full, requests starts to pile up in this native (not-managed) queue.

性能计数器： [ASP.NET v4.0.30319] [在纯队列中的请求]

Performance counter:[ASP.NET v4.0.30319].[Requests in Native Queue]

配置的可能性： ????

这队列不高于同一个队列为1.C）。下面是为据称我一个解释的是Http.sys内核队列本质上用户模式（IIS）接收来自内核模式的请求的完成端口（HTTP.sys中）。它有一个队列限制，而如果超出您会收到503状态code。该HTTPERR日志也将表明这事通过登录503状态和QueueFull的

This queue is not the same queue as 1.C) above. Here’s an explanation as stated to me "The HTTP.sys kernel queue is essentially a completion port on which user-mode (IIS) receives requests from kernel-mode (HTTP.sys). It has a queue limit, and when that is exceeded you will receive a 503 status code. The HTTPErr log will also indicate that this happened by logging a 503 status and QueueFull".

性能计数器：我一直没能找到这个队列中的任何性能计数器，但通过启用IIS HTTPERR日志，它应该是可能的，如果此队列被淹没检测

Performance counter: I have not been able to find any performance counter for this queue, but by enabling the IIS HTTPErr log, it should be possible to detect if this queue gets flooded.

配置的可能性：这在IIS中设置应用程序池，高级设置：队列长度。默认值是1000我看到建议增加至10.000。虽然尝试这种增长并没有解决我们的问题。

Configuration possibilities: This is set in IIS on the application pool, advanced setting: Queue Length. Default value is 1000. I have seen recommendations to increase it to 10.000. Though trying this increase has not solved our issue.

虽然不太可能，我猜可能OS的网卡缓冲区和队列的HTTP.sys之间实际上有一个队列的地方。

Although unlikely, I guess the OS could actually have a queue somewhere in between the network card buffer and the HTTP.sys queue.

作为请求到达网卡，它应该是自然的，它们被放置在一些缓冲，以便由某些操作系统内核线程被拾起。由于这是内核级执行，从而快速，也不太可能，它是罪魁祸首。

As request arrive to the network card, it should be natural that they are placed in some buffer in order to be picked up by some OS kernel thread. Since this is kernel level execution, and thus fast, it is not likely that it is the culprit.

Windows性能计数器： [网络接口]使用网卡实例[数据包丢弃的。

Windows Performance Counter: [Network Interface].[Packets Received Discarded] using the network card instance.

配置的可能性： ????

这是在这里和那里弹出，尽管我们的发送（异步）TCP请求是由一个持久（保持活动状态）的TCP连接的候选。因此，作为流量的增长，可用短周期端口的数量应真的只是由于传入请求成长。而且我们可以肯定，我们已经启用传出请求时，问题只出现了。

This is a candidate that pops up here and there, though our outgoing (async) TCP requests are made of a persisted (keep-alive) TCP connection. So as the traffic grows, the number of available ephemeral ports should really only grow due to the incoming requests. And we know for sure that the problem only arises when we have outgoing requests enabled.

然而，问题仍可能出现由于该请求的一个较长时间范围内的端口分配。一个传出请求可能需要长达120毫秒来执行（在.NET任务（线程）被取消之前），这可能意味着端口数得到分配用于更长的时间段。分析Windows性能计数器，验证因为TCPv4数这一假设。[连接建立当发生问题正常2-3000去峰值可达近12.000的总。

However, the problem may still arise due to that the port is allocated during a longer timeframe of the request. An outgoing request may take as long as 120 ms to execute (before the .NET Task (thread) is canceled) which might mean that the number of ports get allocated for a longer time period. Analyzing the Windows Performance Counter, verifies this assumption since the number of TCPv4.[Connection Established] goes from normal 2-3000 to peaks up to almost 12.000 in total when the problem occur.

我们已经验证了TCP连接的配置的最大金额设置为16384默认在这种情况下，它可能不是问题，虽然我们是危险地接近了最大限制。

We have verified that the configured maximum amount of TCP connections is set to the default of 16384. In this case, it may not be the problem, although we are dangerously close to the max limit.

当我们尝试使用没有任何输出它主要是在返回服务器上所有的netstat等也采用套装软件显示在开始的时候非常少的项目。如果我们让套装软件运行一段时间很快就开始出现新的（接收）连接相当快（比如25个连接/秒）。几乎所有的连接都在TIME_WAIT状态从一开始，这表明他们已经完成，等待清理。做那些连接使用临时端口？本地端口总是80，以及远程端口正在增加。我们希望，才能看到传出的连接使用套装软件，但我们不能看到他们列出的所有，这是非常奇怪的。不能这两个工具处理我们有连接的数量？
的（未完待续....但请与信息填写，如果你知道...）的

When we try using netstat on the server it mostly returns without any output at all, also using TcpView shows very few items in the beginning. If we let TcpView run for a while it soon starts to show new (incoming) connections quite rapidly (say 25 connections/sec). Almost all connections are in TIME_WAIT state from the beginning, suggesting that they have already completed and waiting for clean up. Do those connections use ephemeral ports? The local port is always 80, and the remote port is increasing. We wanted to use TcpView in order to see the outgoing connections, but we can’t see them listed at all, which is very strange. Can’t these two tools handle the amount of connections we are having? (To be continued.... But please fill in with info if you know it… )

Furhter多，作为一个侧踢这里。有人建议在这篇博客文章<一个href=\"http://blogs.msdn.com/b/tmarq/archive/2007/07/21/asp-net-thread-usage-on-iis-7-0-and-6-0.aspx\">ASP.NET在IIS 7.5中，线程使用IIS 7.0和IIS 6.0 即ServicePointManager.DefaultConnectionLimit应设置为int maxValue（最大值），否则可能是一个问题。但是在.NET 4.5，这是默认已经从一开始。

Furhter more, as a side kick here. It was suggested in this blog post "ASP.NET Thread Usage on IIS 7.5, IIS 7.0, and IIS 6.0" that ServicePointManager.DefaultConnectionLimit should be set to int maxValue which otherwise could be a problem. But in .NET 4.5, this is the default already from the start.

更新19/2，2013：

这是合理的假设，我们做了，其实打的16.384端口的最大限制。我们增加了一倍的端口数量几乎一台服务器上，并且只有旧服务器会遇到问题，当我们打传出请求的老峰值负载。那么，为什么在TCP.v4 [建立的连接]从来没有向我们展示的问题次较大的数字比〜12.000？我的理论：最有可能的，尽管未确立为事实上（还），性能计数器TCPv4 [成立连接]不等同于当前分配的端口的数量。我还没来得及赶上TCP状态学习，但我猜，有什么比连接建立显示这将会使端口被ccupied更多的TCP状态。虽然由于我们不能使用连接建立性能计数器，以此来检测用完端口的危险，我们发现实现这一最大端口范围检测时，一些其他的方式是很重要的。而如上文所述，我们不能既用netstat或我们的生产服务器应用程序套装软件用于该用途。这是个问题！（我会写更多关于它在即将到来的回应，我认为这个职位）

端口数在Windows仅限于一些最大65.535（虽然第一次〜1000应该不会被使用）。但是，它应该有可能通过降低TCP状态TIME_WAIT在众多places.It描述应该腾出端口更快的时间（默认为240秒），以避免用完港口的问题。我是第一次有点hestitant这个做这个，因为我们使用这两个长期运行的数据库查询以及WCF的TCP呼吁，我不希望descrease的时间限制。虽然不是在我的TCP状态机读出具有赶上了呢，我觉得这可能不是一个问题，毕竟。国家TIME_WAIT，我认为，只有在那里才能让一个适当的关闭握手到客户端。所以现有的TCP连接上的实际的数据传输不应当由于这个时限超时。更糟糕的情况下，客户端不能正常关机和它，而不是NEADS超时。我想所有的浏览器可能无法正确执行这一点，但也可能会仅在客户端上的问题。虽然我在这里猜测有点...

END UPDATE 19/2，2013

更新24/4 2013年：
我们已经端口的数量增加至最大值。与此同时，我们没有得到尽可能多的转发传出请求正如前面。这两个组合应该是，为什么我们还没有得到任何事件的原因。然而，由于发出的请求的数目被绑定到在未来再次增加这些服务器上它只是暂时的。因此，问题的关键在于，我认为，端口传入的请求有转发的请求的响应时间范围内保持开放。在我们的应用程序，这些转发的请求取消这一限制是120毫秒这可能与正常＆LT进行比较; 1ms的处理非转发的请求。因此，在本质上，我相信端口的数量肯定是这样的高吞吐量服务器的主要瓶颈，可扩展性（> 1000个请求/秒的〜16核机器），我们使用。这与高速缓存重装的GC工作相结合（参见下文），使服务器特别vulernable。

UPDATE 24/4, 2013: We have increased the number of port to to the maximum value. At the same time we do not get as many forwarded outgoing requests as earlier. These two in combination should be the reason why we have not had any incidents. However, it is only temporary since the number of outgoing requests are bound to increase again in the future on these servers. The problem thus lies in, I think, that port for the incoming requests has to remain open during the time frame for the response of the forwarded requests. In our application, this cancelation limit for these forwarded requests is 120 ms which could be compared with the normal <1ms to handle a non forwarded request. So in essence, I believe the definite number of ports is the major scalability bottleneck on such high throughput servers (>1000 requests/sec on ~16 cores machines) that we are using. This in combination with the GC work on cache reload (se below) makes the server especially vulernable.

END UPDATE 24/4

我们的性能计数器显示，在该问题的时间在线程池（1B）排队的请求数量波动了很多。所以潜在地，这意味着我们具有其中队列长度开始振荡动态情况下，由于环境的变化。举例来说，如果有洪水时的流量是洪水被激活的保护机制，这将是这种情况。正因为如此，我们有许多这样的机制：

Our performance counters show that the number of queued requests in the Thread Pool (1B) fluctuates a lot during the time of the problem. So potentially this means that we have a dynamic situation in which the queue length starts to oscillate due to changes in the environment. For instance, this would be the case if there are flooding protection mechanisms that are activated when traffic is flooding. As it is, we have a number of these mechanisms:

当事情真的不好，服务器用HTTP 503错误响应，负载均衡器会自动从被生产活跃15秒钟内取出Web服务器。这意味着其他服务器将在时间帧期间采取的增加的负荷。在冷却期，服务器可以提供完成其请求，当负载平衡器执行其下一个脉冲它会自动恢复。当然这仅仅是好的，只要所有服务器没有问题的一次。幸运的是，到目前为止，我们还没有过这种情况。

When things go really bad and the server responds with a HTTP 503 error, the load balancer will automatically remove the web server from being active in production for a 15 second period. This means that the other servers will take the increased load during the time frame. During the "cooling period", the server may finish serving its request and it will automatically be reinstated when the load balancer does its next ping. Of course this only is good as long as all servers don’t have a problem at once. Luckily, so far, we have not been in this situation.

在Web应用程序中，我们有自己的构建阀（是的，这是一个阀门不是价值）由Windows性能计数器在线程池中排队请求触发。有一个线程，在开始的Application_Start，即检查该性能计数器值每秒。而如果该值超过2000，所有传出的流量不再启动。下一秒，如果队列值小于2000，传出流量再次启动。

In the web application, we have our own constructed valve (Yes. It is a "valve". Not a "value") triggered by a Windows Performance Counter for Queued Requests in the thread pool. There is a thread, started in Application_Start, that checks this performance counter value each second. And if the value exceeds 2000, all outgoing traffic ceases to be initiated. The next second, if the queue value is below 2000, outgoing traffic starts again.

在这里奇怪的是，它没有到达错误情况帮助我们，因为我们没有发生这种情况的很多记录。这可能意味着，当流量击中我们努力，事情确实很快变坏，使1秒的时间间隔检查实际上是太高了。

The strange thing here is that it has not helped us from reaching the error scenario since we don’t have much logging of this occurring. It may mean that when traffic hits us hard, things goes bad really quickly so that the 1 second time interval check actually is too high.

有是本另一个方面。当有需要的应用程序池多个线程，这些线程得到非常缓慢分配。从我读，每秒1-2个线程。这是因为它是昂贵的创建线程，因为你不希望反正为了避免昂贵的上下文同步的情况下切换线程太多，我觉得这是很自然的。然而，它也应该意味着，如果突然大量突发流量的打我们，线程数量不会接近足以满足请求将启动异步的情况和排队等待的需要。这是一个非常容易出现问题的候选人，我认为。一个候选解决方案可能是随后增加的线程池创建的线程的最小量。但我想，这也可能会影响同步运行要求的性能。

There is another aspect of this as well. When there is a need for more threads in the application pool, these threads gets allocated very slowly. From what I read, 1-2 threads per second. This is so because it is expensive to create threads and since you don’t want too many threads anyways in order to avoid expensive context switching in the synchronous case, I think this is natural. However, it should also mean that if a sudden large burst of traffic hits us, the number of threads are not going to be near enough to satisfy the need in the asynchronous scenario and queuing of requests will start. This is a very likely problem candidate I think. One candidate solution may be then to increase the minimum amount of created threads in the ThreadPool. But I guess this may also effect performance of the synchronously running requests.

的（乔伊·雷耶斯写这个<一个href=\"http://blogs.msdn.com/b/josere/archive/2011/09/13/cpu-throttling-for-asp-net-asynchronous-scenarios.aspx\">here在一篇博客文章）的
由于对象收集后得到异步请求（高达120ms的在我们的情况以后），可能会出现内存问题，因为对象可以被提升到第1代和理所应当的记忆不会被回忆经常。垃圾收集器增加pressure很可能导致扩展的线程上下文切换发生，进一步削弱服务器的能力。

(Joey Reyes wrote about this here in a blog post) Since objects get collected later for asynchronous requests (up to 120ms later in our case), memory problem can arise since objects can be promoted to generation 1 and the memory will not be recollected as often as it should. The increased pressure on the Garbage Collector may very well cause extended thread context switching to occur and further weaken capacity of the server.

但是，我们没有看到在这个问题的时候增加GC-也不CPU使用率，所以我们不认为建议的CPU限制机制对我们来说是一个解决方案。

However, we don’t see an increased GC- nor CPU usage during the time of the problem so we don’t think the suggested CPU throttling mechanism is a solution for us.

更新19/2，2013：：我们在使用定期intervalls高速缓存交换机制，在其中（几乎）完全在内存中缓存加载到内存中，旧的缓存可以得到垃圾收集。在这些时候，GC将不得不从正常的请求处理更加努力地工作，并窃取资源。使用Windows性能计数器用于线程上下文切换它表明上下文的数量切换减少从以高的GC使用时的正常高值显著。我认为，在这样的高速缓存重新加载，服务器额外vulnernable为排队的请求，这是必要的，以减少GC的足迹。一个潜在修复的问题是只填充缓存不分配内存中的所有的时间。多做一些工作，但它应该是可行的。的

UPDATE 19/2, 2013: We use a cache swap mechanism at regular intervalls at which an (almost) full in-memory cache is reload into memory and the old cache can get garbage collected. At these times, the GC will have to work harder and steal resources from the normal request handling. Using Windows Performance counter for thread context switching it shows that the number of context switches decreases significantly from the normal high value at the time of a high GC usage. I think that during such cache reloads, the server is extra vulnernable for queueing up requests and it is necessary to reduce the footprint of the GC. One potential fix to the problem would be to just fill the cache without allocating memory all the time. A bit more work, but it should be doable.

更新24/4 2013年：
我仍然在高速缓存重新加载存储器TWEAK的中间，以避免具有运行之多GC上。但是，我们通常有一些暂时1000排队的请求的GC运行时。因为它运行在所有线程，这是naturall它窃取资源从处理正常请求。一旦这个调整已部署我会更新这个状态，我们可以看到一个区别。的

END UPDATE 24/4

在IIS 7.5中使用传出异步Web请求时可伸缩性问题 [英] Scalability issue when using outgoing asynchronous web requests on IIS 7.5

问题描述

1) Some queue(s) fill up

1.C) Global, process wide, native queue (IIS integrated mode only)

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录关闭

在IIS 7.5中使用传出异步Web请求时可伸缩性问题 [英] Scalability issue when using outgoing asynchronous web requests on IIS 7.5

问题描述

1) Some queue(s) fill up

1.C) Global, process wide, native queue (IIS integrated mode only)

推荐答案

相关文章

C#/.NET最新文章

热门教程

热门工具

登录 关闭

登录关闭