Highly Concurrent Apache Async HTTP Client IOReactor issues

Question

Application description:

  • I'm using the Apache HTTP Async Client (version 4.1.1) wrapped by Comsat's Quasar FiberHttpClient (version 0.7.0) to run a highly concurrent Java application that uses fibers to internally send HTTP requests to multiple HTTP end-points
  • The application runs on top of Tomcat (however, fibers are used only for internal request dispatching; Tomcat servlet requests are still handled in the standard blocking way)
  • Each external request opens 15-20 fibers internally; each fiber builds an HTTP request and uses the FiberHttpClient to dispatch it (see the usage sketch after the configuration below)
  • I'm using a c4.4xlarge server (16 cores) to test my application
  • The end-points I'm connecting to preempt keep-alive connections, meaning that if I try to maintain them by reusing sockets, connections get closed during request execution attempts. Therefore, I disable connection recycling.
  • According to the above, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance):

PoolingNHttpClientConnectionManager connectionManager =
    new PoolingNHttpClientConnectionManager(
        // DefaultConnectingIOReactor(IOReactorConfig) throws the checked
        // IOReactorException, so this runs inside a try/throws context.
        new DefaultConnectingIOReactor(
            IOReactorConfig.custom()
                .setIoThreadCount(16)
                .setSoKeepAlive(false)
                .setSoLinger(0)
                .setSoReuseAddress(false)
                .setSelectInterval(10)
                .build()
        )
    );

connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

// FiberHttpClientBuilder.build() produces the (fiber-blocking) client itself,
// not another builder, so the result is held as a CloseableHttpClient.
CloseableHttpClient fiberClient = FiberHttpClientBuilder
    .create()
    .setDefaultRequestConfig(
        RequestConfig.custom()
            .setSocketTimeout(1500)
            .setConnectTimeout(1000)
            .build()
    )
    .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
    .setConnectionManager(connectionManager)
    .build();
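
For reference, here is a minimal sketch of how one of the internal fibers might dispatch a request through this client. It assumes Quasar's Fiber/SuspendableRunnable API and Comsat's fiber-blocking CloseableHttpClient; the endpoint URL and response handling are placeholders, not the author's actual code:

import java.io.IOException;
import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.fibers.SuspendExecution;
import co.paralleluniverse.strands.SuspendableRunnable;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

// Sketch only: one of the 15-20 fibers opened per external request.
new Fiber<Void>(new SuspendableRunnable() {
    @Override
    public void run() throws SuspendExecution, InterruptedException {
        HttpGet request = new HttpGet("http://example.com/endpoint"); // hypothetical endpoint
        try (CloseableHttpResponse response = fiberClient.execute(request)) {
            // Fiber-blocking call above: the fiber parks instead of blocking a kernel thread
            String body = EntityUtils.toString(response.getEntity());
            // ... hand the body back to the aggregating external request ...
        } catch (IOException e) {
            // per-request failure handling
        }
    }
}).start();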

  • ulimits for open files are set very high (131072 for both soft and hard values), and the following kernel settings are applied:

    kernel.printk = 8 4 1 7
    kernel.printk_ratelimit_burst = 10
    kernel.printk_ratelimit = 5
    net.ipv4.ip_local_port_range = 8192 65535
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 16777216
    net.core.wmem_default = 16777216
    net.core.optmem_max = 40960
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.core.netdev_max_backlog = 100000
    net.ipv4.tcp_max_syn_backlog = 100000
    net.ipv4.tcp_max_tw_buckets = 2000000
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_fin_timeout = 10
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_sack = 0
    net.ipv4.tcp_timestamps = 1

    Problem description:

    • Under low-to-medium load all is well: connections are leased, closed, and the pool replenishes
    • Beyond some concurrency point, the IOReactor threads (16 of them) seem to stop functioning properly, prior to dying.
    • I've written a small thread to get the pool stats and print them each second (a rough sketch of such a printer follows this list). At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests as well
    • This situation persists and basically renders the application useless. At some point the I/O Reactor threads die; I'm not sure when, and I haven't been able to catch the exceptions so far
    • Running lsof against the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get around to actually closing them)
    • During the time the application breaks, the server is not heavily overloaded/CPU stressed
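
    A minimal sketch of such a stats printer, assuming the PoolingNHttpClientConnectionManager instance from the configuration above (an illustration, not the author's exact code):

    import org.apache.http.pool.PoolStats;

    Thread statsPrinter = new Thread(() -> {
        while (!Thread.currentThread().isInterrupted()) {
            PoolStats stats = connectionManager.getTotalStats(); // leased/pending/available across all routes
            System.out.println("leased=" + stats.getLeased()
                    + " pending=" + stats.getPending()
                    + " available=" + stats.getAvailable()
                    + " max=" + stats.getMax());
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }, "pool-stats");
    statsPrinter.setDaemon(true);
    statsPrinter.start();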

    Questions

    • I'm guessing I'm hitting some sort of boundary somewhere, though I'm rather clueless as to what it might be or where it resides, except for the following:
    • Is it possible I'm reaching an OS port limit (after all, every application request originates from a single internal IP), producing an error that makes the IO Reactor threads die (something similar to open-file limit errors)?

    Answer

    Forgot to answer this, but I figured out what was going on roughly a week after posting the question:

    1. There was some sort of misconfiguration that caused the io-reactor to spawn with only 2 threads.
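
    As a sanity check, one rough way to confirm how many reactor threads actually started (this assumes the default Apache HttpCore NIO thread naming of "I/O dispatcher N"; a custom ThreadFactory would change the prefix):

    long ioDispatcherThreads = Thread.getAllStackTraces().keySet().stream()
            .filter(t -> t.getName().startsWith("I/O dispatcher"))
            .count();
    System.out.println("Live I/O dispatcher threads: " + ioDispatcherThreads);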

    Even after providing more reactor threads, the issue persisted. It turned out that our outgoing requests were mostly SSL. Apache's SSL connection handling delegates the core work to the JVM's SSL facilities, which are simply not efficient enough to handle thousands of SSL connection requests per second. More specifically, some methods inside SSLEngine (if I recall correctly) are synchronized. Taking thread dumps under high load showed the IOReactor threads blocking each other while trying to open SSL connections.

    Even trying to create a pressure release valve in the form of a connection lease-timeout didn't work, because the backlogs created were too large, rendering the application useless.
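
    For context, the "pressure release valve" above means bounding how long a request may wait for a connection lease from the pool; with Apache HttpClient 4.x that can be expressed roughly as below, where the 500 ms figure is purely illustrative:

    RequestConfig boundedLeaseConfig = RequestConfig.custom()
            .setConnectionRequestTimeout(500) // max ms to wait for a connection lease from the pool
            .setConnectTimeout(1000)
            .setSocketTimeout(1500)
            .build();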

    Offloading the outgoing SSL request handling to nginx performed even worse: because the remote endpoints terminate the requests preemptively, the SSL client session cache could not be used (and the same goes for the JVM implementation).

    I wound up putting a semaphore in front of the entire module, limiting the number of in-flight requests to ~6000 at any given moment, which solved the issue.
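
    A minimal sketch of that throttle, using a plain java.util.concurrent.Semaphore in front of the dispatch path (the class and method names are hypothetical; in fiber code a fiber-blocking semaphore equivalent would avoid pinning threads, but the shape is the same):

    import java.util.concurrent.Callable;
    import java.util.concurrent.Semaphore;

    public final class ThrottledDispatcher {
        private static final int MAX_IN_FLIGHT = 6000; // the ~6000 cap mentioned above
        private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);

        public <T> T dispatch(Callable<T> request) throws Exception {
            permits.acquire();          // wait when the cap is reached
            try {
                return request.call();  // the actual FiberHttpClient call goes here
            } finally {
                permits.release();
            }
        }
    }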
