Highly Concurrent Apache Async HTTP Client IOReactor issues


Question

Application description:

  • I'm using the Apache HTTP Async Client (version 4.1.1) wrapped by Comsat's Quasar FiberHttpClient (version 0.7.0) in order to run and execute a highly concurrent Java application that uses fibers to internally send HTTP requests to multiple HTTP end-points
  • The application is running on top of Tomcat (however, fibers are used only for internal request dispatching; Tomcat servlet requests are still handled in the standard blocking way)
  • Each external request opens 15-20 fibers internally; each fiber builds an HTTP request and uses the FiberHttpClient to dispatch it (a rough sketch of this fan-out appears after the configuration snippet below)
  • I'm using a c4.4xlarge server (16 cores) to test my application
  • The end-points I'm connecting to preemptively close keep-alive connections, meaning that if I try to keep them open by reusing sockets, connections get closed during request execution attempts. Therefore, I disable connection recycling.
  • Given the above, here's the tuning for my fiber HTTP client (of which, of course, I'm using a single instance):

// Imports for the snippet (the FiberHttpClientBuilder import is from Comsat's
// comsat-httpclient module; the exact package name is assumed here):
import co.paralleluniverse.fibers.httpclient.FiberHttpClientBuilder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.NoConnectionReuseStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor;
import org.apache.http.impl.nio.reactor.IOReactorConfig;

// One I/O dispatcher thread per core, keep-alive and lingering disabled,
// short select interval. Note: the DefaultConnectingIOReactor constructor
// throws IOReactorException, so run this where that can be handled.
PoolingNHttpClientConnectionManager connectionManager =
        new PoolingNHttpClientConnectionManager(
                new DefaultConnectingIOReactor(IOReactorConfig.custom()
                        .setIoThreadCount(16)
                        .setSoKeepAlive(false)
                        .setSoLinger(0)
                        .setSoReuseAddress(false)
                        .setSelectInterval(10)
                        .build()));

// Effectively unbounded pool limits, since connection reuse is disabled anyway.
connectionManager.setDefaultMaxPerRoute(32768);
connectionManager.setMaxTotal(131072);

// build() returns the ready-to-use client (a CloseableHttpClient), not a builder.
CloseableHttpClient fiberClient = FiberHttpClientBuilder
        .create()
        .setDefaultRequestConfig(RequestConfig.custom()
                .setSocketTimeout(1500)
                .setConnectTimeout(1000)
                .build())
        .setConnectionReuseStrategy(NoConnectionReuseStrategy.INSTANCE)
        .setConnectionManager(connectionManager)
        .build();
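
For context, here is a rough sketch of the per-request fan-out described above. It assumes the usual Quasar instrumentation is in place and that the built fiber client is shared; the class, method, and variable names are illustrative, not the application's actual code:

import co.paralleluniverse.fibers.Fiber;
import co.paralleluniverse.strands.SuspendableRunnable;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative fan-out: one external request spawns a fiber per endpoint
// (15-20 in this application), each dispatching through the shared client.
class EndpointFanOut {
    private final CloseableHttpClient fiberClient;   // the single FiberHttpClient instance

    EndpointFanOut(CloseableHttpClient fiberClient) {
        this.fiberClient = fiberClient;
    }

    void callAll(List<String> endpointUrls) throws Exception {
        List<Fiber<Void>> fibers = new ArrayList<>();
        for (String url : endpointUrls) {
            Fiber<Void> f = new Fiber<>((SuspendableRunnable) () -> {
                try (CloseableHttpResponse resp = fiberClient.execute(new HttpGet(url))) {
                    EntityUtils.consume(resp.getEntity());   // fiber-blocking: parks the fiber, not an OS thread
                } catch (IOException e) {
                    // per-endpoint error handling would go here
                }
            });
            f.start();
            fibers.add(f);
        }
        for (Fiber<Void> f : fibers) {
            f.join();   // wait for all endpoint calls to finish
        }
    }
}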

  • ulimits for open files are set very high (131072 for both soft and hard values)
  • Kernel/TCP settings (sysctl):

    kernel.printk = 8 4 1 7
    kernel.printk_ratelimit_burst = 10
    kernel.printk_ratelimit = 5
    net.ipv4.ip_local_port_range = 8192 65535
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.core.rmem_default = 16777216
    net.core.wmem_default = 16777216
    net.core.optmem_max = 40960
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.core.netdev_max_backlog = 100000
    net.ipv4.tcp_max_syn_backlog = 100000
    net.ipv4.tcp_max_tw_buckets = 2000000
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
    net.ipv4.tcp_fin_timeout = 10
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_sack = 0
    net.ipv4.tcp_timestamps = 1

Problem description:

  • Under low-to-medium load all is well: connections are leased, closed, and the pool replenishes
  • Beyond some concurrency point, the IOReactor threads (16 of them) seem to stop functioning properly, prior to dying.
  • I've written a small thread to get the pool stats and print them each second (a minimal sketch of such a monitor follows this list). At around 25K leased connections, actual data is no longer sent over the socket connections, and the Pending stat climbs to a sky-rocketing 30K pending connection requests as well
  • This situation persists and basically renders the application useless. At some point the I/O Reactor threads die; I'm not sure when, and so far I haven't been able to catch the exceptions
  • lsof-ing the java process, I can see it has tens of thousands of file descriptors, almost all of them in CLOSE_WAIT (which makes sense, as the I/O reactor threads die/stop functioning and never get around to actually closing them)
  • During the time the application breaks, the server is not heavily overloaded or CPU stressed
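
The once-per-second pool monitor mentioned above can be as simple as a daemon thread polling the connection manager's totals. A minimal sketch, assuming access to the connectionManager instance built in the configuration snippet (the class name is illustrative):

import org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager;
import org.apache.http.pool.PoolStats;

// Prints leased / pending / available / max once per second.
class PoolStatsLogger extends Thread {
    private final PoolingNHttpClientConnectionManager cm;

    PoolStatsLogger(PoolingNHttpClientConnectionManager cm) {
        this.cm = cm;
        setDaemon(true);
        setName("pool-stats-logger");
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            PoolStats stats = cm.getTotalStats();
            System.out.printf("leased=%d pending=%d available=%d max=%d%n",
                    stats.getLeased(), stats.getPending(),
                    stats.getAvailable(), stats.getMax());
            try {
                Thread.sleep(1000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}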

Questions

  • My guess is that I'm hitting some sort of boundary somewhere, though I'm fairly clueless as to what it might be or where it resides, except for the following
  • Is it possible I'm reaching an OS port limit (all application requests originate from a single internal IP, after all) and producing an error that causes the IO Reactor threads to die (something similar to open-files limit errors)?

Answer

Forgot to answer this, but I figured out what was going on roughly a week after posting the question:

1. There was some sort of misconfiguration that caused the io-reactor to spawn with only 2 threads.

Even after providing more reactor threads, the issue persisted. It turns out that our outgoing requests were mostly SSL. Apache's SSL connection handling delegates the core work to the JVM's SSL facilities, which are simply not efficient enough to handle thousands of SSL connection requests per second. More specifically, some methods inside SSLEngine (if I recall correctly) are synchronized. Thread dumps taken under high load show the IOReactor threads blocking each other while trying to open SSL connections.

Even trying to create a pressure release valve in the form of a connection lease timeout didn't work, because the backlogs created were too large, rendering the application useless.
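
For reference, in Apache HttpClient terms a connection lease timeout is expressed as the connection-request timeout on RequestConfig, i.e. how long a caller waits for a connection to be leased from the pool before failing fast. A minimal sketch with illustrative values (not necessarily what the author used):

import org.apache.http.client.config.RequestConfig;

// Illustrative only: fail fast when no pooled connection can be leased
// within 500 ms instead of letting a backlog build up.
RequestConfig reliefValveConfig = RequestConfig.custom()
        .setConnectionRequestTimeout(500)   // max wait for a connection lease, in ms
        .setSocketTimeout(1500)
        .setConnectTimeout(1000)
        .build();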

Offloading outgoing SSL request handling to nginx performed even worse: because the remote end-points terminate the connections preemptively, SSL client session caching could not be used (and the same goes for the JVM implementation).

Wound up putting a semaphore in front of the entire module, limiting the whole thing to ~6000 concurrent dispatches at any given moment, which solved the issue.
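
A minimal sketch of such a gate, using a plain java.util.concurrent.Semaphore (in fiber code a strand-aware semaphore would serve the same role); the permit count mirrors the ~6000 limit mentioned above, and the class and method names are illustrative:

import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

// Illustrative throttle: at most ~6000 dispatches in flight at any moment.
class ThrottledDispatcher {
    private final Semaphore gate = new Semaphore(6000);

    <T> T dispatch(Callable<T> httpCall) throws Exception {
        gate.acquire();              // callers wait here once the limit is reached
        try {
            return httpCall.call();  // the actual request execution goes here
        } finally {
            gate.release();
        }
    }
}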

