Blocked system-critical thread has been detected

Problem description

I'm using Ignite.NET 2.7.6. The setup is one server and about 40 clients. After 8 hours of work, the server starts behaving strangely: clients cannot connect to it, some queries return no results, etc.

On the server side, memory consumption is OK, the number of threads is about 250, and everything looks fine. I don't see any obvious problems, so I decided to address all the issues on the server side that were marked as SEVERE.

The first one I encountered is:

Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]

So I want to understand why this happens. The full server log can be found here:

https://yadi.sk/d/LF03Vz5vz4tRcw

https://yadi.sk/d/MMe0xrgI3k6lkA

Added: the issue doesn't seem to be innocuous; this message appears every second from various threads, and the "blockedFor" value grows from seconds to hours.

The load on the server is low, but as the server's threads become locked, it stops responding and stops registering new clients.

Here are logs from the server:

https://yadi.sk/d/tc3g2hb9B0jtvg

https://yadi.sk/d/05YrlYXcp4xPqg

This is the log from one client:

https://yadi.sk/d/bcbQ7ee4PUzq2w

The last lines of the client's log are at 19:03:52, when the server was restarted.

Answer

As Denis described, there are a lot of network communication issues.

In general, a client wants to perform some cache operation, but a server thread from the striped pool is blocked for a long time. I don't think this is related to the .NET part.
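
To make the mechanics concrete, here is a minimal Java sketch of the kind of client call that gets stuck (the cache name and key are made up for illustration): each key belongs to a partition, each partition is served by a single sys-stripe-N thread on the server, so while that stripe is blocked every operation routed to it simply waits.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class ClientPutSketch {
        public static void main(String[] args) {
            // Start a thick client node with the default configuration.
            Ignition.setClientMode(true);

            try (Ignite client = Ignition.start()) {
                IgniteCache<Integer, String> cache = client.getOrCreateCache("myCache");

                // On the server this update is executed by the sys-stripe-N thread
                // that owns the key's partition; if that stripe is blocked, the call hangs.
                cache.put(1, "value");
            }
        }
    }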

You can see the following message:

[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]

If you take a look at that thread:

Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
        at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
        at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)

The thread is trying to send a Continuous Query callback but is failing to establish a connection to a client node. This causes the thread to be blocked, and it cannot serve other cache API requests that require the same partition.
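
For context, this is roughly what the other end of that call looks like: when a client registers a continuous query, the server must push every matching update back to that client, which is the addNotification/sendNotification path in the stack trace above. A minimal Java sketch of such a registration (the question's clients use Ignite.NET, and the cache name and listener here are made up):

    import javax.cache.Cache;
    import javax.cache.event.CacheEntryEvent;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.query.ContinuousQuery;
    import org.apache.ignite.cache.query.QueryCursor;

    public class ContinuousQuerySketch {
        public static void main(String[] args) throws Exception {
            Ignition.setClientMode(true);

            try (Ignite client = Ignition.start()) {
                IgniteCache<Integer, String> cache = client.getOrCreateCache("myCache");

                ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();

                // The local listener runs on this client; the server delivers each
                // update to it over TcpCommunicationSpi, which is the send that was
                // blocked in the stack trace.
                qry.setLocalListener(events -> {
                    for (CacheEntryEvent<? extends Integer, ? extends String> e : events)
                        System.out.println("Updated: " + e.getKey() + " -> " + e.getValue());
                });

                // While the cursor stays open, every update to the cache produces a
                // callback that the server has to push to this client node.
                try (QueryCursor<Cache.Entry<Integer, String>> cur = cache.query(qry)) {
                    Thread.sleep(60_000);
                }
            }
        }
    }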

At first glance, you could try to reduce clientFailureDetectionTimeout; the default is 30 seconds. But this won't fix the network issues completely.
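
If you go that route, clientFailureDetectionTimeout is set on the server's IgniteConfiguration. A minimal Java sketch (the 10-second value is only an example of something lower than the 30-second default, not a recommendation):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerConfigSketch {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();

            // Drop unreachable clients sooner than the default 30_000 ms, so a
            // blocked communication attempt gives up earlier. Example value only.
            cfg.setClientFailureDetectionTimeout(10_000);

            // Start the server node with the adjusted timeout.
            Ignite server = Ignition.start(cfg);
        }
    }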
