Search thread_pool for particular nodes always at maximum

Problem description

I have an Elasticsearch cluster with 6 nodes. The heap size is set to 50 GB. (I know less than 32 GB is what is recommended, but it was already set to 50 GB for some reason I don't know.) Now I am seeing a lot of rejections from the search thread_pool.

This is my current search thread_pool:

node_name               name   active rejected  completed
1105-IDC.node          search      0 19295154 1741362188
1108-IDC.node          search      0  3362344 1660241184
1103-IDC.node          search     49 28763055 1695435484
1102-IDC.node          search      0  7715608 1734602881
1106-IDC.node          search      0 14484381 1840694326
1107-IDC.node          search     49 22470219 1641504395
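
For reference, a table like the one above typically comes from the cat thread_pool API; below is a minimal sketch of an equivalent call using Python's requests library, assuming the cluster is reachable at localhost:9200:

import requests

# Assumed coordinating-node address; replace with your cluster's URL.
ES_URL = "http://localhost:9200"

# The cat thread_pool API returns point-in-time stats for the search pool on every node.
resp = requests.get(
    f"{ES_URL}/_cat/thread_pool/search",
    params={"v": "true", "h": "node_name,name,active,rejected,completed"},
)
print(resp.text)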

Something I have noticed is that two nodes always have the maximum number of active threads (1103-IDC.node & 1107-IDC.node). Even though the other nodes also have rejections, these two have the highest counts. Their hardware is similar to the other nodes. What could be the reason for this? Could it be that they hold particular shards that receive more hits? If yes, how do I find them?

Also, young-generation GC pauses are taking more than 70 ms (sometimes around 200 ms) on the nodes where the active thread count is always at maximum. Below are some lines from the GC log:

[2020-10-27T04:32:14.380+0000][53678][gc             ] GC(6768757) Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms
[2020-10-27T04:32:26.206+0000][53678][gc,start       ] GC(6768758) Pause Young (Allocation Failure)
[2020-10-27T04:32:26.313+0000][53678][gc             ] GC(6768758) Pause Young (Allocation Failure) 27897M->26444M(51008M) 107.850ms
[2020-10-27T04:32:35.466+0000][53678][gc,start       ] GC(6768759) Pause Young (Allocation Failure)
[2020-10-27T04:32:35.574+0000][53678][gc             ] GC(6768759) Pause Young (Allocation Failure) 27975M->26444M(51008M) 108.923ms
[2020-10-27T04:32:40.993+0000][53678][gc,start       ] GC(6768760) Pause Young (Allocation Failure)
[2020-10-27T04:32:41.077+0000][53678][gc             ] GC(6768760) Pause Young (Allocation Failure) 27975M->26427M(51008M) 84.411ms
[2020-10-27T04:32:45.132+0000][53678][gc,start       ] GC(6768761) Pause Young (Allocation Failure)
[2020-10-27T04:32:45.200+0000][53678][gc             ] GC(6768761) Pause Young (Allocation Failure) 27958M->26471M(51008M) 68.105ms
[2020-10-27T04:32:46.984+0000][53678][gc,start       ] GC(6768762) Pause Young (Allocation Failure)
[2020-10-27T04:32:47.046+0000][53678][gc             ] GC(6768762) Pause Young (Allocation Failure) 28001M->26497M(51008M) 62.678ms
[2020-10-27T04:32:56.641+0000][53678][gc,start       ] GC(6768763) Pause Young (Allocation Failure)
[2020-10-27T04:32:56.719+0000][53678][gc             ] GC(6768763) Pause Young (Allocation Failure) 28027M->26484M(51008M) 77.596ms
[2020-10-27T04:33:29.488+0000][53678][gc,start       ] GC(6768764) Pause Young (Allocation Failure)
[2020-10-27T04:33:29.740+0000][53678][gc             ] GC(6768764) Pause Young (Allocation Failure) 28015M->26516M(51008M) 251.447ms
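
To put numbers on these pauses, one option is to pull the "Pause Young" durations out of the unified GC log with a small script; a minimal sketch, where the log path is an assumption:

import re
import statistics

GC_LOG = "gc.log"  # assumed path; point this at the node's actual GC log

# Matches the trailing duration on lines like:
#   "... Pause Young (Allocation Failure) 27884M->26366M(51008M) 196.226ms"
pause_re = re.compile(r"Pause Young .* (\d+(?:\.\d+)?)ms$")

pauses = []
with open(GC_LOG) as f:
    for line in f:
        m = pause_re.search(line.strip())
        if m:
            pauses.append(float(m.group(1)))

if pauses:
    print(f"young GC pauses: n={len(pauses)}, "
          f"avg={statistics.mean(pauses):.1f} ms, max={max(pauses):.1f} ms")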

Recommended answer

One important thing to note is that if you got these stats from the Elasticsearch thread_pool cat API, they show only point-in-time data; they don't show historical data for ranges like the last 1 hour, 6 hours, 1 day, or 1 week.

Also, rejected and completed are cumulative stats since the last restart of each node, so they are not very helpful either when trying to figure out whether some ES nodes are becoming hot spots due to a bad or unbalanced shard configuration.
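
Since the cat API is point-in-time and the counters are cumulative, one rough way to get history is to poll it and log the deltas per interval; a minimal sketch, again assuming a cluster at localhost:9200:

import time
import requests

ES_URL = "http://localhost:9200"  # assumed cluster address

def search_pool_stats():
    """Return {node_name: (active, rejected, completed)} from the cat thread_pool API."""
    resp = requests.get(
        f"{ES_URL}/_cat/thread_pool/search",
        params={"format": "json", "h": "node_name,active,rejected,completed"},
    )
    return {
        row["node_name"]: (int(row["active"]), int(row["rejected"]), int(row["completed"]))
        for row in resp.json()
    }

prev = search_pool_stats()
while True:
    time.sleep(60)  # sample once a minute; adjust as needed
    cur = search_pool_stats()
    for node, (active, rejected, completed) in cur.items():
        _, prev_rej, prev_comp = prev.get(node, (0, rejected, completed))
        # Deltas show rejections/completions in this interval rather than since the last restart.
        print(f"{node}: active={active} "
              f"rejected +{rejected - prev_rej} completed +{completed - prev_comp}")
    prev = cur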

So there are two very important things to figure out here:

  1. Make sure you know the actual hotspot nodes in the cluster by looking at the average active and rejected requests on the data nodes over a time range (you can just check the peak hours). This is much easier if you have a monitoring tool that records these metrics over time.
  2. Once the hotspot nodes are known, look at the shards allocated to them and compare them to the shards on the other nodes. A few metrics to check are the number of shards, which shards receive more traffic, and which shards receive the slowest queries. Most of these you have to figure out by looking at various ES metrics and APIs, which can be very time-consuming and requires a lot of internal ES knowledge; a rough starting point is sketched after this list.
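
As a starting point for step 2, one could compare shard counts per node with the cat shards API and rank indices by search load with the index stats API; a minimal sketch, with the same assumed cluster address, and where "top 10 by query count" is an arbitrary cut-off:

from collections import Counter
import requests

ES_URL = "http://localhost:9200"  # assumed cluster address

# 1. Count shards per node to spot an unbalanced allocation.
shards = requests.get(
    f"{ES_URL}/_cat/shards",
    params={"format": "json", "h": "index,shard,prirep,node"},
).json()
per_node = Counter(row["node"] for row in shards if row["node"])
print("shards per node:", dict(per_node))

# 2. Rank indices by search traffic to see which shards are hit hardest.
stats = requests.get(f"{ES_URL}/_stats/search").json()
ranked = sorted(
    stats["indices"].items(),
    key=lambda kv: kv[1]["total"]["search"]["query_total"],
    reverse=True,
)
for index, s in ranked[:10]:
    total = s["total"]["search"]
    print(f"{index}: queries={total['query_total']} time_ms={total['query_time_in_millis']}")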
