Significant terms causes a CircuitBreakingException


Question

I've got a mid-size Elasticsearch index (1.46 TB, or ~1e8 docs). It's running on 4 servers, each of which has 64 GB of RAM split evenly between Elasticsearch and the OS (for caching).

I want to try out the new "significant terms" aggregation, so I fired off the following query:

{
  "query": {
    "ids": {
      "type": "document",
      "values": [
        "xCN4T1ABZRSj6lsB3p2IMTffv9-4ztzn1R11P_NwTTc"
      ]
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}

This should compare the body of the specified document with the rest of the index and find terms that are significant to the document but not common in the index.

Unfortunately, this invariably results in

ElasticsearchException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: UncheckedExecutionException[org.elasticsearch.common.breaker.CircuitBreakingException: Data too large, data would be larger than limit of [25741911654] bytes];
nested: CircuitBreakingException[Data too large, data would be larger than limit of [25741911654] bytes];

after a minute or two, which seems to imply I haven't got enough memory.

The Elasticsearch servers in question are actually VMs, so I shut down the other VMs and gave each Elasticsearch instance 96 GB and each OS another 96 GB.

The same problem occurred (with different numbers, and it took longer). I don't have hardware to hand with more than 192 GB of memory available, so I can't go higher.

Are aggregations not meant for use against the index as a whole? Am I making a mistake with regards to the query format?

Answer

There is a warning on the documentation for this aggregation about RAM use on free-text fields for very large indices [1]. On large indices it works OK for lower-cardinality fields with a smaller vocabulary (e.g. hashtags), but the combination of many free-text terms and many docs is a memory hog. You could look at specifying a filter on the loading of the FieldData cache [2] for the Body field to trim the long tail of low-frequency terms (e.g. doc frequency < 2), which would reduce RAM overheads.
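As a rough sketch of that FieldData filter (the index and type names here are placeholders, and this assumes a 1.x-era mapping where `Body` is an analysed `string` field), a frequency filter along these lines would keep low-frequency terms out of the cache:

```json
PUT /my_index/_mapping/document
{
  "properties": {
    "Body": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min": 2,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
```

Note that `min`/`max` values below 1.0 are interpreted as a fraction of the segment's documents, while larger values are absolute per-segment counts, and `min_segment_size` skips very small segments entirely. Fielddata filters only take effect when the field data is (re)loaded, so the cache needs clearing after a mapping change like this.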

I have used a variation of this algorithm before in which only a sample of the top-matching docs was analysed for significant terms. That approach requires less RAM, as only the top N docs are read from disk and tokenised (using TermVectors or an Analyzer). For now, however, the implementation in Elasticsearch relies on a FieldData cache and looks up terms for ALL matching docs.

One more thing: when you say you want to "compare the body of the document specified", note that the usual mode of operation is to compare a set of documents against the background, not just one. All analysis is based on doc-frequency counts, so with a sample set of just one doc every term will have a foreground frequency of 1, meaning you have less evidence to reinforce any analysis.
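To illustrate, the foreground set would more typically be defined by a query matching many documents rather than a single-ID lookup; in this sketch the `match` query and its search terms are hypothetical stand-ins:

```json
{
  "query": {
    "match": {
      "Body": "circuit breaker memory"
    }
  },
  "aggregations": {
    "Keywords": {
      "significant_terms": {
        "field": "Body"
      }
    }
  },
  "size": 0
}
```

With many matching docs in the foreground, terms that recur across the set stand out against the background index, which gives the significance scoring real doc-frequency evidence to work with.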
