Would ordering of documents when indexing improve Elasticsearch search performance?


Problem Description

I am indexing about 40 million documents into Elasticsearch. It is usually a one-time data load, after which we run queries on top of it; there are no further updates to the index itself. However, with Elasticsearch's default settings I am not getting the throughput I expected.

So, among a long list of things to tune and verify, I was wondering whether ordering by the business key would help improve search throughput. All of our analytical queries use this key, it is already indexed as a keyword, and we filter on it like this:

  {
    "query": {
      "bool": {
        "must": {
          "multi_match": {
            "type": "cross_fields",
            "query": "store related query",
            "minimum_should_match": "30%",
            "fields": ["field1^5", "field2^5", "field3^3", "field4^3", "firstLine", "field5", "field6", "field7"]
          }
        },
        "filter": {
          "term": { "businessKey": "storename" }
        }
      }
    }
  }

This query is run in a bulk fashion, about 20 million times over a few hours. Currently I cannot get past 21k/min, but that could be due to various factors. Any tips to improve performance for this sort of workflow (load once, search a lot) would be appreciated.

However, I am particularly interested to know whether I could order the data by the business key at indexing time, so that the data for a given businessKey lives within a single Lucene segment and lookups would therefore be quicker. Is that line of thinking correct? Is this something ES already does, given that it is a keyword term?
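For reference, Elasticsearch has an explicit index-sorting feature, which is the closest built-in counterpart to what is being asked here; a minimal sketch of that configuration (assuming Elasticsearch 7.x; the index name is a placeholder and businessKey is the keyword field from the question):

  PUT your-index
  {
    "settings": {
      "index": {
        "sort.field": "businessKey",
        "sort.order": "asc"
      }
    },
    "mappings": {
      "properties": {
        "businessKey": { "type": "keyword" }
      }
    }
  }

Note that index sorting controls how documents are ordered within each segment rather than which segment a document ends up in, and it can only be set when the index is created.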

Solution

This is a very good performance-optimization use case, and as you already mentioned, there is a whole list of optimizations you will need to work through.

I can see that you are already building the query correctly, i.e. filtering the records on businessKey first and then searching the remaining documents; this way you are already making use of Elasticsearch's filter cache.

As you have a huge number of documents (~40M), it does not make sense to put all of them into a single segment. The default maximum segment size is 5 GB, and beyond that the merge process stops merging segments, so it is almost impossible for you to end up with just one segment for your data.
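To see how the data is actually spread across segments per shard, the cat segments API gives a quick breakdown (a small sketch; your-index is the same placeholder used for the stats call below):

  GET _cat/segments/your-index?v&h=shard,segment,docs.count,size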

A couple of things you can do:

  1. Disable the refresh interval once you are done ingesting your data and are preparing the index for search (see the sketch after this list).
  2. Since you are using filters, the request cache should be getting used; monitor cache usage while you are querying and check how often results are served from the cache:

  GET your-index/_stats/request_cache?human 

  3. Read throughput is higher when you have more replicas; if you have spare nodes in your Elasticsearch cluster, make sure those nodes hold replicas of your ES index.
  4. Monitor the search queues on each node and make sure they are not getting exhausted, otherwise you will not be able to increase throughput; refer to the ES thread pool documentation for more info.
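Most of the points above can be exercised directly through the REST API; here is a minimal sketch of the calls involved (the values are illustrative, your-index is the same placeholder as above, and the snippets assume a recent 7.x cluster):

  # (1) Pause refreshes during the bulk load, then restore them once the index is ready to serve searches
  PUT your-index/_settings
  { "index": { "refresh_interval": "-1" } }

  PUT your-index/_settings
  { "index": { "refresh_interval": "1s" } }

  # (3) Add replicas so more nodes can serve reads (this needs enough data nodes to host them)
  PUT your-index/_settings
  { "index": { "number_of_replicas": 2 } }

  # (4) Watch the search thread pool on each node for queue build-up or rejections
  GET _cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed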

Your main issue is throughput: you want to go beyond the current limit of 21k/min, and that will require a fair amount of index- and cluster-configuration optimization as well. I have written up some short tips to improve search performance; please go through them and let me know how it goes.
