ElasticSearch search performance


Problem description


We have a two-node cluster (VMs in a private cloud, 64GB of RAM, 8-core CPU per node, CentOS), a few small indices (~1 mil documents) and one big index with ~220 mil docs (2 shards, 170GB of space). 24GB of memory is allocated to Elasticsearch on each box.

Document structure:

{
    'article_id': {
        'index': 'not_analyzed',
        'store': 'yes',
        'type': 'long'
    },
    'feed_id': {
        'index': 'not_analyzed',
        'store': 'yes',
        'type': 'string'
    },
    'title': {
        'index': 'analyzed',
        'type': 'string'
    },
    'content': {
        'index': 'analyzed',
        'type': 'string'
    },
    'lang': {
        'index': 'not_analyzed',
        'type': 'string'
    }
}

It takes about 1-2 seconds to run the following query:

{
    "query" : {
        "multi_match" : {
            "query" : "some search term",
            "fields" : [ "title", "content" ],
            "type": "phrase_prefix"
        }
    },
    "size": 20,
    "fields" :["article_id", "feed_id"]
}

Are we hitting hardware limits at this point or are there ways to optimize the query or data structure to increase performance?

Thanks in advance!

Solution

It's possible you are hitting the limits of your hardware, but there are a few things you can do to your query first to help optimize it.

Max Expansions

The first thing I would do is limit max_expansions. The way prefix queries work is by generating a list of prefixes that match the last token in your query. In your search query "some search term", the last token "term" would be expanded using "term" as the prefix seed. You may generate a list like this:

  • term
  • terms
  • terminate
  • terminator
  • termite

The prefix expansion process runs through your posting list looking for any word which matches the seed prefix. By default, this list is unbounded, which means you can generate a very large list of expansions.

The second phase rewrites your original query into a series of term queries using the expansions. The bigger the expansion list, the more terms are evaluated against your index, with a corresponding decrease in speed.

If you limit the expansion process to something reasonable, you can maintain speed and still usually get good prefix matching:

{
    "query" : {
        "multi_match" : {
            "query" : "some search term",
            "fields" : [ "title", "content" ],
            "type": "phrase_prefix",
            "max_expansions" : 100
        }
    },
    "size": 20,
    "fields" :["article_id", "feed_id"],

}

You'll have to play with how many expansions you want. It is a tradeoff between speed and recall.

Filtering

In general, the other thing you can add is filtering. If there is some kind of criteria you can filter on, you can potentially drastically improve speed. Currently, your query is executing against the entire index (250m documents), which is a lot to evaluate. If you can add a filter that cuts that number down, you will see much improved latency.

At the end of the day, the fewer documents the query evaluates, the faster it will run. Filters decrease the number of docs that a query will see, are cached, operate very quickly, and so on.

Your situation may not have any applicable filters, but if it does, they can really help!
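
For example, the mapping above has a not_analyzed lang field. If a given search only needs results in one language, a term filter on that field would cut the candidate set down, and term filters are cached by default. A sketch using the 1.x-era filtered query syntax (the "en" value is an assumption; the question doesn't say what lang contains):

{
    "query" : {
        "filtered" : {
            "query" : {
                "multi_match" : {
                    "query" : "some search term",
                    "fields" : [ "title", "content" ],
                    "type" : "phrase_prefix",
                    "max_expansions" : 100
                }
            },
            "filter" : {
                "term" : { "lang" : "en" }
            }
        }
    },
    "size" : 20,
    "fields" : ["article_id", "feed_id"]
}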

File System Caching

This advice is entirely dependent on the rest of the system. If you aren't fully utilizing your heap (24gb) because you are doing simple search and filtering (e.g. not faceting / geo / heavy sorts / scripts) you may be able to reallocate your heap to the file system cache.

For example, if your max heap usage peaks at 12gb, it may make sense to decrease the heap size down to 15gb. The extra 9gb that you freed will go back to the OS and help cache segments, which will boost search performance simply because more operations are diskless.
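
A sketch of that change, assuming a package-installed Elasticsearch 1.x on CentOS where the heap is controlled through the ES_HEAP_SIZE environment variable (the exact file location varies by install method):

# /etc/sysconfig/elasticsearch -- assumed location for a CentOS package install
# Drop the heap from 24g to 15g; the freed memory returns to the OS page cache.
ES_HEAP_SIZE=15g

After restarting the node, it's worth watching heap usage for a while (for example via GET /_nodes/stats/jvm) to confirm the smaller heap still has comfortable headroom before rolling the change out to the second node.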
