What is the best practice for fuzzy search (like '%aaa%' in MySQL) in Elasticsearch 6.8?


Problem Description


Background: I use MySQL and there are millions of rows, each with twenty columns. We have some complex searches, and some columns use fuzzy matching, such as username LIKE '%aaa%'. MySQL can't use an index for that unless the leading % is removed, but we need fuzzy matching to support searches like Stack Overflow's. I also checked the MySQL fulltext index, but it doesn't support a complex search within one SQL statement when another index is used.

My solution: add Elasticsearch as our search engine, insert data into both MySQL and ES, and search the data only in Elasticsearch.

I checked Elasticsearch fuzzy search. A wildcard query works, but many people advise against using * at the beginning of the pattern, since it makes the search very slow.

For example: username: 'John_Snow'

A wildcard query works but may be very slow:

GET /user/_search
{
  "query": {
    "wildcard": {
      "username": "*hn*"
    }
  }
}

match_phrase doesn't work; it seems to only match whole tokens, e.g. the phrase 'John Snow':

{
  "query": {
      "match_phrase":{
      "dbName": "hn"
      }
  }
}

My question: is there any better solution for a complex query that contains a fuzzy match like '%no%' or '%hn_Sn%'?

Solution

You can use the ngram tokenizer, which first breaks text down into words whenever it encounters one of a list of specified characters, and then emits N-grams of each word of the specified lengths.

Here is a working example with index mapping, index data, search queries, and results.

Index Mapping:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        },
        "max_ngram_diff": 50
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "my_analyzer",
                "search_analyzer": "standard"
            }
        }
    }
}
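
One detail worth noting: the mapping above uses the typeless (7.x-style) format. On Elasticsearch 6.8, mappings are by default still nested under a mapping type, so a 6.8-flavoured sketch of the same index creation (assuming the index is called test, the name that appears in the search result below, and the type _doc) would look roughly like this:

PUT /test
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "ngram",
                    "min_gram": 2,
                    "max_gram": 10,
                    "token_chars": [
                        "letter",
                        "digit"
                    ]
                }
            }
        },
        "max_ngram_diff": 50
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_analyzer",
                    "search_analyzer": "standard"
                }
            }
        }
    }
}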

Analyze API (run against the index, so that the custom analyzer defined in its settings is visible):

POST /test/_analyze

{
  "analyzer": "my_analyzer",
  "text": "John_Snow"
}

The tokens are:

{
    "tokens": [
        {
            "token": "Jo",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "Joh",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 1
        },
        {
            "token": "John",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 2
        },
        {
            "token": "oh",
            "start_offset": 1,
            "end_offset": 3,
            "type": "word",
            "position": 3
        },
        {
            "token": "ohn",
            "start_offset": 1,
            "end_offset": 4,
            "type": "word",
            "position": 4
        },
        {
            "token": "hn",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 5
        },
        {
            "token": "Sn",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 6
        },
        {
            "token": "Sno",
            "start_offset": 5,
            "end_offset": 8,
            "type": "word",
            "position": 7
        },
        {
            "token": "Snow",
            "start_offset": 5,
            "end_offset": 9,
            "type": "word",
            "position": 8
        },
        {
            "token": "no",
            "start_offset": 6,
            "end_offset": 8,
            "type": "word",
            "position": 9
        },
        {
            "token": "now",
            "start_offset": 6,
            "end_offset": 9,
            "type": "word",
            "position": 10
        },
        {
            "token": "ow",
            "start_offset": 7,
            "end_offset": 9,
            "type": "word",
            "position": 11
        }
    ]
}

Index Data:

{
  "title":"John_Snow"
}
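
For completeness, assuming the same test index and a document id of 1 (the _index and _id that appear in the search result below), the document can be indexed with a request along these lines:

PUT /test/_doc/1
{
    "title": "John_Snow"
}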

Search Query:

{
    "query": {
        "match" : {
            "title" : "hn"
        }
    }
}

Search Result:

"hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "title": "John_Snow"
                }
            }
        ]

Another search query:

{
    "query": {
        "match" : {
            "title" : "ohr"
        }
    }
}

The above search query returns no results, because "ohr" is not a contiguous substring of "John_Snow", so none of the indexed n-grams match it.
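
By contrast, a term that is a contiguous substring of the indexed value, such as no from the '%no%' example in the question, does match, because no appears in the n-gram token list above. A quick check against the same test index:

GET /test/_search
{
    "query": {
        "match": {
            "title": "no"
        }
    }
}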
