带短语匹配的 Edge NGram [英] Edge NGram with phrase matching

查看:24
本文介绍了带短语匹配的 Edge NGram的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要自动完成短语.例如,当我搜索老年痴呆症"时,我想得到老年痴呆症的痴呆症".

I need to autocomplete phrases. For example, when I search "dementia in alz", I want to get "dementia in alzheimer's".

为此,我配置了 Edge NGram tokenizer.我尝试了 edge_ngram_analyzerstandard 作为查询正文中的分析器.然而,当我试图匹配一个短语时,我无法得到结果.

For this, I configured Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I can't get results when I'm trying to match a phrase.

我做错了什么?

我的查询:

{
  "query":{
    "multi_match":{
      "query":"dementia in alz",
      "type":"phrase",
      "analyzer":"edge_ngram_analyzer",
      "fields":["_all"]
    }
  }
}

我的映射:

...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
  ...

我的文档:

{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfHfBE5CzEm8aJ3Xp", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:48.665Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1400541", 
    "Diagnosis": "F00.0 -  Dementia in Alzheimer's disease with early onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}, 
{
  "_score": 1.1152233, 
  "_type": "Diagnosis", 
  "_id": "AVZLfICrE5CzEm8aJ4Dc", 
  "_source": {
    "@timestamp": "2016-08-02T13:40:51.240Z", 
    "type": "Diagnosis", 
    "Document_ID": "Diagnosis_1424351", 
    "Diagnosis": "F00.1 -  Dementia in Alzheimer's disease with late onset", 
    "@version": "1", 
  }, 
  "_index": "carenotes"
}

分析老年痴呆症"短语:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "de", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 4, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 0, 
      "position": 2
    }, 
    {
      "end_offset": 5, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 0, 
      "position": 3
    }, 
    {
      "end_offset": 6, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 0, 
      "position": 4
    }, 
    {
      "end_offset": 7, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 0, 
      "position": 5
    }, 
    {
      "end_offset": 8, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 0, 
      "position": 6
    }, 
    {
      "end_offset": 9, 
      "token": "dementia ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 7
    }, 
    {
      "end_offset": 10, 
      "token": "dementia i", 
      "type": "word", 
      "start_offset": 0, 
      "position": 8
    }, 
    {
      "end_offset": 11, 
      "token": "dementia in", 
      "type": "word", 
      "start_offset": 0, 
      "position": 9
    }, 
    {
      "end_offset": 12, 
      "token": "dementia in ", 
      "type": "word", 
      "start_offset": 0, 
      "position": 10
    }, 
    {
      "end_offset": 13, 
      "token": "dementia in a", 
      "type": "word", 
      "start_offset": 0, 
      "position": 11
    }, 
    {
      "end_offset": 14, 
      "token": "dementia in al", 
      "type": "word", 
      "start_offset": 0, 
      "position": 12
    }, 
    {
      "end_offset": 15, 
      "token": "dementia in alz", 
      "type": "word", 
      "start_offset": 0, 
      "position": 13
    }, 
    {
      "end_offset": 16, 
      "token": "dementia in alzh", 
      "type": "word", 
      "start_offset": 0, 
      "position": 14
    }, 
    {
      "end_offset": 17, 
      "token": "dementia in alzhe", 
      "type": "word", 
      "start_offset": 0, 
      "position": 15
    }, 
    {
      "end_offset": 18, 
      "token": "dementia in alzhei", 
      "type": "word", 
      "start_offset": 0, 
      "position": 16
    }, 
    {
      "end_offset": 19, 
      "token": "dementia in alzheim", 
      "type": "word", 
      "start_offset": 0, 
      "position": 17
    }, 
    {
      "end_offset": 20, 
      "token": "dementia in alzheime", 
      "type": "word", 
      "start_offset": 0, 
      "position": 18
    }, 
    {
      "end_offset": 21, 
      "token": "dementia in alzheimer", 
      "type": "word", 
      "start_offset": 0, 
      "position": 19
    }
  ]
}

推荐答案

非常感谢 rendel 帮助我找到正确的解决方案!

Many thanks to rendel who helped me to find the right solution!

Andrei Stefan 的解决方案不是最优的.

为什么?首先,搜索分析器中没有小写过滤器,使得搜索不方便;大小写必须严格匹配.需要一个带有 lowercase 过滤器的自定义分析器,而不是 "analyzer": "keyword".

Why? First, the absence of the lowercase filter in the search analyzer makes search inconvenient; the case must be matched strictly. A custom analyzer with lowercase filter is needed instead of "analyzer": "keyword".

其次,分析部分有误!在索引期间,edge_ngram_analyzer 会分析字符串F00.0 - 早发性阿尔茨海默病痴呆".使用这个分析器,我们有以下字典数组作为分析的字符串:

Second, the analysis part is wrong! During index time a string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by edge_ngram_analyzer. With this analyzer, we have the following array of dictionaries as the analyzed string:

{
  "tokens": [
    {
      "end_offset": 2, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 3, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 1
    }, 
    {
      "end_offset": 6, 
      "token": "0 ", 
      "type": "word", 
      "start_offset": 4, 
      "position": 2
    }, 
    {
      "end_offset": 9, 
      "token": "  ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 3
    }, 
    {
      "end_offset": 10, 
      "token": "  d", 
      "type": "word", 
      "start_offset": 7, 
      "position": 4
    }, 
    {
      "end_offset": 11, 
      "token": "  de", 
      "type": "word", 
      "start_offset": 7, 
      "position": 5
    }, 
    {
      "end_offset": 12, 
      "token": "  dem", 
      "type": "word", 
      "start_offset": 7, 
      "position": 6
    }, 
    {
      "end_offset": 13, 
      "token": "  deme", 
      "type": "word", 
      "start_offset": 7, 
      "position": 7
    }, 
    {
      "end_offset": 14, 
      "token": "  demen", 
      "type": "word", 
      "start_offset": 7, 
      "position": 8
    }, 
    {
      "end_offset": 15, 
      "token": "  dement", 
      "type": "word", 
      "start_offset": 7, 
      "position": 9
    }, 
    {
      "end_offset": 16, 
      "token": "  dementi", 
      "type": "word", 
      "start_offset": 7, 
      "position": 10
    }, 
    {
      "end_offset": 17, 
      "token": "  dementia", 
      "type": "word", 
      "start_offset": 7, 
      "position": 11
    }, 
    {
      "end_offset": 18, 
      "token": "  dementia ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 12
    }, 
    {
      "end_offset": 19, 
      "token": "  dementia i", 
      "type": "word", 
      "start_offset": 7, 
      "position": 13
    }, 
    {
      "end_offset": 20, 
      "token": "  dementia in", 
      "type": "word", 
      "start_offset": 7, 
      "position": 14
    }, 
    {
      "end_offset": 21, 
      "token": "  dementia in ", 
      "type": "word", 
      "start_offset": 7, 
      "position": 15
    }, 
    {
      "end_offset": 22, 
      "token": "  dementia in a", 
      "type": "word", 
      "start_offset": 7, 
      "position": 16
    }, 
    {
      "end_offset": 23, 
      "token": "  dementia in al", 
      "type": "word", 
      "start_offset": 7, 
      "position": 17
    }, 
    {
      "end_offset": 24, 
      "token": "  dementia in alz", 
      "type": "word", 
      "start_offset": 7, 
      "position": 18
    }, 
    {
      "end_offset": 25, 
      "token": "  dementia in alzh", 
      "type": "word", 
      "start_offset": 7, 
      "position": 19
    }, 
    {
      "end_offset": 26, 
      "token": "  dementia in alzhe", 
      "type": "word", 
      "start_offset": 7, 
      "position": 20
    }, 
    {
      "end_offset": 27, 
      "token": "  dementia in alzhei", 
      "type": "word", 
      "start_offset": 7, 
      "position": 21
    }, 
    {
      "end_offset": 28, 
      "token": "  dementia in alzheim", 
      "type": "word", 
      "start_offset": 7, 
      "position": 22
    }, 
    {
      "end_offset": 29, 
      "token": "  dementia in alzheime", 
      "type": "word", 
      "start_offset": 7, 
      "position": 23
    }, 
    {
      "end_offset": 30, 
      "token": "  dementia in alzheimer", 
      "type": "word", 
      "start_offset": 7, 
      "position": 24
    }, 
    {
      "end_offset": 33, 
      "token": "s ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 25
    }, 
    {
      "end_offset": 34, 
      "token": "s d", 
      "type": "word", 
      "start_offset": 31, 
      "position": 26
    }, 
    {
      "end_offset": 35, 
      "token": "s di", 
      "type": "word", 
      "start_offset": 31, 
      "position": 27
    }, 
    {
      "end_offset": 36, 
      "token": "s dis", 
      "type": "word", 
      "start_offset": 31, 
      "position": 28
    }, 
    {
      "end_offset": 37, 
      "token": "s dise", 
      "type": "word", 
      "start_offset": 31, 
      "position": 29
    }, 
    {
      "end_offset": 38, 
      "token": "s disea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 30
    }, 
    {
      "end_offset": 39, 
      "token": "s diseas", 
      "type": "word", 
      "start_offset": 31, 
      "position": 31
    }, 
    {
      "end_offset": 40, 
      "token": "s disease", 
      "type": "word", 
      "start_offset": 31, 
      "position": 32
    }, 
    {
      "end_offset": 41, 
      "token": "s disease ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 33
    }, 
    {
      "end_offset": 42, 
      "token": "s disease w", 
      "type": "word", 
      "start_offset": 31, 
      "position": 34
    }, 
    {
      "end_offset": 43, 
      "token": "s disease wi", 
      "type": "word", 
      "start_offset": 31, 
      "position": 35
    }, 
    {
      "end_offset": 44, 
      "token": "s disease wit", 
      "type": "word", 
      "start_offset": 31, 
      "position": 36
    }, 
    {
      "end_offset": 45, 
      "token": "s disease with", 
      "type": "word", 
      "start_offset": 31, 
      "position": 37
    }, 
    {
      "end_offset": 46, 
      "token": "s disease with ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 38
    }, 
    {
      "end_offset": 47, 
      "token": "s disease with e", 
      "type": "word", 
      "start_offset": 31, 
      "position": 39
    }, 
    {
      "end_offset": 48, 
      "token": "s disease with ea", 
      "type": "word", 
      "start_offset": 31, 
      "position": 40
    }, 
    {
      "end_offset": 49, 
      "token": "s disease with ear", 
      "type": "word", 
      "start_offset": 31, 
      "position": 41
    }, 
    {
      "end_offset": 50, 
      "token": "s disease with earl", 
      "type": "word", 
      "start_offset": 31, 
      "position": 42
    }, 
    {
      "end_offset": 51, 
      "token": "s disease with early", 
      "type": "word", 
      "start_offset": 31, 
      "position": 43
    }, 
    {
      "end_offset": 52, 
      "token": "s disease with early ", 
      "type": "word", 
      "start_offset": 31, 
      "position": 44
    }, 
    {
      "end_offset": 53, 
      "token": "s disease with early o", 
      "type": "word", 
      "start_offset": 31, 
      "position": 45
    }, 
    {
      "end_offset": 54, 
      "token": "s disease with early on", 
      "type": "word", 
      "start_offset": 31, 
      "position": 46
    }, 
    {
      "end_offset": 55, 
      "token": "s disease with early ons", 
      "type": "word", 
      "start_offset": 31, 
      "position": 47
    }, 
    {
      "end_offset": 56, 
      "token": "s disease with early onse", 
      "type": "word", 
      "start_offset": 31, 
      "position": 48
    }
  ]
}

如您所见,整个字符串的标记大小为 2 到 25 个字符.字符串以线性方式与所有空格和位置一起被标记为每个新标记加一.

As you can see, the whole string tokenized with token size from 2 to 25 characters. The string is tokenized in a linear way together with all spaces and position incremented by one for every new token.

它有几个问题:

  1. edge_ngram_analyzer 产生了永远不会被搜索的无用标记,例如:0"、"、d", "sd", "s disease w" 等.
  2. 此外,它没有产生很多可以使用的有用的标记,例如:疾病"、early onset" 等.如果您尝试搜索这些词中的任何一个,则会有 0 个结果.
  3. 注意,最后一个标记是早期发病".最后的t"在哪里?由于 "max_gram" : "25" 我们丢失"了所有字段中的一些文本.您无法再搜索此文本,因为它没有标记.
  4. trim 过滤器只会混淆过滤额外空格的问题,当它可以由分词器完成时.
  5. edge_ngram_analyzer 会增加每个标记的位置,这对于位置查询(例如短语查询)是有问题的.应该使用 edge_ngram_filter 代替,它会在生成 ngram 时保留标记的位置.
  1. The edge_ngram_analyzer produced unuseful tokens which will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w" etc.
  2. Also, it didn't produce a lot of useful tokens that could be used, for example: "disease", "early onset" etc. There will be 0 results if you try to search for any of these words.
  3. Notice, the last token is "s disease with early onse". Where is the final "t"? Because of the "max_gram" : "25" we "lost" some text in all fields. You can't search for this text anymore because there are no tokens for it.
  4. The trim filter only obfuscates the problem filtering extra spaces when it could be done by a tokenizer.
  5. The edge_ngram_analyzer increments the position of each token which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead that will preserve the position of the token when generating the ngrams.

最优解.

要使用的映射设置:

The optimal solution.

The mappings settings to use:

...
"mappings": {
    "Type": {
       "_all":{
          "analyzer": "edge_ngram_analyzer", 
          "search_analyzer": "keyword_analyzer"
        }, 
        "properties": {
          "Field": {
            "search_analyzer": "keyword_analyzer",
             "type": "string",
             "analyzer": "edge_ngram_analyzer"
          },
...
...
"settings": {
   "analysis": {
      "filter": {
         "english_poss_stemmer": {
            "type": "stemmer",
            "name": "possessive_english"
         },
         "edge_ngram": {
           "type": "edgeNGram",
           "min_gram": "2",
           "max_gram": "25",
           "token_chars": ["letter", "digit"]
         }
      },
      "analyzer": {
         "edge_ngram_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
           "tokenizer": "standard"
         },
         "keyword_analyzer": {
           "filter": ["lowercase", "english_poss_stemmer"],
           "tokenizer": "standard"
         }
      }
   }
}
...

看分析:

{
  "tokens": [
    {
      "end_offset": 5, 
      "token": "f0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 5, 
      "token": "f00.0", 
      "type": "word", 
      "start_offset": 0, 
      "position": 0
    }, 
    {
      "end_offset": 17, 
      "token": "de", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dem", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "deme", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "demen", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dement", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementi", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 17, 
      "token": "dementia", 
      "type": "word", 
      "start_offset": 9, 
      "position": 2
    }, 
    {
      "end_offset": 20, 
      "token": "in", 
      "type": "word", 
      "start_offset": 18, 
      "position": 3
    }, 
    {
      "end_offset": 32, 
      "token": "al", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alz", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzh", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhe", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzhei", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheim", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheime", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 32, 
      "token": "alzheimer", 
      "type": "word", 
      "start_offset": 21, 
      "position": 4
    }, 
    {
      "end_offset": 40, 
      "token": "di", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dis", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "dise", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disea", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "diseas", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 40, 
      "token": "disease", 
      "type": "word", 
      "start_offset": 33, 
      "position": 5
    }, 
    {
      "end_offset": 45, 
      "token": "wi", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "wit", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 45, 
      "token": "with", 
      "type": "word", 
      "start_offset": 41, 
      "position": 6
    }, 
    {
      "end_offset": 51, 
      "token": "ea", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "ear", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "earl", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 51, 
      "token": "early", 
      "type": "word", 
      "start_offset": 46, 
      "position": 7
    }, 
    {
      "end_offset": 57, 
      "token": "on", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "ons", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onse", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }, 
    {
      "end_offset": 57, 
      "token": "onset", 
      "type": "word", 
      "start_offset": 52, 
      "position": 8
    }
  ]
}

在索引时,文本由 standard 分词器分词,然后单独的词由 lowercasepossessive_englishedge_ngram 过滤器.令牌仅针对单词生成.在搜索时,文本由 standard 分词器分词,然后单独的词由 lowercasepossessive_english 过滤.搜索到的词与索引时间内创建的标记进行匹配.

On index time a text is tokenized by standard tokenizer, then separate words are filtered by lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words. On search time a text is tokenized by standard tokenizer, then separate words are filtered by lowercase and possessive_english. The searched words are matched against the tokens which had been created during the index time.

因此我们使增量搜索成为可能!

Thus we make the incremental search possible!

现在,因为我们对单独的单词进行 ngram,我们甚至可以执行像

Now, because we do ngram on separate words, we can even execute queries like

{
  'query': {
    'multi_match': {
      'query': 'dem in alzh',  
      'type': 'phrase', 
      'fields': ['_all']
    }
  }
}

并获得正确的结果.

没有文本丢失",一切都是可搜索的,并且不再需要通过 trim 过滤器处理空格.

No text is "lost", everything is searchable and there is no need to deal with spaces by trim filter anymore.

这篇关于带短语匹配的 Edge NGram的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆