Elasticsearch ngram tokenizer returns all results regardless of query input


Question

I am trying to build a query to search for records in the following format: TR000002_1_2020.

Users should be able to search for results in any of the following ways:

TR000002, 2_1_2020, TR000002_1_2020, or 2020. I figured an ngram tokenizer would best suit my needs. I am using Elasticsearch 6.8, so I cannot use the built-in Search-As-You-Type introduced in Elasticsearch 7.

Here is my implementation, which follows the example in the docs. The only thing I changed was EdgeNGram -> NGram, since the user can search from any point in the text.

My analysis block looks like this:

.Analysis(a => a
    .Analyzers(aa => aa
        .Custom("autocomplete", ca => ca
            .Tokenizer("autocomplete")
            .Filters(new string[] {
                "lowercase"
            })
        )
        .Custom("autocomplete_search", ca => ca
            .Tokenizer("lowercase")
        )
    )
    .Tokenizers(t => t
        .NGram("autocomplete", e => e
            .MinGram(2)
            .MaxGram(16)
            .TokenChars(new TokenChar[] {
                TokenChar.Letter,
                TokenChar.Digit,
                TokenChar.Punctuation,
                TokenChar.Symbol
            })
        )
    )
)

Then, in my mapping, I define:

.Text(t => t
    .Name(tr => tr.TestRecordId)
    .Analyzer("autocomplete")
    .SearchAnalyzer("autocomplete_search")
)

When I search for TR000002, my query returns all results instead of only the records that contain those specific characters. What am I doing wrong? Is there a better tokenizer for this use case? Thanks!

Here is a sample of what is returned:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 27,
    "max_score" : 0.105360515,
    "hits" : [
      {
        "_index" : "test-records-development-09-09-2020-02-00-00",
        "_type" : "testrecorddto",
        "_id" : "3",
        "_score" : 0.105360515,
        "_source" : {
          "id" : 3,
          "testRecordId" : "TR000002_1_2020",
          "type" : 0,
          "typeName" : "TIDCo60",
          "missionId" : 1,
          "mission" : {
            "missionId" : 1,
            "name" : "[REDACTED]",
            "mRPLUsername" : "[REDACTED]",
            "missionRadiationPartsLead" : {
              "username" : "[REDACTED]",
              "displayName" : "[REDACTED]"
            },
            "missionInstruments" : [
              {
                "missionId" : 1,
                "instrumentId" : 1,
                "cognizantEngineerUsername" : "[REDACTED]",
                "instrument" : {
                  "intstrumentId" : 1,
                  "name" : "Instrument"
                },
                "cognizantEngineer" : {
                  "username" : "[REDACTED]",
                  "displayName" : "[REDACTED]"
                }
              },
              {
                "missionId" : 1,
                "instrumentId" : 2,
                "instrument" : {
                  "intstrumentId" : 2,
                  "name" : "Instrument 2"
                }
              }
            ]
          },
          "procurementPartId" : 2,
          "procurementPart" : {
            "procurementPartId" : 2,
            "partNumber" : "procurement part",
            "part" : {
              "partId" : 1,
              "manufacturer" : "Texas Instruments",
              "genericPartNumber" : "123",
              "description" : "description",
              "partTechnology" : "Part Tech"
            }
          },
          "testStatusId" : 12,
          "testStatus" : {
            "testStatusId" : 12,
            "name" : "Complete: Postponed Until Further Notice"
          },
          "discriminator" : "SingleEventEffectsRecord",
          "testRecordServiceOrders" : [
            {
              "testRecordId" : 3,
              "serviceOrderId" : 9,
              "serviceOrder" : {
                "serviceOrderId" : 9,
                "serviceOrderNumber" : "105702"
              }
            }
          ],
          "rtdbFiles" : [ ],
          "personnelGroups" : [
            {
              "personnelGroupUsers" : [ ]
            },
            {
              "personnelGroupUsers" : [ ]
            }
          ],
          "testRecordTestSubTypes" : [ ],
          "testRecordTestFacilityConditions" : [ ],
          "testRecordFollowers" : [ ],
          "isDeleted" : false,
          "sEETestRates" : [ ]
        }
      },
      {
        "_index" : "test-records-development-09-09-2020-02-00-00",
        "_type" : "testrecorddto",
        "_id" : "11",
        "_score" : 0.105360515,
        "_source" : {
          "id" : 11,
          "testRecordId" : "TR000011_1_2020",
          "type" : 0,
          "typeName" : "TIDCo60",
          "missionId" : 1,
          "mission" : {
            "missionId" : 1,
            "name" : "[REDACTED]",
            "mRPLUsername" : "[REDACTED]",
            "missionRadiationPartsLead" : {
              "username" : "[REDACTED]",
              "displayName" : "[REDACTED]"
            },
            "missionInstruments" : [
              {
                "missionId" : 1,
                "instrumentId" : 1,
                "cognizantEngineerUsername" : "[REDACTED]",
                "instrument" : {
                  "intstrumentId" : 1,
                  "name" : "Instrument"
                },
                "cognizantEngineer" : {
                  "username" : "[REDACTED]",
                  "displayName" : "[REDACTED]"
                }
              },
              {
                "missionId" : 1,
                "instrumentId" : 2,
                "instrument" : {
                  "intstrumentId" : 2,
                  "name" : "Instrument 2"
                }
              }
            ]
          },
          "procurementPartId" : 2,
          "procurementPart" : {
            "procurementPartId" : 2,
            "partNumber" : "procurement part",
            "part" : {
              "partId" : 1,
              "manufacturer" : "Texas Instruments",
              "genericPartNumber" : "123",
              "description" : "description",
              "partTechnology" : "Part Tech"
            }
          },
          "testStatusId" : 1,
          "testStatus" : {
            "testStatusId" : 1,
            "name" : "Active"
          },
          "discriminator" : "TotalIonizingDoseRecord",
          "creatorUsername" : "[REDACTED]",
          "creator" : {
            "username" : "[REDACTED]",
            "displayName" : "[REDACTED]"
          },
          "testRecordServiceOrders" : [ ],
          "partLDC" : "12",
          "waferLot" : "1",
          "rtdbFiles" : [ ],
          "personnelGroups" : [
            {
              "personnelGroupUsers" : [ ]
            }
          ],
          "testRecordTestSubTypes" : [ ],
          "testRecordTestFacilityConditions" : [ ],
          "testRecordFollowers" : [ ],
          "isDeleted" : false,
          "testStartDate" : "2020-07-30T00:00:00",
          "actualCompletionDate" : "2020-07-31T00:00:00"
        }
      },
      {
        "_index" : "test-records-development-09-09-2020-02-00-00",
        "_type" : "testrecorddto",
        "_id" : "17",
        "_score" : 0.105360515,
        "_source" : {
          "id" : 17,
          "testRecordId" : "TR000017_1_2020",
          "type" : 0,
          "typeName" : "TIDCo60",
          "missionId" : 1,
          "mission" : {
            "missionId" : 1,
            "name" : "[REDACTED]",
            "mRPLUsername" : "[REDACTED]",
            "missionRadiationPartsLead" : {
              "username" : "[REDACTED]",
              "displayName" : "[REDACTED]"
            },
            "missionInstruments" : [
              {
                "missionId" : 1,
                "instrumentId" : 1,
                "cognizantEngineerUsername" : "[REDACTED]",
                "instrument" : {
                  "intstrumentId" : 1,
                  "name" : "Instrument"
                },
                "cognizantEngineer" : {
                  "username" : "lewallen",
                  "displayName" : "[REDACTED]"
                }
              },
              {
                "missionId" : 1,
                "instrumentId" : 2,
                "instrument" : {
                  "intstrumentId" : 2,
                  "name" : "Instrument 2"
                }
              }
            ]
          },
          "procurementPartId" : 2,
          "procurementPart" : {
            "procurementPartId" : 2,
            "partNumber" : "procurement part",
            "part" : {
              "partId" : 1,
              "manufacturer" : "Texas Instruments",
              "genericPartNumber" : "123",
              "description" : "description",
              "partTechnology" : "Part Tech"
            }
          },
          "testStatusId" : 1,
          "testStatus" : {
            "testStatusId" : 1,
            "name" : "Active"
          },
          "discriminator" : "TotalIonizingDoseRecord",
          "creatorUsername" : "[REDACTED]",
          "creator" : {
            "username" : "[REDACTED]",
            "displayName" : "[REDACTED]"
          },
          "testRecordServiceOrders" : [ ],
          "rtdbFiles" : [ ],
          "personnelGroups" : [
            {
              "personnelGroupUsers" : [ ]
            }
          ],
          "testRecordTestSubTypes" : [ ],
          "testRecordTestFacilityConditions" : [ ],
          "testRecordFollowers" : [ ],
          "isDeleted" : false
        }
      },

And here is the mapping as shown:

"testRecordId" : {
  "type" : "text",
  "analyzer" : "autocomplete",
  "search_analyzer" : "autocomplete_search"
},

I should also mention that I have been testing this query in the console like so:

GET test-records-development/_search
{
  "query": {
    "match": {
      "testRecordId": {
        "query": "TR000002_1_2020"
      }
    }
  }
}

EDIT 2: Added the API response from the index _settings endpoint:

{
  "test-records-development-09-09-2020-02-00-00" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "5",
        "provided_name" : "test-records-development-09-09-2020-02-00-00",
        "creation_date" : "1599617013874",
        "analysis" : {
          "analyzer" : {
            "autocomplete" : {
              "filter" : [
                "lowercase"
              ],
              "type" : "custom",
              "tokenizer" : "autocomplete"
            },
            "autocomplete_search" : {
              "type" : "custom",
              "tokenizer" : "lowercase"
            }
          },
          "tokenizer" : {
            "autocomplete" : {
              "token_chars" : [
                "letter",
                "digit",
                "punctuation",
                "symbol"
              ],
              "min_gram" : "2",
              "type" : "ngram",
              "max_gram" : "16"
            }
          }
        },
        "number_of_replicas" : "0",
        "uuid" : "FSeCa0YwRCOJVbjfxYGkig",
        "version" : {
          "created" : "6080199"
        }
      }
    }
  }
}

Recommended answer

As I don't have access to the analyzer settings in JSON format, I can't confirm it, but the most likely cause is your search analyzer, autocomplete_search: it produces search-time tokens that also match the index-time tokens of other documents.

For example, suppose you search for TR000002_1_2020. If the search analyzer produces 2020 as a token, and the document containing TR000011_1_2020 also has a 2020 token in the index, then your query will match that document too.
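To make the token overlap concrete, here is a rough pure-Python approximation of the two analyzers listed in the _settings output from the question. It is only a sketch: `lowercase_tokenizer` and `ngram_analyzer` are hypothetical stand-ins written for illustration, not Elasticsearch APIs, and the character classes are simplified. It assumes the `lowercase` tokenizer splits on every non-letter and lowercases, and that the `ngram` tokenizer treats letters, digits, punctuation, and symbols as token characters.

```python
import unicodedata


def lowercase_tokenizer(text):
    """Approximate the `lowercase` tokenizer: split on non-letters, then lowercase."""
    tokens, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch.lower())
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def is_token_char(ch):
    """Simplified token_chars check: letter, digit, punctuation, or symbol."""
    return ch.isalpha() or ch.isdigit() or unicodedata.category(ch)[0] in ("P", "S")


def ngram_analyzer(text, min_gram=2, max_gram=16):
    """Approximate the `autocomplete` analyzer: ngram tokenizer + lowercase filter."""
    grams, chunk = set(), []
    for ch in text + " ":  # trailing space flushes the last chunk
        if is_token_char(ch):
            chunk.append(ch)
            continue
        word = "".join(chunk).lower()
        for n in range(min_gram, max_gram + 1):
            for i in range(len(word) - n + 1):
                grams.add(word[i:i + n])
        chunk = []
    return grams


# Search-time tokens for the query used in the question.
query_tokens = lowercase_tokenizer("TR000002_1_2020")
print(query_tokens)  # ['tr']

# Which query tokens also exist among each document's index-time tokens?
for doc in ["TR000002_1_2020", "TR000011_1_2020", "TR000017_1_2020"]:
    index_tokens = ngram_analyzer(doc)
    print(doc, [t for t in query_tokens if t in index_tokens])
```

If this approximation holds, the query is reduced to the single token tr, which every record ID beginning with TR also has in the index. That would be consistent with all hits in the sample response sharing the same _score. The Analyze API on the real index remains the authoritative check.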

You can use the Analyze API to check which tokens an analyzer generates. As mentioned earlier, most likely some search-time tokens match index-time tokens, as in the example above.
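As a minimal sketch of that check, the following Python builds the two `_analyze` requests, one per analyzer, using only the standard library. The host `http://localhost:9200` is an assumption, the index name is the one used in the question's console query, and the helpers `build_analyze_request` and `run_analyze` are hypothetical names introduced here. The `analyzer` and `text` body fields are the Analyze API's own parameters.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"    # assumption: adjust to your cluster
INDEX = "test-records-development"  # index name from the question's console query


def build_analyze_request(analyzer, text, index=INDEX, base_url=ES_URL):
    """Build a POST <index>/_analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_analyze(request):
    """Send the request and return only the token strings (needs a live cluster)."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]


# Index-time tokens of a document that should NOT match, and the query's tokens.
index_req = build_analyze_request("autocomplete", "TR000011_1_2020")
search_req = build_analyze_request("autocomplete_search", "TR000002_1_2020")

print(search_req.get_method(), search_req.full_url)
print(search_req.data.decode("utf-8"))
```

Calling `run_analyze(index_req)` and `run_analyze(search_req)` against a running cluster and intersecting the two lists would show which tokens cause the unwanted match.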
