Elasticsearch custom analyzer with ngram and without word delimiter on hyphens


Question

I am trying to index strings that contain hyphens but do not contain spaces, periods, or any other punctuation. I do not want to split the words on hyphens; instead, I would like the hyphens to be part of the indexed text.

For example, my 6 text strings would be:

  • magazineplayon
  • magazineofhorses
  • online-magazine
  • best-magazine
  • friend-of-magazines
  • magazineplaygames

I would like to be able to search these strings for text containing "play" or for text starting with "magazine".

I have been able to use ngram to make the search for text containing "play" work properly. However, the hyphen causes the text to be split, so the results include strings where "magazine" is the word after a hyphen. I only want strings that begin with "magazine" to appear.

Based on the sample above, only these 3 should appear when searching for strings beginning with "magazine":

  • magazineplayon
  • magazineofhorses
  • magazineplaygames

Please help with my Elasticsearch index sample:

DELETE /sample

PUT /sample
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "word_delimiter_filter": {
          "type": "word_delimiter",
          "preserve_original": true,
          "catenate_all": true
        }
      },
      "analyzer": {
        "ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": ["nGram_filter", "word_delimiter_filter"]
        }
      }
    }
  }
}
PUT /sample/1/_create
{
    "name" : "magazineplayon"
}
PUT /sample/3/_create
{
    "name" : "magazineofhorses"
}
PUT /sample/4/_create
{
    "name" : "online-magazine"
}
PUT /sample/5/_create
{
    "name" : "best-magazine"
}
PUT /sample/6/_create
{
    "name" : "friend-of-magazines"
}
PUT /sample/7/_create
{
    "name" : "magazineplaygames"
}

GET /sample/_search
{
  "query": {
    "wildcard": {
      "name": "*play*"
    }
  }
}

GET /sample/_search
{
  "query": {
    "wildcard": {
      "name": "magazine*"
    }
  }
}

Update 1: I updated all my create statements to use the test type after sample:

PUT /sample/test/7/_create
{
    "name" : "magazinefairplay"
}

I then ran the following command, instead of the wildcard search, to return only names that contain the word "play". This worked correctly and returned only two records.

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.substrings": "play" }}
            ]
        }
    }
}

I ran the following command to return only names that start with "magazine". My expectation was that "online-magazine", "best-magazine" and "friend-of-magazines" would not appear. However, all seven records were returned, including these three.

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.prefixes": "magazine" }}
            ]
        }
    }
}

Is there a way to filter out the prefix matches that occur after a hyphen?

Recommended answer

You're on the right path. However, you also need to add another analyzer that uses the edge-ngram token filter in order to make the "starts with" constraint work. You can keep the ngram analyzer for checking that a field "contains" a given word, but you need edge-ngram to check that a field "starts with" some token.
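One likely reason the original attempt also matches the hyphenated names is its tokenizer: the lowercase tokenizer breaks text at every non-letter character, so a hyphen ends a token, whereas the keyword tokenizer emits the whole input as a single token. The following Python sketch only emulates that behavior for ASCII input; the function names are hypothetical and it does not call Elasticsearch.

```python
import re


def lowercase_tokenizer(text):
    # Emulates the `lowercase` tokenizer for ASCII input (an assumption):
    # split on any non-letter character and lowercase each term.
    return re.findall(r"[a-z]+", text.lower())


def keyword_then_lowercase(text):
    # Emulates the `keyword` tokenizer followed by the `lowercase` filter:
    # the whole input stays one token, hyphens included.
    return [text.lower()]


print(lowercase_tokenizer("online-magazine"))     # ['online', 'magazine']
print(keyword_then_lowercase("online-magazine"))  # ['online-magazine']
```

With the first tokenization, "magazine" becomes a token of its own, so any prefix check on it succeeds. With the second, the only token starts with "online".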

PUT /sample
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "edgenGram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "nGram_filter"
          ]
        },
        "edge_ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "edgenGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "analyzer": "edge_ngram_index_analyzer",
              "search_analyzer": "standard"
            },
            "substrings": {
              "type": "string",
              "analyzer": "ngram_index_analyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
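To see why the two sub-fields behave differently, the token sets can be emulated in plain Python. This is a rough sketch, not Elasticsearch itself: ngrams and edge_ngrams are hypothetical helpers that mimic the two filters applied to a single lowercased keyword token. It also assumes the single-word query term reaches the index unchanged, which is what the standard search analyzer should do for "play" and "magazine".

```python
def ngrams(token, min_gram=2, max_gram=20):
    # All substrings of length min_gram..max_gram, as an ngram filter would emit.
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_gram=2, max_gram=20):
    # Only the substrings anchored at the start of the token.
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


names = [
    "magazineplayon", "magazineofhorses", "online-magazine",
    "best-magazine", "friend-of-magazines", "magazineplaygames",
]

# A match query on a sub-field succeeds when the query term is among its tokens.
contains_play = [n for n in names if "play" in ngrams(n.lower())]
starts_with_magazine = [n for n in names if "magazine" in edge_ngrams(n.lower())]

print(contains_play)         # ['magazineplayon', 'magazineplaygames']
print(starts_with_magazine)  # ['magazineplayon', 'magazineofhorses', 'magazineplaygames']
```

Because the keyword tokenizer keeps "online-magazine" as one token, its edge n-grams all start with "on", so "magazine" is never indexed in the prefixes sub-field for that document.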

Then your query becomes the following (i.e. search for all documents whose name field contains play or starts with magazine):

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.substrings": "play" }},
                {"match": { "name.prefixes": "magazine" }}
            ]
        }
    }
}

Note: don't use wildcard queries to search for substrings, as they can severely degrade the performance of your cluster.
