Elasticsearch custom analyzer with ngram and without word delimiter on hyphens

Question

I am trying to index strings that contain hyphens but do not contain spaces, periods or any other punctuation. I do not want to split up the words based on hyphens, instead I would like to have the hyphens be part of the indexed text.

For example, my 6 text strings would be:


  • magazineplayon
  • magazineofhorses
  • online-magazine
  • best-magazine
  • friend-of-magazines
  • magazineplaygames

I would like to be able to search these strings for text containing "play" or for text starting with "magazine".

I have been able to use ngram to make the text containing "play" work properly. However, the hyphen is causing text to split and it is including results where "magazine" is in the word after a hyphen. I only want words starting at the beginning of the string with "magazine" to appear.

Based on the sample above, only these 3 should appear when beginning with "magazine":


  • magazineplayon
  • magazineofhorses
  • magazineplaygames

Please help with my Elasticsearch index sample:

DELETE /sample

PUT /sample
{
    "settings": {
        "index.number_of_shards":5,
        "index.number_of_replicas": 0,
        "analysis": {
            "filter": {
                "nGram_filter": {
                   "type": "nGram",
                   "min_gram": 2,
                   "max_gram": 20,
                   "token_chars": [
                      "letter",
                      "digit"
                   ]
                },
                "word_delimiter_filter": {
                    "type": "word_delimiter",
                    "preserve_original": true,
                    "catenate_all" : true
                }
             },
          "analyzer": {
            "ngram_index_analyzer": {
              "type" : "custom",
              "tokenizer": "lowercase",
              "filter" : ["nGram_filter", "word_delimiter_filter"]
            }
          }
        }
    }
}
PUT /sample/1/_create
{
    "name" : "magazineplayon"
}
PUT /sample/3/_create
{
    "name" : "magazineofhorses"
}
PUT /sample/4/_create
{
    "name" : "online-magazine"
}
PUT /sample/5/_create
{
    "name" : "best-magazine"
}
PUT /sample/6/_create
{
    "name" : "friend-of-magazines"
}
PUT /sample/7/_create
{
    "name" : "magazineplaygames"
}

GET /sample/_search
{
"query": {
        "wildcard": {
          "name": "*play*" 
        }
    }
}

GET /sample/_search
{
"query": {
        "wildcard": {
          "name": "magazine*" 
        }
    }
}
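For illustration, the splitting on hyphens can be reproduced outside Elasticsearch. The sketch below is a rough Python emulation, not Elasticsearch code: `lowercase_tokenize` is a hypothetical helper that assumes the `lowercase` tokenizer breaks text at every non-letter character and lowercases the pieces. Under that assumption the hyphen is gone before any token filter runs, so "magazine" becomes a token of its own.

```python
import re


def lowercase_tokenize(text):
    """Rough emulation of the `lowercase` tokenizer (hypothetical helper):
    split at every non-letter character and lowercase each piece."""
    return [piece.lower() for piece in re.findall(r"[^\W\d_]+", text)]


names = [
    "magazineplayon",
    "magazineofhorses",
    "online-magazine",
    "best-magazine",
    "friend-of-magazines",
    "magazineplaygames",
]

for name in names:
    # Hyphenated names are split into several tokens; the others stay whole.
    print(name, "->", lowercase_tokenize(name))
```

With this tokenization, "online-magazine" yields the separate tokens "online" and "magazine", which is why a search for names starting with "magazine" also matches the hyphenated names.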

Update 1

I updated all my create statements to use test after sample:

PUT /sample/test/7/_create
{
    "name" : "magazinefairplay"
}

I then ran the following command to return only names that had the word "play" in them instead of doing the wildcard search. This worked correctly and returned only two records.

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.substrings": "play" }}
            ]
        }
    }
}

I ran the following command to return only names that started with "magazine". My expectation was that "online-magazine", "best-magazine" and "friend-of-magazines" would not appear. However, all seven records were returned including these three.

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.prefixes": "magazine" }}
            ]
        }
    }
}

Is there a way to filter out the prefix where the hyphen is used?

Recommended answer

You're on the right path. However, you also need to add another analyzer that uses the edge-ngram token filter to make the "starts with" constraint work. You can keep the ngram analyzer for checking fields that "contain" a given word, but you need edge-ngram to check that a field "starts with" some token.

PUT /sample
{
  "settings": {
    "index.number_of_shards": 5,
    "index.number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "edgenGram_filter": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "nGram_filter"
          ]
        },
        "edge_ngram_index_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "edgenGram_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "prefixes": {
              "type": "string",
              "analyzer": "edge_ngram_index_analyzer",
              "search_analyzer": "standard"
            },
            "substrings": {
              "type": "string",
              "analyzer": "ngram_index_analyzer",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
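To see why the two analyzers behave differently, here is a minimal Python sketch of the terms each one would produce. It is an emulation under stated assumptions, not Elasticsearch code: `ngrams` and `edge_ngrams` are hypothetical helpers, and the `keyword` tokenizer is assumed to pass the whole lowercased string through as a single token, hyphens included.

```python
def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of `token` with length between min_gram and max_gram."""
    return {
        token[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(token) - size + 1)
    }


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Only the substrings that start at the first character of `token`."""
    longest = min(max_gram, len(token))
    return {token[:size] for size in range(min_gram, longest + 1)}


# The keyword tokenizer is assumed to keep each name as one token.
for name in ["magazineplayon", "online-magazine"]:
    token = name.lower()
    print(name)
    print("  'magazine' among ngrams:     ", "magazine" in ngrams(token))
    print("  'magazine' among edge ngrams:", "magazine" in edge_ngrams(token))
```

In this emulation "magazine" is a substring term of both names, but it is an edge term only of "magazineplayon", which is the distinction the `name.substrings` and `name.prefixes` sub-fields rely on. Note that with `max_gram` set to 20, prefixes longer than 20 characters are not indexed.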

Your query then becomes the following (i.e. search for all documents whose name field contains "play" or starts with "magazine"):

POST /sample/test/_search
{
    "query": {
        "bool": {
            "minimum_should_match": 1,
            "should": [
                {"match": { "name.substrings": "play" }},
                {"match": { "name.prefixes": "magazine" }}
            ]
        }
    }
}
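As a rough check of what this query should return for the six sample names, the following Python sketch evaluates the two `should` clauses in memory. Everything in it is illustrative: `index_terms` and `search` are hypothetical helpers that emulate the two sub-fields, and the search term is assumed to be a single lowercase word, so the standard search analyzer would leave it unchanged.

```python
NAMES = [
    "magazineplayon",
    "magazineofhorses",
    "online-magazine",
    "best-magazine",
    "friend-of-magazines",
    "magazineplaygames",
]


def index_terms(name, edge_only, min_gram=2, max_gram=20):
    """Emulate the terms indexed for one name: all n-grams of the whole
    lowercased string, or only those anchored at position 0."""
    text = name.lower()
    starts = [0] if edge_only else range(len(text))
    return {
        text[start:start + size]
        for start in starts
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    }


def search(substring=None, prefix=None):
    """Emulate a bool query whose `should` clauses are OR-ed together."""
    hits = []
    for name in NAMES:
        in_substrings = substring is not None and substring in index_terms(name, edge_only=False)
        in_prefixes = prefix is not None and prefix in index_terms(name, edge_only=True)
        if in_substrings or in_prefixes:
            hits.append(name)
    return hits


print(search(substring="play"))
print(search(prefix="magazine"))
print(search(substring="play", prefix="magazine"))
```

Because the clauses sit under `should` with `minimum_should_match` set to 1, a document matching either one is returned. To get only the names that start with "magazine", keep just the `name.prefixes` clause.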

Note: don't use wildcard queries to search for substrings, as they will kill the performance of your cluster (more info here and at https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html)
