Elasticsearch: index a field with keyword tokenizer but without stopwords


Question

I am looking for a way to search company names with keyword tokenizing but without stopwords.

For example: the indexed company name is "Hansel und Gretel Gmbh."

Here "und" and "Gmbh" are stop words for the company name.

If the search term is "Hansel Gretel", that document should be found. If the search term is "Hansel", then no document should be found. And if the search term is "hansel gmbh", no document should be found either.

I have tried to combine the keyword tokenizer with stop words in a custom analyzer, but it didn't work (as expected, I guess).

I have also tried to use a common terms query, but "Hansel" started to hit (again, as expected).

Thanks in advance.

Answer

There are two ways: the bad and the ugly. The first one uses regular expressions to remove stop words and trim spaces. It has a lot of drawbacks:

  • you have to handle whitespace tokenization (regexp \s+) and special-symbol (.,;) removal on your own
  • no highlighting is supported - the keyword tokenizer does not support it
  • case sensitivity is also a problem
  • normalizers (analyzers for keyword fields) are an experimental feature - poorly supported, with few features

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "normalizer": {
        "custom_normalizer": {
          "type": "custom",
          "char_filter": ["stopword_char_filter", "trim_char_filter"],
          "filter": ["lowercase"]
        }
      },
      "char_filter": {
        "stopword_char_filter": {
          "type": "pattern_replace",
          "pattern": "( ?und ?| ?gmbh ?)",
          "replacement": " "
        },
        "trim_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\s+)$",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "keyword",
          "normalizer": "custom_normalizer"
        }
      }
    }
  }
}'

Now we can check how our normalizer works (please note that _analyze requests with a normalizer are supported only in ES 6.x):

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "normalizer": "custom_normalizer",
  "text": "hansel und gretel gmbh"
}'
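
Assuming the char filters behave as defined above, the whole input should collapse into a single lowercased term with "und" and "gmbh" stripped. The response should look roughly like this (the exact shape is my expectation, not output from the original post; offsets refer to the original input):

{
  "tokens": [
    {
      "token": "hansel gretel",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}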

Now we are ready to index our document:

curl -XPUT "http://localhost:9200/test/file/1" -H 'Content-Type: application/json' -d'
{
  "name": "hansel und gretel gmbh"
}'

The last step is searching:

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel gretel"
            }
        }
    }
}'
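
As a quick sanity check (this extra query is my addition, assuming the same index as above), searching for just "hansel" should return no hits: the match query normalizes the term to "hansel", which does not equal the single stored value "hansel gretel".

curl -XGET "http://localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "match" : {
            "name" : {
                "query" : "hansel"
            }
        }
    }
}'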

The other way is to:

  • create a standard text analyzer with a stop-words filter
  • use the analyzer to filter out all stop words and special symbols
  • concatenate the tokens manually (see the sketch at the end)
  • send the resulting term to ES as a keyword

Here is a step-by-step example:

curl -XPUT "http://localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stopwords"]
        }
      }, "filter": {
        "custom_stopwords": {
          "type": "stop",
          "stopwords": ["und", "gmbh"]
        }
      }
    }
  },
  "mappings": {
    "file": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "custom_analyzer"
        }
      }
    }
  }
}' 

Now we are ready to analyze our text:

curl -XPOST "http://localhost:9200/test/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "custom_analyzer",
  "text": "Hansel und Gretel Gmbh."
}'

And the result is:

{
  "tokens": [
    {
      "token": "hansel",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gretel",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

The last step is token concatenation: hansel + gretel. The only drawback is the manual analysis step in custom client code.
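
The original answer leaves that concatenation to "custom code". Here is a minimal client-side sketch of what it could look like (my illustration, not from the original answer; name_joined is a hypothetical keyword field, and the sketch assumes the test index and custom_analyzer defined above, using only Python's standard library):

import json
import urllib.request

ES = "http://localhost:9200"

def es(method, path, body):
    """Tiny helper for JSON requests to Elasticsearch."""
    req = urllib.request.Request(
        ES + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def normalize(text):
    """Run text through custom_analyzer and join the surviving tokens:
    "Hansel und Gretel Gmbh." -> "hansel gretel"
    """
    tokens = es("POST", "/test/_analyze",
                {"analyzer": "custom_analyzer", "text": text})["tokens"]
    return " ".join(t["token"] for t in tokens)

# Index time: store the concatenated form in a keyword field
# (name_joined is hypothetical; it should be mapped as "type": "keyword").
es("PUT", "/test/file/1", {"name_joined": normalize("Hansel und Gretel Gmbh.")})

# Search time: normalize the user input the same way and require an exact match.
result = es("POST", "/test/_search",
            {"query": {"term": {"name_joined": normalize("hansel gretel")}}})
print(result["hits"]["total"])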
