Elasticsearch exact matches on analyzed fields

Problem Description

Is there a way to have ElasticSearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem and perhaps even phoneticize my docs, then have queries pull "exact" matches out.

What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.

I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal or so?

I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.

Recommended Answer

Use the shingle token filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.

At search time, you need to add an additional filter that matches the number of tokens in the index against the number of tokens in the search text. This requires an extra step when you perform the actual search: counting the tokens in the search string. It's needed because shingles create multiple combinations of tokens, and you have to make sure the match corresponds to the size of your search text.
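
One way to get that number is the _analyze API: run the search text through the same analyzer and count the distinct position values in the response. Counting raw tokens would give too high a number, because shingles emit overlapping tokens that share positions, and the token_count field counts position increments by default, not raw tokens. A minimal sketch, assuming an index named test with the ShingleAnalyzer defined below (older versions pass analyzer and text as query-string parameters instead of a JSON body):

GET /test/_analyze
{
  "analyzer": "ShingleAnalyzer",
  "text": "HaMbUrGeRs BUN"
}

Each token in the response carries a position; here there are two distinct positions, so 2 is the value to use in the word_count filter.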

An attempt for this, just to give you an idea:

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "filter_stemmer": {
          "type": "porter_stem",
          "language": "_english_"
        }
      },
      "analyzer": {
        "ShingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "filter_stemmer",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ShingleAnalyzer",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "ShingleAnalyzer"
            }
          }
        }
      }
    }
  }
}
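
Side note: this mapping uses the pre-5.x syntax. A rough sketch of the same mapping on Elasticsearch 5.x and later (where string became text, store takes a boolean, and 7.x+ drops mapping types), assuming the same analysis settings:

{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "ShingleAnalyzer",
        "fields": {
          "word_count": {
            "type": "token_count",
            "analyzer": "ShingleAnalyzer"
          }
        }
      }
    }
  }
}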

And the query:

{
  "query": {
    "filtered": {
      "query": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": "2"
        }
      }
    }
  }
}
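
Note that the filtered query was deprecated in Elasticsearch 2.0 and removed in 5.0. On newer versions, the equivalent is a bool query with a filter clause:

{
  "query": {
    "bool": {
      "must": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": 2
        }
      }
    }
  }
}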

The shingle filter is important here because it can create combinations of tokens. More than that, these are combinations that keep the order of the tokens. In my opinion, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing, etc.) and also to assemble them back into the original text. Unless you define your own "concatenation" filter, I don't think there is any other way than using the shingle filter.

But with shingles there is another issue: it creates combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:

          "angeles",
          "buns",
          "buns in",
          "buns in los",
          "buns in los angeles",
          "hamburgers",
          "hamburgers buns",
          "hamburgers buns in",
          "hamburgers buns in los",
          "hamburgers buns in los angeles",
          "in",
          "in los",
          "in los angeles",
          "los",
          "los angeles"

If you are only interested in documents that match exactly (meaning the document above should match only a search for "hamburgers buns in los angeles" and not something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.

