Elasticsearch exact matches on analyzed fields

Question

Is there a way to have Elasticsearch identify exact matches on analyzed fields? Ideally, I would like to lowercase, tokenize, stem, and perhaps even phoneticize my docs, then have queries pull "exact" matches out.

What I mean is that if I index "Hamburger Buns" and "Hamburgers", they will be analyzed as ["hamburger","bun"] and ["hamburger"]. If I search for "Hamburger", it will only return the "hamburger" doc, as that's the "exact" match.

I've tried using the keyword tokenizer, but that won't stem the individual tokens. Do I need to do something to ensure that the number of tokens is equal, or something along those lines?

I'm familiar with multi-fields and using the "not_analyzed" type, but this is more restrictive than I'm looking for. I'd like exact matching, post-analysis.

Solution

Use a shingle token filter together with stemming and whatever else you need. Add a sub-field of type token_count that will count the number of tokens in the field.

At search time, you need to add an additional filter to match the number of tokens in the index against the number of tokens in the search text. This requires an extra step when you perform the actual search: counting the tokens in the search string (see the sketch after the query below). It works this way because shingles create multiple permutations of tokens, and you need to make sure the count matches the size of your search text.

An attempt at this, just to give you an idea:

{
  "settings": {
    "analysis": {
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "filter_stemmer": {
          "type": "porter_stem",
          "language": "_english_"
        }
      },
      "analyzer": {
        "ShingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "snowball",
            "filter_stemmer",
            "filter_shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "ShingleAnalyzer",
          "fields": {
            "word_count": {
              "type": "token_count",
              "store": "yes",
              "analyzer": "ShingleAnalyzer"
            }
          }
        }
      }
    }
  }
}
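
To try this out, you could first index the two example documents from the question. This is a hypothetical smoke test rather than part of the original answer; it assumes Elasticsearch on localhost:9200, the Python requests library, and the legacy /index/type/id URL layout that matches the mapping above:

import requests

ES = "http://localhost:9200"

# Index the two documents from the question under the legacy /index/type/id layout.
for doc_id, text in enumerate(["Hamburger Buns", "Hamburgers"], start=1):
    requests.put(f"{ES}/test/test/{doc_id}", json={"text": text}).raise_for_status()

# Refresh so the documents are immediately visible to search.
requests.post(f"{ES}/test/_refresh").raise_for_status()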

And the query:

{
  "query": {
    "filtered": {
      "query": {
        "match_phrase": {
          "text": {
            "query": "HaMbUrGeRs BUN"
          }
        }
      },
      "filter": {
        "term": {
          "text.word_count": "2"
        }
      }
    }
  }
}
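
Putting the search-time step together, here is a minimal sketch (my illustration, not code from the original answer). It assumes Elasticsearch on localhost:9200, the index and ShingleAnalyzer defined above, and the Python requests library; the JSON-body form of _analyze used here is the newer syntax, while versions contemporary with this mapping pass the analyzer and text as URL parameters instead. Counting distinct token positions lines up with the token_count sub-field, because token_count counts position increments by default and shingles share a position with the unigram they start at:

import requests

ES = "http://localhost:9200"

def count_words(index, text):
    # Analyze with the plain "standard" analyzer and count distinct positions.
    # Shingles share a position with the unigram they start at, so a plain
    # word count lines up with what the token_count sub-field stores.
    resp = requests.post(f"{ES}/{index}/_analyze",
                         json={"analyzer": "standard", "text": text})
    resp.raise_for_status()
    return len({t["position"] for t in resp.json()["tokens"]})

def exact_search(index, text):
    # match_phrase for the analyzed match, plus a term filter on word_count
    # to reject partial matches, mirroring the query above.
    body = {
        "query": {
            "filtered": {
                "query": {"match_phrase": {"text": {"query": text}}},
                "filter": {"term": {"text.word_count": count_words(index, text)}},
            }
        }
    }
    resp = requests.post(f"{ES}/{index}/_search", json=body)
    resp.raise_for_status()
    return resp.json()["hits"]["hits"]

# "HaMbUrGeRs BUN" has two words, so only docs whose word_count is 2 can match.
print([hit["_source"]["text"] for hit in exact_search("test", "HaMbUrGeRs BUN")])

(The filtered query was removed in Elasticsearch 5.0; on newer versions a bool query with a filter clause plays the same role.)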

The shingle filter is important here because it can create combinations of tokens. More than that, these are combinations that keep the order of the tokens. IMO, the most difficult requirement to fulfill here is to change the tokens (stemming, lowercasing, etc.) and also to assemble the original text back. Unless you define your own "concatenation" filter, I don't think there is any other way than using the shingle filter.

But with shingles there is another issue: they create combinations that are not needed. For a text like "Hamburgers buns in Los Angeles" you end up with a long list of shingles:

          "angeles",
          "buns",
          "buns in",
          "buns in los",
          "buns in los angeles",
          "hamburgers",
          "hamburgers buns",
          "hamburgers buns in",
          "hamburgers buns in los",
          "hamburgers buns in los angeles",
          "in",
          "in los",
          "in los angeles",
          "los",
          "los angeles"

If you are interested only in documents that match exactly (meaning the document above should match only when you search for "hamburgers buns in los angeles", and should not match something like "any hamburgers buns in los angeles"), then you need a way to filter that long list of shingles. The way I see it is to use word_count.
