How to filter tokens based on a regex in ElasticSearch

Problem description

For an ElasticSearch query we want to handle words (i.e. tokens consisting only of letters) and non-words differently. To do this, we are trying to define two analyzers, one returning only the words and the other only the non-words.

For example, we have documents describing products for a hardware store:

{
    "name": "Torx drive T9",
    "category": "screws",
    "size": 2.5
}

The user would then search for "Torx T9" and expect to find this document. Searching for "T9" alone would be too generic and return too many irrelevant products, so we only want to search for the "T9" term if we have already found "Torx".

We try to create a query like this:

{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "words"
                    }
                }
            },
            "should": {
                "match": {
                    "name": {
                        "query": "Torx T9",
                        "analyzer": "nonwords"
                    }
                }
            }
        }
    }
}

The idea was that it would be simple to create token filters to do this. For example, something like:

"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern",
        "pattern": "\\A\\p{L}*\\Z"
      },
      "nonwords": {
        "type": "pattern",
        "pattern": "\\P{L}"
      }
    }
  }
}

But there doesn't seem to be a filter that simply matches on patterns. Instead we (ab)use the pattern_replace filter:

"settings": {
  "analysis": {
    "filter": {
      "words": {
        "type": "pattern_replace",
        "pattern": "\\A((?=.*\\P{L}).*)",
        "replacement": ""
      },
      "nonwords": {
        "type": "pattern_replace",
        "pattern": "\\A((?!.*\\P{L}).*)",
        "replacement": ""
      },
      "nonempty": {
        "type": "length",
        "min": 1
      }
    }
  }
}
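
The fragment above defines only token filters; the query refers to analyzers named words and nonwords, which the question does not show. As a rough sketch, the wiring might look like the following, written as a Python dict that serializes to the settings JSON. The filter definitions are copied from the question; the choice of the whitespace tokenizer and the lowercase filter is an assumption, not something the question states.

```python
import json

# Filter definitions copied from the question. The "analyzer" section is an
# assumption: the question names the analyzers but does not show them.
settings = {
    "analysis": {
        "filter": {
            "words": {
                "type": "pattern_replace",
                "pattern": r"\A((?=.*\P{L}).*)",
                "replacement": "",
            },
            "nonwords": {
                "type": "pattern_replace",
                "pattern": r"\A((?!.*\P{L}).*)",
                "replacement": "",
            },
            "nonempty": {"type": "length", "min": 1},
        },
        "analyzer": {
            # Assumed wiring: split on whitespace, lowercase, blank out the
            # unwanted tokens, then drop the empty ones.
            "words": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "words", "nonempty"],
            },
            "nonwords": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["lowercase", "nonwords", "nonempty"],
            },
        },
    }
}

# Serialize to the JSON body that would be sent when creating the index.
print(json.dumps({"settings": settings}, indent=2))
```

The order of the filter list matters in this sketch: nonempty has to run after the pattern_replace filter so that it sees the emptied tokens.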

The pattern_replace filters replace the unwanted tokens with the empty token, which can then be removed by the nonempty filter. This seems to work, but the required patterns are more obscure.
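
To see what the two patterns do to individual tokens, here is a small simulation in Python. It is only an approximation: Python's re module does not support \p{L}, so the character class [\W\d_] stands in for \P{L} ("not a letter"), and the helper apply_filters is a hypothetical name for the replace-then-drop-empty chain, not an ElasticSearch API.

```python
import re

# Rough stand-in for Java's \P{L}; Python's re has no Unicode property classes.
NON_LETTER = r"[\W\d_]"

# "words" blanks out any token that contains a non-letter.
WORDS = re.compile(r"\A((?=.*" + NON_LETTER + r").*)")
# "nonwords" blanks out any token that consists only of letters.
NONWORDS = re.compile(r"\A((?!.*" + NON_LETTER + r").*)")


def apply_filters(tokens, pattern):
    """Mimic pattern_replace followed by a length filter with min 1."""
    replaced = [pattern.sub("", token) for token in tokens]
    return [token for token in replaced if len(token) >= 1]


tokens = "torx t9".split()
print(apply_filters(tokens, WORDS))     # ['torx']
print(apply_filters(tokens, NONWORDS))  # ['t9']
```

The lookahead is what makes the patterns hard to read: each filter has to match the tokens it wants to discard, so the words filter is written in terms of non-letters and vice versa.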

Is there a better way to express this?

Recommended answer

You can try a query_string query with default_operator set to "AND" for your requirement.

For example, suppose you are indexing two strings, "Torx drive T9" and "Square drive T9". If you index them with an analyzer that splits on whitespace and lowercases the result (the pattern analyzer lowercases by default), the strings are analyzed into the following tokens:

First document: torx, drive, and t9.
Second document: square, drive, and t9.

Then using a query_string query with the default operator set to AND to match documents will produce the expected result.

Sample mapping

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace": {
          "type": "pattern",
          "pattern": "\\s+"
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

Sample query

{
  "query": {
    "query_string": {
      "default_field": "name",
      "query": "Torx T9",
      "default_operator": "AND"
    }
  }
}

This query yields results only when both torx and t9 are present in the document.
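
The AND behaviour can be illustrated with a few lines of Python. This is a toy model, not how ElasticSearch evaluates the query: analyze and matches_all are hypothetical helpers that lowercase and split on whitespace, then require every query term to appear among the document's tokens.

```python
def analyze(text):
    """Lowercase and split on whitespace, like the sample analyzer."""
    return text.lower().split()


def matches_all(query, document):
    """Return True only if every query term occurs in the document (AND)."""
    doc_tokens = set(analyze(document))
    return all(term in doc_tokens for term in analyze(query))


docs = ["Torx drive T9", "Square drive T9"]
hits = [doc for doc in docs if matches_all("Torx T9", doc)]
print(hits)  # ['Torx drive T9']
```

With an OR operator, the second document would also match on t9 alone, which is the overly generic behaviour the question wants to avoid.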
