How to wisely combine shingles and edgeNgram to provide flexible full text search?


Question

We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster. Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed it into a query_string query.

We do support some text-related OData filter expressions, such as:

  • startswith(field,'bla')
  • endswith(field,'bla')
  • substringof('bla',field)
  • name eq 'bla'

The fields we're matching against can be analyzed, not_analyzed or both (i.e. via a multi-field). The searched text can be a single token (e.g. table), only a part thereof (e.g. tab), or several tokens (e.g. table 1., table 10, etc). The search must be case-insensitive.

Here are some examples of the behavior we need to support:

  • startswith(name,'table 1') must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"
  • endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"
  • substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"
  • name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"

So basically, we take the user input (i.e. the 2nd parameter of startswith/endswith, the 1st parameter of substringof, or the right-hand side value of the eq) and try to match it exactly, whether the tokens fully match or only partially.

Right now, we're getting away with a clumsy solution highlighted below which works pretty well, but is far from being ideal.

In our query_string, we match against a not_analyzed field using the Regular Expression syntax. Since the field is not_analyzed and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query in order to come up with something like this, i.e. this is equivalent to the OData filter endswith(name,'table 8') (=> match all documents whose name ends with "table 8")

  "query": {
    "query_string": {
      "query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
      "lowercase_expanded_terms": false,
      "analyze_wildcard": true
    }
  }
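For illustration, here is a minimal sketch of how such a regular expression can be assembled from the user input. This is plain Python; the helper name build_endswith_regexp is made up for this example, and escaping of regex metacharacters in the input is omitted:

# Hypothetical helper that builds the case-insensitive regexp for an
# endswith filter against a not_analyzed field.
def build_endswith_regexp(field: str, text: str) -> str:
    # Expand every letter into an (Upper|lower) alternation; digits and
    # spaces pass through unchanged (real input would also need regex escaping).
    pattern = "".join(
        f"({ch.upper()}|{ch.lower()})" if ch.isalpha() else ch
        for ch in text
    )
    return f"{field}:/.*{pattern}/"

print(build_endswith_regexp("name.raw", "table 8"))
# -> name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/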



So, even though, this solution works pretty well and the performance is not too bad (which came out as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden at indexing time instead of searching time. However, since reindexing all our data will take weeks, we'd like to first investigate if there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.

My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. matching at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together to match several tokens, where one of the tokens might not be fully typed by the user. For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name) to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
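As a tiny illustration of the target semantics (plain Python, no Elasticsearch involved), substringof should behave like a case-insensitive containment check; the whole question is how to reproduce this through analyzers at index time:

# Naive reference behavior for substringof('table 1', name):
def naive_substringof(query: str, value: str) -> bool:
    return query.lower() in value.lower()

print(naive_substringof("table 1", "Big Table 123"))  # True - the desired match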

Thanks in advance for sharing your braincells on this one.

UPDATE 1: Testing Andrei's solution

=> Exact match (eq) and startswith work perfectly.

A. endswith glitches

Searching for substringof('table 112', name) yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112') yields 1525 docs, while it should yield fewer (suffix matches should be a subset of substring matches). Digging deeper, I've found some mismatches, such as "Social Club, Table 12" (which doesn't contain "112") or "Order 312" (which contains neither "table" nor "112"). I guess it's because they end with "12", which is a valid gram for the token "112", hence the match.
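To make that suspicion concrete, here is a small simulation (plain Python, hypothetical function name) of what the reverse + edgeNGram(2,25) + reverse chain emits for the ends_with sub-field, namely every suffix of the lowercased text between 2 and 25 characters long:

# Simulates the suffix grams produced for the ends_with sub-field.
def suffix_grams(text: str, min_gram: int = 2, max_gram: int = 25) -> set:
    text = text.lower()
    return {text[-n:] for n in range(min_gram, min(max_gram, len(text)) + 1)}

doc_grams = suffix_grams("Social Club, Table 12")
query_grams = suffix_grams("table 112")  # query_string analyzes the input the same way
print(doc_grams & query_grams)           # {'12'} - the spurious overlap behind the match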

B. substringof glitches

Searching for substringof('table',name) matches "Party table" and "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything.

UPDATE 2

It was sort of implied but I forgot to explicitly mention that the solution will have to work with the query_string query, mainly because the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalents. I'm aware that we're trading the power of the Elasticsearch Query DSL for Lucene's query syntax, which is a bit less powerful and less expressive, but that's something we can't really change. We're pretty d**n close, though!

Answer

This is an interesting use case. Here's my take:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter" : ["lowercase","reverse","substring","reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
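One way to sanity-check these analyzers before committing to a weeks-long reindex is the _analyze API. Here is a hedged sketch using the official elasticsearch-py client (the index name test is an assumption, and the exact parameter shape of indices.analyze varies between client versions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Print the tokens each custom analyzer emits for a sample input.
for analyzer in ("my_ngram_analyzer", "my_edge_ngram_analyzer",
                 "my_reverse_edge_ngram_analyzer", "lowercase_keyword"):
    resp = es.indices.analyze(index="test",
                              body={"analyzer": analyzer, "text": "Table 112"})
    print(analyzer, [t["token"] for t in resp["tokens"]])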

  • my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case. I chose 25 chars for testing purposes. lowercase is used since you said case-insensitive. Basically, this is the tokenizer used for substringof('table 1',name). The query is simple:
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}

  • my_edge_ngram_analyzer is used to split the text starting from the beginning, and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}

  • I found this the most tricky part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams where the edge is the end of the text, not the start (as with the regular edgeNGram). The query:
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}

  • For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}

Regarding query_string, this changes the solution a bit, because I was counting on term to not analyze the input text and to match it exactly with one of the terms in the index.

But this can be "simulated" with query_string if the appropriate analyzer is specified for it.

The solution would be a set of queries like the following (always use that analyzer, changing only the field name):

{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}

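Putting it together, the OData-to-Lucene translation layer then only has to route each operator to the right sub-field while always pinning the lowercase_keyword analyzer. A minimal sketch in Python (the routing table and function name are illustrative, and escaping/quoting of user input is omitted):

# Illustrative mapping from OData operator to the sub-fields defined above.
FIELD_FOR_OP = {
    "substringof": "text",
    "startswith": "text.starts_with",
    "endswith": "text.ends_with",
    "eq": "text.exact_case_insensitive_match",
}

def to_query_string(op: str, value: str) -> dict:
    # Always the same analyzer; only the field name changes.
    return {
        "query": {
            "query_string": {
                "query": '{}:("{}")'.format(FIELD_FOR_OP[op], value),
                "analyzer": "lowercase_keyword",
            }
        }
    }

print(to_query_string("endswith", "table 1"))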