如何明智地结合 shingles 和 edgeNgram 来提供灵活的全文搜索? [英] How to wisely combine shingles and edgeNgram to provide flexible full text search?

查看：21 发布时间：2021/12/3 8:05:53 regex elasticsearch lucene odata analyzer

本文介绍了如何明智地结合 shingles 和 edgeNgram 来提供灵活的全文搜索?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我们有一个符合 OData 的 API，可将部分全文搜索需求委托给 Elasticsearch 集群.由于 OData 表达式可能变得非常复杂，我们决定将它们简单地转换为等效的 Lucene 查询语法，并将其提供给 query_string 查询.

We have an OData-compliant API that delegates some of its full text search needs to an Elasticsearch cluster. Since OData expressions can get quite complex, we decided to simply translate them into their equivalent Lucene query syntax and feed it into a query_string query.

我们确实支持一些与文本相关的 OData 过滤器表达式，例如:

We do support some text-related OData filter expressions, such as:

startswith(field,'bla')
endswith(field,'bla')
substringof('bla',field)
name eq 'bla'

我们匹配的字段可以是 analyzed、not_analyzed 或两者(即通过多字段).搜索到的文本可以是单个标记(例如table)，也可以是其中的一部分(例如tab)，也可以是多个标记(例如table 1.)>、table 10 等).搜索必须不区分大小写.

The fields we're matching against can be analyzed, not_analyzed or both (i.e. via a multi-field). The searched text can be a single token (e.g. table), only a part thereof (e.g. tab), or several tokens (e.g. table 1., table 10, etc). The search must be case-insensitive.

以下是我们需要支持的一些行为示例:

Here are some examples of the behavior we need to support:

startswith(name,'table 1') 必须匹配Table 1"、table 100"、表 1.5"、"表 112 上层"
endswith(name,'table 1') 必须匹配 "Room 1, Table 1", "Subtable 1", "表 1"、杰夫 表 1"
substringof('table 1',name) 必须匹配 "Big Table 1 back", "table 1", "表 1"、小表 12"
name eq 'table 1' 必须匹配Table 1"、TABLE 1"、table 1"强>"

startswith(name,'table 1') must match "Table 1", "table 100", "Table 1.5", "table 112 upper level"
endswith(name,'table 1') must match "Room 1, Table 1", "Subtable 1", "table 1", "Jeff table 1"
substringof('table 1',name) must match "Big Table 1 back", "table 1", "Table 1", "Small Table12"
name eq 'table 1' must match "Table 1", "TABLE 1", "table 1"

所以基本上，我们接受用户输入(即传递给startswith/endswith的第二个参数的内容，以及substringof<的第一个参数/code>，分别是 eq 的右侧值)并尝试完全匹配它，无论标记完全匹配还是仅部分匹配.


So basically, we take the user input (i.e. what is passed into the 2nd parameter of startswith/endswith, resp. the 1st parameter of substringof, resp. the right-hand side value of the eq) and try to match it exactly, whether the tokens fully match or only partially.
现在，我们正在使用下面突出显示的笨拙解决方案，该解决方案运行良好，但远非理想.
Right now, we're getting away with a clumsy solution highlighted below which works pretty well, but is far from being ideal.
在我们的 query_string 中，我们使用 正则表达式语法.由于该字段是 not_analyzed 并且搜索必须不区分大小写，因此我们在准备正则表达式以提供给查询的同时进行自己的标记化以提出类似的内容，即这是等效的到 OData 过滤器 endswith(name,'table 8')(=> 匹配 name 以table 8"结尾的所有文档)
In our query_string, we match against a not_analyzed field using the Regular Expression syntax. Since the field is not_analyzed and the search must be case-insensitive, we do our own tokenizing while preparing the regular expression to feed into the query in order to come up with something like this, i.e. this is equivalent to the OData filter endswith(name,'table 8') (=> match all documents whose name ends with "table 8")
  "query": {
    "query_string": {
      "query": "name.raw:/.*(T|t)(A|a)(B|b)(L|l)(E|e) 8/",
      "lowercase_expanded_terms": false,
      "analyze_wildcard": true
    }
  }

所以，即使这个解决方案运行良好，性能也不算太差(结果出人意料)，我们还是想以不同的方式来做，并利用分析器的全部功能来改变这一切在索引时间而不是搜索时间负担.但是，由于重新索引我们的所有数据需要数周时间，因此我们希望首先调查是否存在可以帮助我们实现上述相同搜索要求的令牌过滤器和分析器的良好组合.
So, even though, this solution works pretty well and the performance is not too bad (which came out as a surprise), we'd like to do it differently and leverage the full power of analyzers in order to shift all this burden at indexing time instead of searching time. However, since reindexing all our data will take weeks, we'd like to first investigate if there's a good combination of token filters and analyzers that would help us achieve the same search requirements enumerated above.
我的想法是，理想的解决方案将包含一些明智的混合带状疱疹(即几个标记在一起)和 edge-nGram(即在标记的开头或结尾匹配).但是，我不确定的是是否可以让它们一起工作以匹配多个令牌，其中一个令牌可能不是由用户完全输入的).例如，如果索引名称字段是Big Table 123"，我需要 substringof('table 1',name) 来匹配它，所以table"是一个完全匹配的标记，而1" 只是下一个标记的前缀.
My thinking is that the ideal solution would contain some wise mix of shingles (i.e. several tokens together) and edge-nGram (i.e. to match at the start or end of a token). What I'm not sure of, though, is whether it is possible to make them work together in order to match several tokens, where one of the tokens might not be fully input by the user). For instance, if the indexed name field is "Big Table 123", I need substringof('table 1',name) to match it, so "table" is a fully matched token, while "1" is only a prefix of the next token.
提前感谢您分享您的脑细胞.
Thanks in advance for sharing your braincells on this one.
更新 1:在测试 Andrei 的解决方案后
=> 完全匹配 (eq) 和 startswith 完美工作.
=> Exact match (eq) and startswith work perfectly.
A.endswith 小故障
搜索 substringof('table 112', name) 会产生 107 个文档.搜索更具体的情况，例如 endswith(name, 'table 112') 会产生 1525 个文档，而它应该会产生较少的文档(后缀匹配应该是子字符串匹配的子集).更深入地检查我发现了一些不匹配的内容，例如Social Club, Table 12"(不包含112")或Order 312"(既不包含table"也不包含112").我想这是因为它们以12"结尾，这是标记112"的有效克，因此匹配.
Searching for substringof('table 112', name) yields 107 docs. Searching for a more specific case such as endswith(name, 'table 112') yields 1525 docs, while it should yield less docs (suffix matches should be a subset of substring matches). Checking in more depth I've found some mismatches, such as "Social Club, Table 12" (doesn't contain "112") or "Order 312" (contains neither "table" nor "112"). I guess it's because they end with "12" and that's a valid gram for the token "112", hence the match.
B.substringof 小故障
搜索 substringof('table',name) 匹配Party table"、Alex on big table"但不匹配Table 1"、table 112"等.对于 substringof('tabl',name) 不匹配任何东西
Searching for substringof('table',name) matches "Party table", "Alex on big table" but doesn't match "Table 1", "table 112", etc. Searching for substringof('tabl',name) doesn't match anything
更新 2
这有点暗示，但我忘了明确提到该解决方案必须使用 query_string 查询，主要是因为 OData 表达式(无论它们可能多么复杂)将不断被翻译成他们的 Lucene 等价物.我知道我们正在用 Elasticsearch Query DSL 的强大功能与 Lucene 的查询语法进行权衡，Lucene 的查询语法有点不强大，表达能力也不强，但这是我们无法真正改变的.不过，我们非常接近！
It was sort of implied but I forgot to explicitely mention that the solution will have to work with the query_string query, mainly due to the fact that the OData expressions (however complex they might be) will keep getting translated into their Lucene equivalent. I'm aware that we're trading off the power of the Elasticsearch Query DSL with the Lucene's query syntax, which is a bit less powerful and less expressive, but that's something that we can't really change. We're pretty d**n close, though!
更新 3(2019 年 6 月 25 日):
ES 7.2 引入了一种名为 search_as_you_type 的新数据类型，它本机允许这种行为.阅读更多信息:https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
推荐答案
这是一个有趣的用例.这是我的看法:
This is an interesting use case. Here's my take:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter" : ["lowercase","reverse","substring","reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}

my_ngram_analyzer 用于将每个文本分成小块，这些小块的大小取决于您的用例.出于测试目的，我选择了 25 个字符.lowercase 使用，因为你说不区分大小写.基本上，这是用于 substringof('table 1',name) 的标记器.查询很简单:




my_ngram_analyzer is used to split every text into small pieces, how large the pieces are depends on your use case. I chose, for testing purposes, 25 chars. lowercase is used since you said case-insensitive. Basically, this is the tokenizer used for substringof('table 1',name). The query is simple:

{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}

my_edge_ngram_analyzer 用于从头开始拆分文本，这专门用于 startswith(name,'table 1') 用例.同样，查询很简单:




my_edge_ngram_analyzer is used to split the text starting from the beginning and this is specifically used for the startswith(name,'table 1') use case. Again, the query is simple:

{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}

我发现这是最棘手的部分 - endswith(name,'table 1') 的部分.为此，我定义了 my_reverse_edge_ngram_analyzer，它使用 keyword 分词器、lowercase 和 edgeNGram 过滤器，前后跟反向过滤器.这个分词器的主要作用是在 edgeNGrams 中分割文本，但边缘是文本的结尾，而不是开头(就像常规的 edgeNGram 一样).查询:




I found this the most tricky part - the one for endswith(name,'table 1'). For this I defined my_reverse_edge_ngram_analyzer which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this tokenizer basically does is to split the text in edgeNGrams but the edge is the end of the text, not the start (like with the regular edgeNGram).
The query:

{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}

对于 name eq 'table 1' 情况，一个简单的 keyword 标记器和 lowercase 过滤器应该可以做到查询:




for the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it
The query:

{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}

<小时>
关于 query_string，这稍微改变了解决方案，因为我指望 term 不分析输入文本并匹配它与索引中的术语完全一致.




Regarding query_string, this changes the solution a bit, because I was counting on term to not analyze the input text and to match it exactly with one of the terms in the index.
但这可以用 query_string 模拟" 如果为其指定了适当的analyzer.
But this can be "simulated" with query_string if the appropriate analyzer is specified for it.
解决方案将是一组如下查询(始终使用该分析器，仅更改字段名称):
The solution would be a set of queries like the following (always use that analyzer, changing only the field name):
{
  "query": {
    "query_string": {
      "query": "text.starts_with:("table 1")",
      "analyzer": "lowercase_keyword"
    }
  }
}


                        这篇关于如何明智地结合 shingles 和 edgeNgram 来提供灵活的全文搜索?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！


                    
                        查看全文

如何明智地结合 shingles 和 edgeNgram 来提供灵活的全文搜索? [英] How to wisely combine shingles and edgeNgram to provide flexible full text search?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何明智地结合 shingles 和 edgeNgram 来提供灵活的全文搜索? [英] How to wisely combine shingles and edgeNgram to provide flexible full text search?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭