How to practically use a keywordanalyzer in azure-search?

Question

This relates to, and continues, this question: Azure Search Analyzer

I want to use a keywordanalyzer for word collections.

We have documents (products) with different fields like product_name, brand, categorie and so on.
To implement a keyword based ranking (scoring) I would like to add a Collection(Edm.String) field which contains different (untokenized!!) keywords, like: "brown teddy" or "green bean".
To achieve this I thought about using a keywordanalyzer with the following definition:

// field definition:
{
"name": "keyWordList",
"type": "Collection(Edm.String)",
"analyzer": "keywordAnalyzer"
}
...

"analyzers": [ {
"name":"keywordAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"keywordTokenizer",
"tokenFilters":[ "lowercase", "classic" ]
} ]
...

"tokenizers": [ {
"name": "keywordTokenizer",
"@odata.type": "#Microsoft.Azure.Search.KeywordTokenizer"
} ]

Now, after uploading some documents, I can't find the fields by entering exactly the chosen keywords. For example, there is a document with the following field data:

"keyWordList": [ "Blue Bear", "blue bear", "blue bear123" ]

I'm not able to get any results with the following search:

{ search:"blue bear", count:"true", queryType:"full" }

This is what I have also tried:

  • using the predefined keywordanalyzer instead of a customized one -> no success
  • instead of using Collection(Edm.String), I tested it with a normal String field containing only one keyword -> no success
  • splitting up the analyzer in the field definition block into searchAnalyzer="lowercaseAnalyzer" and filterAnalyzer="keywordAnalyzer", and vice versa -> no success

In the end, the only way I could get a result was by sending the whole search phrase as a single term. But this should be done by the analyzer, right?!

{ search:"\"blue bear\"", count:"true", queryType:"full" }

Users don't know if they search for an existing keyword or perform a tokenized search. That's why this won't be an option.

Is there any solution to this issue? Or is there maybe a better / easier approach for this kind of keyword (high-scoring) search?

Thanks!

Recommended answer

Short answer:

The behavior you're observing is correct.

Semantically, your search query blue bear means: find all documents that match the term blue or the term bear. Since you are using the keyword tokenizer, the terms that you indexed are blue bear and blue bear123. The terms blue and bear individually don't exist in your index. That's why only the phrase query returns the result you are expecting.

Long answer:

Let me explain how the analyzer is applied during query processing and how it's applied during document indexing.

On the indexing side, the analyzer you defined processes elements of the keyWordList collection independently. The terms that end up in your inverted index are:

  • blue bear (because you're using the lowercase filter, blue bear and Blue Bear are analyzed to the same term)
  • blue bear123

As you'd expect, blue bear is one term - not split into two on the space - since you're using the keyword tokenizer. The same applies to blue bear123.
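The indexing behavior described above can be sketched in a few lines of Python. This is a toy model, not Azure Search code: the function names `keyword_analyze` and `build_inverted_index` and the document id `doc1` are made up for illustration, and the "classic" token filter from the question is left out because it does not change these particular values.

```python
def keyword_analyze(text):
    """Toy keyword tokenizer + lowercase filter: the whole input is one token."""
    return [text.lower()]


def build_inverted_index(doc_id, keywords):
    """Analyze each collection element on its own and map term -> document ids."""
    index = {}
    for element in keywords:
        for term in keyword_analyze(element):
            index.setdefault(term, set()).add(doc_id)
    return index


# Field data from the question; "doc1" is a hypothetical document id.
index = build_inverted_index("doc1", ["Blue Bear", "blue bear", "blue bear123"])
print(sorted(index))    # ['blue bear', 'blue bear123']
print("blue" in index)  # False: the single word was never indexed
```

The two differently cased spellings collapse into one term, and neither blue nor bear exists as a term of its own.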

On the query processing side, two things happen:

  1. Your search query is rewritten to: blue|bear (find documents with blue or bear). This is because searchMode=any is used by default. If you used searchMode=all, your search query would be rewritten to blue+bear (find documents with blue and bear).

The query parser takes your search query string and separates query operators (such as +, |, * etc.) from query terms. Then it decomposes the search query into subqueries of supported types, e.g., terms followed by the suffix operator '*' become a prefix query, quoted terms become a phrase query, etc. Terms that are not preceded or followed by any of the supported operators become individual term queries.

In your example, the query parser decomposed your query string blue|bear into two term queries with terms blue and bear respectively. The search engine looks for documents that match any of those queries (searchMode=any).
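To make the decomposition step concrete, here is a rough Python sketch of how a query string might be split into subqueries. It is a deliberately simplified stand-in for the real parser: `decompose` is a hypothetical name, and only whitespace, the |, + and * operators, and double quotes are handled.

```python
import re


def decompose(query):
    """Split a query string into (kind, text) subqueries, ignoring | and +."""
    subqueries = []
    # Either a double-quoted phrase, or a run of characters up to an operator.
    for piece in re.findall(r'"[^"]*"|[^\s|+"]+', query):
        if piece.startswith('"'):
            subqueries.append(("phrase", piece.strip('"')))
        elif piece.endswith("*"):
            subqueries.append(("prefix", piece[:-1]))
        else:
            subqueries.append(("term", piece))
    return subqueries


print(decompose('blue bear'))    # [('term', 'blue'), ('term', 'bear')]
print(decompose('"Blue Bear"'))  # [('phrase', 'Blue Bear')]
```

Without quotes, the two words are already separate subqueries before any analyzer sees them.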

  2. Query terms of the identified subqueries are processed by the search analyzer.

In your example, terms blue and bear are processed by the analyzer individually. They are not modified since they are already lowercase. None of those tokens exist in your index, thus no results are returned.

If your query looked as follows: "Blue Bear" (with quotes), it would be rewritten to "Blue Bear". Notice there is no change: the OR operator has not been put between the words, since now you're looking for a phrase. The query parser passes the entire phrase (two words) to the analyzer, which in turn outputs a single, lowercased term: blue bear. This token matches what's in your index.

The key lesson here is that the query parser processes the query string before the analyzers are applied. The analyzers are applied to individual terms of subqueries identified by the query parser.
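The order of operations (parse first, analyze second, match last) can be illustrated end to end with a small Python simulation. It assumes the two indexed terms from the example and a toy `search` function that only understands plain terms and quoted phrases. None of this is the actual Azure Search implementation.

```python
import re

# Terms that the keyword analyzer put into the index for the example document.
INDEXED_TERMS = {"blue bear", "blue bear123"}


def keyword_analyze(text):
    """Toy keyword tokenizer + lowercase filter: one lowercased token."""
    return text.lower()


def search(query, search_mode="any"):
    """Return True if the example document would match the query."""
    # Step 1: the parser splits the string into phrase and term subqueries.
    pieces = re.findall(r'"[^"]*"|[^\s|+"]+', query)
    # Step 2: the analyzer runs on the text of each subquery separately.
    hits = [keyword_analyze(p.strip('"')) in INDEXED_TERMS for p in pieces]
    # Step 3: subquery results are combined according to searchMode.
    combine = any if search_mode == "any" else all
    return bool(hits) and combine(hits)


print(search('blue bear'))    # False: "blue" and "bear" are looked up separately
print(search('"Blue Bear"'))  # True: the phrase is analyzed as one term
```

Because the split happens before analysis, the keyword analyzer never gets the chance to see the unquoted two-word input as a whole.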

I hope this helps you understand the behavior you're observing. Note, you can test the output of your custom analyzer using the Analyze API.
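As a rough illustration, an Analyze API request could be assembled as shown below. The service name, index name, API key, and api-version value are placeholders or assumptions rather than values from the question, and the request is only built, not sent. Check the current REST reference for the exact URL and version before relying on it.

```python
import json
import urllib.request

SERVICE = "my-service"          # hypothetical search service name
INDEX = "products"              # hypothetical index name
API_VERSION = "2020-06-30"      # assumption: use a version your service supports
API_KEY = "<your-admin-api-key>"  # placeholder


def build_analyze_request(text, analyzer):
    """Build (but do not send) a POST request asking how `text` is analyzed."""
    url = (
        f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/analyze"
        f"?api-version={API_VERSION}"
    )
    body = json.dumps({"text": text, "analyzer": analyzer}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": API_KEY}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_analyze_request("Blue Bear", "keywordAnalyzer")
print(request.get_method(), request.full_url)
# To send it against a real service: urllib.request.urlopen(request)
```

With the custom analyzer from the question, the expectation is that the response lists a single token for the whole input, which is a quick way to confirm what ends up in the index.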
