How do I search within a word within a searchable field? "Contains" search

Question

I have a search index with 4 custom analyzers. Two of them are for language-specific searching, and the other two are for "exact" searching (no need for lemmatization). For simplicity, I am including only the configuration for the language-specific custom analyzers, although the overall solution needs to apply to all of the custom analyzers.

{
    "tokenizers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
            "name": "text_language_search_custom_analyzer_ms_tokenizer",
            "maxTokenLength": 300,
            "isSearchTokenizer": false,
            "language": "french"
        },
        {
            "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
            "name": "text_language_search_endsWith_custom_analyzer_ms_tokenizer",
            "maxTokenLength": 300,
            "isSearchTokenizer": false,
            "language": "french"
        }
    ],
    "analyzers": [
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "text_language_search_custom_analyzer",
            "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
            "tokenFilters": [
                "lowercase",
                "lang_text_synonym_token_filter",
                "asciifolding"
            ],
            "charFilters": [
                "html_strip"
            ]
        },
        {
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "name": "text_language_search_endsWith_custom_analyzer",
            "tokenizer": "text_language_search_endsWith_custom_analyzer_ms_tokenizer",
            "tokenFilters": [
                "lowercase",
                "lang_text_endsWith_synonym_token_filter",
                "asciifolding",
                "reverse"
            ],
            "charFilters": [
                "html_strip"
            ]
        }
    ]
}

For simplicity, let's assume the index has only 2 searchable fields:

- CategoryLangSearch (uses text_language_search_custom_analyzer)
- CategoryLangSearchEndsWith (uses text_language_search_endsWith_custom_analyzer)

Now assume the index has only 1 document, with the following:

- CategoryLangSearch field value of "TELECOMMUNICATIONS"
- CategoryLangSearchEndsWith field value of "TELECOMMUNICATIONS"

Our UI/API layer has logic so that if the user searches TELE*, it knows to use CategoryLangSearch as the field to search in. Likewise, our UI/API layer detects when the user searches with an asterisk wildcard at the front. So if the user searches for *TIONS, the UI/API layer is smart enough to search against the CategoryLangSearchEndsWith field instead.

All that is great... it works exactly as intended.

The problem, however, is what we can do if the user searches with *COMMU*, that is, an asterisk, then COMMU, then another asterisk.

I thought it would be "smart" to build the Azure Search parameter like this: (CategoryLangSearch:(COMMU*) OR CategoryLangSearchEndsWith:(*UMMOC)). In practice, however, I found that this does not find TELECOMMUNICATIONS ORGANIZATION, which makes perfect sense when I look at the query we build.

So, my question is: how do we pull this off? Can we pull it off in Azure Search in any way, shape, or form? I don't see a path to success for this one. The only possible solution I can see is the following:

1. The user searches for something...
2. We first query our MS SQL server directly, using the %something% syntax that SQL supports.
3. We find the IDs that match, and then use those to search against the Azure Search index.

Recommended answer

There are two ways you can issue a "contains" search in Azure Search.

  1. The first approach is to use a regular expression in the Lucene query syntax. In your example, if you issue the regex query /.*COMMU.*/, the search query is first expanded to all terms in the search index that contain the string 'commu', and the results are then found from those terms. You can issue the regex query against the field used for "exact" matches. The search query would look like: docs?search=exact_field:/.*COMMU.*/&queryType=full.
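As a rough illustration of the regex approach, the query string above could be assembled like this. This is only a sketch: the helper names are hypothetical, `exact_field` is the placeholder field name from the example query, and the user's fragment is inserted verbatim, so any regex metacharacters in it would need escaping first.

```python
from urllib.parse import parse_qs, urlencode


def build_contains_search(field: str, fragment: str) -> str:
    """Build a fielded Lucene regex expression matching terms that contain `fragment`."""
    # Assumption: `fragment` holds no regex metacharacters; escape them otherwise.
    return f"{field}:/.*{fragment}.*/"


def build_query_string(field: str, fragment: str) -> str:
    """URL-encode the parameters; queryType=full selects the full Lucene syntax."""
    params = {
        "search": build_contains_search(field, fragment),
        "queryType": "full",
    }
    return urlencode(params)


query_string = build_query_string("exact_field", "COMMU")
# Decoding the query string again recovers the raw search expression.
print(parse_qs(query_string)["search"][0])
```

The encoded string would be appended after `docs?` on the index's search URL.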

The approach above is recommended only if you have a small index, because the query expansion needed to find the queried pattern is costly, especially for broad searches like /.*a.*/. The second approach moves that work to indexing time by using an ngram token filter. The configuration for the token filter is shown below.

{
  "@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
  "name": "ngram_tokenfilter",
  "minGram": 1,
  "maxGram": 100
}

Given the text "hello", for example, this token filter generates the following ngram tokens:

h, e, l, l, o, he, el, ll, lo, hel, ell, ..., hello.
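To make the expansion concrete, here is a small stand-alone sketch that enumerates the same substrings. It is not the token filter's actual implementation, and the order in which the real filter emits tokens may differ; `ngrams` is a hypothetical helper whose defaults mirror the `minGram` and `maxGram` values in the configuration above.

```python
def ngrams(text: str, min_gram: int = 1, max_gram: int = 100) -> list[str]:
    """Return every substring of `text` whose length lies in [min_gram, max_gram]."""
    grams = []
    longest = min(max_gram, len(text))
    for size in range(min_gram, longest + 1):
        # Slide a window of `size` characters across the text.
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


tokens = ngrams("hello")
print(tokens)
print("ell" in tokens)  # the fragment is itself an indexed token

# A word of n characters yields n * (n + 1) / 2 grams when minGram is 1.
print(len(ngrams("telecommunications")))
```

With these settings, the 5-letter word produces 15 tokens and the 18-letter word "telecommunications" produces 171.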

When querying against the new field analyzed with the ngram token filter, you do not need a wildcard or regex operator; a regular term search is enough. The search query "docs?search=ell" will find the document containing the term "hello". This approach avoids the expensive expansion process because all the "contains" possibilities have already been computed and exist in the index. Please note that you need the ngram analysis at indexing time only, not at query time.
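One way to apply the ngram analysis at indexing time only is to give the field separate `indexAnalyzer` and `searchAnalyzer` settings. The fragment below is a sketch, written as Python dictionaries that serialize to the index JSON. The field name `CategoryLangSearchContains`, both analyzer names, and the choice of the `standard_v2` tokenizer are assumptions made for illustration, not part of the original index.

```python
import json

# Index-time analyzer: lowercases, then expands every token into ngrams.
index_analyzer = {
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "name": "contains_index_analyzer",   # hypothetical name
    "tokenizer": "standard_v2",          # assumed tokenizer choice
    "tokenFilters": ["lowercase", "ngram_tokenfilter"],
}

# Query-time analyzer: same chain without the ngram filter, so the
# search text "ell" stays a single term that matches an indexed gram.
search_analyzer = {
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "name": "contains_search_analyzer",  # hypothetical name
    "tokenizer": "standard_v2",
    "tokenFilters": ["lowercase"],
}

contains_field = {
    "name": "CategoryLangSearchContains",  # hypothetical field
    "type": "Edm.String",
    "searchable": True,
    "indexAnalyzer": index_analyzer["name"],
    "searchAnalyzer": search_analyzer["name"],
}

print(json.dumps(
    {"fields": [contains_field], "analyzers": [index_analyzer, search_analyzer]},
    indent=2,
))
```

The `ngram_tokenfilter` definition shown earlier would go in the index's `tokenFilters` section alongside these analyzers.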

Please also note that this ngram analysis increases the size of the index, because it produces many more tokens. You can use the parameters 'minGram' and 'maxGram' to control the size of the index.

Since you already have an API/UI that directs the search based on the positions of '*', the second option seems like a good approach.
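A routing layer of that kind might be extended roughly as follows. This is a hypothetical sketch rather than the asker's actual code: `CategoryLangSearchContains` is an assumed third field analyzed with the ngram token filter, and the suffix branch assumes the query term is reversed to match the reversed tokens stored in the EndsWith field.

```python
PREFIX_FIELD = "CategoryLangSearch"
SUFFIX_FIELD = "CategoryLangSearchEndsWith"
CONTAINS_FIELD = "CategoryLangSearchContains"  # hypothetical ngram-analyzed field


def route_wildcard(term: str) -> str:
    """Pick the field and query form based on where the asterisks sit."""
    leading = term.startswith("*")
    trailing = term.endswith("*")
    core = term.strip("*")
    if not core:
        raise ValueError("search term must contain more than wildcards")
    if leading and trailing:
        # Contains: plain term search against the ngram field, no wildcard.
        return f"{CONTAINS_FIELD}:{core}"
    if leading:
        # Ends-with: reverse the term and run it as a prefix query.
        return f"{SUFFIX_FIELD}:{core[::-1]}*"
    if trailing:
        # Starts-with: ordinary prefix query.
        return f"{PREFIX_FIELD}:{core}*"
    return f"{PREFIX_FIELD}:{core}"


for user_input in ("TELE*", "*TIONS", "*COMMU*"):
    print(user_input, "->", route_wildcard(user_input))
```

Only the contains branch depends on the new field; the prefix and suffix branches keep using the two existing fields.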

Nate
