How do I get French text FEMMES.COM to index as language variants of FEMMES?

Problem description

I need FEMMES.COM to be tokenized as the singular and plural forms of the base word FEMME.

The custom analyzer in my index definition:

"analyzers": [
  {
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "name": "text_language_search_custom_analyzer",
    "tokenizer": "text_language_search_custom_analyzer_ms_tokenizer",
    "tokenFilters": [ "lowercase", "asciifolding" ],
    "charFilters": [ "html_strip" ]
  }
],
"tokenizers": [
  {
    "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
    "name": "text_language_search_custom_analyzer_ms_tokenizer",
    "maxTokenLength": 300,
    "isSearchTokenizer": false,
    "language": "english"
  }
],
"tokenFilters": [],
"charFilters": []
}

Analyze request:

{ "analyzer": "text_language_search_custom_analyzer", "text": "FEMMES" }
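For readers who want to reproduce this test, here is a minimal Python sketch of how a request body like the one above could be packaged for the Analyze Text endpoint of the Azure Search REST API. The service name, index name, and key in the usage lines are placeholders, not values from the question. The `2016-09-01` API version is inferred from the `V2016_09_01` namespace in the responses. The helper name `build_analyze_request` is hypothetical.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, analyzer, text,
                          api_version="2016-09-01"):
    """Build (but do not send) a POST request for the Analyze Text endpoint.

    service, index and api_key are placeholders supplied by the caller;
    nothing here is taken from the original poster's environment.
    """
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    # Same shape as the request body shown in the question.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


if __name__ == "__main__":
    req = build_analyze_request(
        "my-service",   # placeholder service name
        "my-index",     # placeholder index name
        "<admin-key>",  # placeholder admin key
        "text_language_search_custom_analyzer",
        "FEMMES.COM",
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To actually call the service, pass req to urllib.request.urlopen(req)
    # with real credentials; that step is left out of this sketch.
```

Building the request separately from sending it makes it easy to inspect the URL and body before any network call is made.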

Response for FEMMES:

{
  "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 },
    { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }
  ]
}

Response for FEMMES.COM, where the stemmed form femme is missing:

{
  "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 },
    { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 }
  ]
}

The response I want for FEMMES.COM:

{
  "@odata.context": "https://one-adscope-search-eu-stage.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 },
    { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 },
    { "token": "com", "startOffset": 7, "endOffset": 10, "position": 1 }
  ]
}

Recommended answer

My previous answer was not correct. The Azure Search implementation actually applies the language tokenizer BEFORE token filters. This essentially made the WordDelimiterTokenFilter useless in my use case.

What I ended up doing was pre-processing the data BEFORE uploading it to Azure for indexing. In my C# code, I added some regex logic that breaks apart text like FEMMES2017 into FEMMES 2017 before sending it to Azure. This way, when the text gets to Azure, the indexer sees FEMMES by itself and the language tokenizer properly tokenizes it as FEMME and FEMMES.
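The original C# code is not shown, so the following is only a rough Python sketch of the kind of regex pre-processing described, not the author's implementation. It inserts a space at every letter/digit boundary, which covers the FEMMES2017 case. Replacing a dot between two letters (the FEMMES.COM case from the question) is an added assumption and was not part of the answer. The function name `preprocess` is hypothetical.

```python
import re

# Zero-width boundary between a letter and a digit, in either order.
# [^\W\d_] matches any Unicode letter (a word character that is neither
# a digit nor an underscore), so accented French letters are covered.
_LETTER_DIGIT = re.compile(r"(?<=[^\W\d_])(?=\d)|(?<=\d)(?=[^\W\d_])")

# Assumption (not in the original answer): a dot between two letters,
# as in FEMMES.COM, is also replaced by a space.
_DOT_BETWEEN_LETTERS = re.compile(r"(?<=[^\W\d_])\.(?=[^\W\d_])")


def preprocess(text: str) -> str:
    """Split glued-together words before the text is sent for indexing."""
    text = _DOT_BETWEEN_LETTERS.sub(" ", text)
    return _LETTER_DIGIT.sub(" ", text)


if __name__ == "__main__":
    print(preprocess("FEMMES2017"))  # FEMMES 2017
    print(preprocess("FEMMES.COM"))  # FEMMES COM
    print(preprocess("3.14"))        # 3.14 (dots between digits are untouched)
```

Whether the language tokenizer then emits both FEMME and FEMMES for the separated word still depends on the analyzer configuration, so it is worth checking the pre-processed text against the Analyze API.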
