Synonyms with concepts that have spaces, or are multiple words

Question

I don't know how to deal with synonyms that contain a space. I have the following config:

Solr config file

<fieldType ... >
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1"
            preserveOriginal="1"
            splitOnCaseChange="1"
            generateWordParts="1"
            generateNumberParts="1"
            catenateNumbers="1"
            catenateAll="1"
            />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30" side="front"/>
  </analyzer>
  <analyzer type="query">    
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LengthFilterFactory" min="2" max="70" />
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

My file: syn.txt

st., st => saint
istambul => istanbul
airport, apt => aéroport
NYC => New York
pt., pt => port
brussels => bruxelles

Everything works fine except for this synonym:

"NYC => New York"

I did some research and found the following:

Keep in mind that while the SynonymFilter will happily work with synonyms containing multiple words (i.e. "sea biscuit, sea biscit, seabiscuit"), the recommended approach for dealing with synonyms like this is to expand the synonym when indexing. This is because there are two potential issues that can arise at query time:

1. The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit, the analyzer will be given the words "sea" and "biscit" separately, and will not know that they match a synonym.

2. Phrase searching (i.e. "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect.

This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term.

For our example, the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)", which would not match the simple case of "seabiscuit" occurring in a document.

So I tried to change my config file and add my filters at index time, but it is not working.

Does anyone have any ideas?

Recommended answer

You are doing explicit mapping with =>.

From the Solr documentation:

Explicit mappings match any token sequence on the LHS of "=>" and replace with all alternatives on the RHS. These types of mappings ignore the expand parameter in the schema.

So I am guessing that if you search for NYC you get nothing back, since it got replaced with New York at index time.

Instead, can you try declaring them as equivalent synonyms, i.e. NYC, New York instead of NYC => New York?

Then I believe you can search for either of them and the result will be the same.
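
As a rough illustration of this suggestion, the index analyzer might look something like the sketch below. It reuses the class names and attributes from the config in the question. The position of the synonym filter in the chain and the rewritten syn.txt entry are assumptions, not something confirmed in this thread.

```xml
<!-- Sketch only. Assumes syn.txt declares the pair as equivalent synonyms:
         NYC, New York
     rather than the explicit mapping NYC => New York. -->
<analyzer type="index">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Placing the filter right after the tokenizer is an assumption.
       With expand="true", every entry on an equivalent-synonym line
       is expanded to all entries on that line. -->
  <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="true"/>
  <!-- The remaining index-time filters from the question
       (WordDelimiterFilterFactory, EdgeNGramFilterFactory) would follow here unchanged. -->
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Changes to the index analyzer only apply to documents indexed afterwards, so existing documents would need to be reindexed before the effect shows up in search results.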
