Solr Dismax handler - whitespace and special character behaviour
Question
I get strange results when there are special characters in my query.
Here is my request:
q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
The parsed query:
<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>
I get 17,000 results because Solr is doing an OR (it should be an AND).
I have no problem when I use a whitespace instead of the special character:
q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>
2,000 results for this query.
Here is my schema.xml (relevant parts):
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
I even tried a PatternTokenizerFactory to tokenize on whitespace and special characters, but no change...
My current workaround is to replace all special characters with whitespace before sending the query to Solr, but it is not satisfying.
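That client-side workaround can be sketched as follows (a minimal, hypothetical sketch; the exact set of "special" characters is an assumption, mirroring the character class used in the PatternReplaceCharFilterFactory attempt further down):

```python
import re

# Assumed set of special characters to normalize; mirrors the pattern
# ([,;./\'&-]) from the PatternReplaceCharFilterFactory attempt.
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]+")

def normalize_query(q: str) -> str:
    """Replace special characters with spaces and collapse whitespace runs,
    so dismax sees the same clauses as for a space-separated query."""
    return re.sub(r"\s+", " ", SPECIAL_CHARS.sub(" ", q)).strip()

# normalize_query("histoire-france") -> "histoire france"
```

This keeps mm=100% effective because dismax now sees two whitespace-separated clauses before analysis, but it duplicates analysis logic outside of Solr, which is why it feels unsatisfying.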
EDIT: Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters with whitespace, it doesn't work...
First line of analysis via the Solr admin, with verbose output, for the query 'histoire-france':
org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text histoire france
The '-' is replaced by ' ', then tokenized by the WhitespaceTokenizerFactory. However, I still get a different number of results for 'histoire-france' and 'histoire france'.
Am I missing something?
Answer
This is a bug: https://issues.apache.org/jira/browse/SOLR-3589
With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an OR query for "fire OR fly" is produced. This is particularly a problem for languages that do not use whitespace to separate words, such as Chinese or Japanese.
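The mismatch behind the bug can be illustrated with a toy sketch (not Solr code; the "analyzer" here is a hypothetical stand-in that, like WordDelimiterFilterFactory, splits on '-'): dismax counts clauses on whitespace *before* analysis, while the analyzer may later expand one clause into several tokens.

```python
def clause_count(q: str) -> int:
    """Clauses as dismax sees them: whitespace-separated terms, pre-analysis."""
    return len(q.split())

def token_count(q: str) -> int:
    """Tokens after a toy analyzer that also splits on '-'."""
    return len(q.replace("-", " ").split())

# 'histoire-france' is ONE clause that expands to TWO tokens, so mm=100%
# (computed over clauses) is effectively lost and the sub-tokens are ORed.
# 'histoire france' is TWO clauses, so mm=100% requires both to match.
```

This is why 'histoire-france' returned 17,000 OR-style results while 'histoire france' returned 2,000.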
It is fixed in Solr 4.1 (22 January 2013).