Challenge with hyphens/dashes in Solr Lucene
Problem description
I'm trying to get Solr to extract only the second, 7-digit portion of a ticket number formatted like n-nnnnnnn.
Originally I hoped to keep the full ticket together. According to the documentation, digits joined by hyphens should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case: Solr always generates two terms. So rather than getting large numbers of matches for the first digit of n-, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all"
maxBlockChars="20000"/>
will parse 1A1234567 fine, but

<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all"
maxBlockChars="20000"/>

will not parse 1-1234567.
So it looks like the problem is just the hyphen. I've tried - (escaped) and [-] and \u002D and \x{45} and \x045 without success.
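For what it's worth, both patterns behave as expected outside Solr. A quick sanity check with Java's regex engine (the same syntax family PatternReplaceCharFilterFactory uses) shows the regex itself handles the literal hyphen fine, which suggests the failure is somewhere in Solr's analysis chain rather than in the pattern. The class name below is just illustrative:

```java
import java.util.regex.Pattern;

// Sanity check: the char-filter patterns, tested directly against
// Java's regex engine. This exercises only the regex, not Solr.
public class TicketPatternCheck {
    public static void main(String[] args) {
        Pattern withA    = Pattern.compile("\\b\\d[A](\\d{7})\\b");
        Pattern withDash = Pattern.compile("\\b\\d[-](\\d{7})\\b");

        // Both extract the 7-digit portion in plain Java.
        System.out.println(withA.matcher("1A1234567").replaceAll("$1"));    // 1234567
        System.out.println(withDash.matcher("1-1234567").replaceAll("$1")); // 1234567
    }
}
```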
I've tried putting char filters around it:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>
with mapping.txt containing:

"-" => "z"

and mapping2.txt containing:

"z" => "-"
It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.
Has anyone had more success with hyphens/dashes in Solr/Lucene? Thanks.
Recommended answer
If your Solr is using a recent Lucene (3.x+, I think), you will want to use ClassicAnalyzer rather than StandardAnalyzer, as StandardAnalyzer now always treats hyphens as delimiters.
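In a Solr schema that would look something like the sketch below, using solr.ClassicTokenizerFactory (the classic tokenizer keeps hyphen-joined tokens together when the parts contain digits, e.g. 1-1234567). The field type name and the extra lowercase filter here are illustrative, not required:

```xml
<!-- Sketch only: fieldType name and filter chain are illustrative. -->
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- ClassicTokenizer (pre-3.1 StandardTokenizer behavior) keeps
         digit-containing hyphenated tokens like 1-1234567 whole. -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```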