Challenge with hyphens/dashes in Solr Lucene


Question

I'm trying to get Solr to extract only the second, 7-digit portion of a ticket number formatted like n-nnnnnnn.

Originally I hoped to keep the full ticket number together. According to the documentation, hyphenated tokens containing numbers should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case: Solr always generates two terms. So rather than getting large numbers of matches on the single leading digit of n-, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:

    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" 
      maxBlockChars="20000"/>

This parses 1A1234567 fine, but the same charFilter with a hyphen in place of the A in the pattern will not parse 1-1234567.

So it looks like the problem is just the hyphen. I've tried an escaped hyphen, [-], \u002D, \x{45}, and \x045, all without success.

I've tried putting char filters around it:

    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
      pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>

with the mapping:

    "-" => "z"

and then:

    "z" => "-"

It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.

Has anyone had more success with hyphens/dashes in Solr/Lucene? Thanks.

Recommended answer

If your Solr is using a recent Lucene (3.x+, I think), you will want to use ClassicAnalyzer rather than StandardAnalyzer, as StandardAnalyzer now always treats hyphens as delimiters.
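
In a Solr schema, the closest equivalent to that advice is probably a field type built on solr.ClassicTokenizerFactory. The fragment below is only a sketch under assumptions: the field type name text_ticket is hypothetical, and the PatternReplaceFilterFactory step is an optional extra (not part of the answer) for the case where only the 7-digit portion should be indexed. The classic tokenizer is expected to keep a hyphenated token that contains digits, such as 1-1234567, as a single term, but that should be confirmed against the Solr version in use.

    <!-- Sketch only: "text_ticket" is a hypothetical field type name. -->
    <fieldType name="text_ticket" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- The classic tokenizer is expected to keep a hyphenated token
             containing digits (e.g. 1-1234567) together as one term. -->
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <!-- Optional: reduce a whole n-nnnnnnn token to its 7-digit part.
             Tokens that do not match the pattern pass through unchanged. -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^\d-(\d{7})$" replacement="$1" replace="all"/>
      </analyzer>
    </fieldType>

Running 1-1234567 through the Analysis screen of the Solr admin UI with this field type shows the output of each stage, which makes it easy to see whether the tokenizer or the filter is the step that drops or splits the hyphenated token.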
