Solr: Can't search for numbers mixed with characters
Question
I have some items in my index (Solr 4.4) with names like Foobar 135g, where 135g refers to a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.
I analysed the query in the "Analysis" panel of the Solr admin interface. Everything looks good there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the matching tokens).
But there has to be a problem with the way I process the strings at index time and/or query time. This is the field definition I'm using:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
<filter class="solr.ReverseStringFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I'm using the two ReverseStringFilterFactory filters together with the two EdgeNGramFilterFactory filters so that I can search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option. But that option doesn't do anything with tokens that contain numbers (am I right?).
After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) I found generateNumberParts, whose default is 1. This leads to 135g being split into 135 and g. But as long as the preserveOriginal option is enabled, 135g is also indexed as a whole string. The Analysis panel of the admin interface shows this as well.
Does anybody know which filter or tokenizer is causing this issue?
Update
I've found out something interesting. When I debug the query for the search term 135g, I get the following debug output:
<lst name="debug">
<str name="rawquerystring">name_texts:135g</str>
<str name="querystring">name_texts:135g</str>
<str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
<str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
...
</lst>
I understand that the string gets split into these parts because of the solr.WordDelimiterFilterFactory mentioned earlier. But why is Solr converting it into a MultiPhraseQuery? I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least be joined to the other tokens with OR).
Please, can someone clear this up for me? I'm kind of confused ;) How can I avoid this?
Recommended answer
It is the WordDelimiterFilterFactory. You should be able to see it in your admin panel under Analysis. To stop it from splitting at the letter/number boundary, add the attribute splitOnNumerics="0".
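As a rough sketch of that suggestion, the filter line from the question could look like the fragment below. Only splitOnNumerics="0" is new; the other attributes are copied from the question's field type. Placing the same line in both the index and the query analyzer, and reindexing afterwards so that already indexed documents pick up the new index-time analysis, are assumptions about how one would apply it, not something stated in the original answer.

```xml
<!-- Sketch only: the question's WordDelimiterFilterFactory line with splitOnNumerics="0" added. -->
<!-- Assumption: the same line is used in both <analyzer type="index"> and <analyzer type="query">. -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="0"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"/>
```

With this setting a token such as 135g should no longer be broken into 135 and g at the digit/letter transition, which can be checked in the Analysis panel before reindexing.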
Update:
Read more about it here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
solr.WordDelimiterFilterFactory
Creates solr.analysis.WordDelimiterFilter.
Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:
splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]: "j2se" => "j" "2" "se". The default is true ("1"); set it to "0" to turn it off.
Update 2
Based on your latest comment, I now understand what you meant. I took your field type definition, indexed your sentence on Solr 4.5.1, and was able to search for test_mytext:"foobar 135g", test_mytext:foobar 135g, test_mytext:foobar, test_mytext:135g and test_mytext:135, where test_mytext is a field of the type you defined in your question above. So I do not know why you are unable to find it in your own index. Make sure your field is defined something like this: <field name="text" type="mytext" indexed="true" stored="true"/>
Update 3
Here is my debug log with your field definition. I am not sure why you are seeing completely different processing.
Query => test_mytext:135g
"debug": {
  "rawquerystring": "test_mytext:135g",
  "querystring": "test_mytext:135g",
  "parsedquery": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "parsedquery_toString": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "explain": {
    "200": "\n0.8563627 = (MATCH) product of:\n 1.141817 = (MATCH) sum of:\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.4336574 = (MATCH) weight(test_mytext:135 in 1) [DefaultSimilarity], result of:\n 0.4336574 = score(doc=1,freq=3.0 = termFreq=3.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.94313055 = fieldWeight in 1, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.75 = coord(3/4)\n"
  },
I am using Solr 4.5.1.
Update 4
Then I noticed that you are using Solr 4.4.0. I took your exact field definition and phrase and ran a query on that version, and it finds your result.
Query => name_texts:"135g"
Result:
<result name="response" numFound="1" start="0">
<doc>
<str name="id">100</str>
<str name="name_texts">Foobar 135g</str>
<long name="_version_">1456487722571005952</long></doc>
</result>
<lst name="debug">
<str name="rawquerystring">name_texts:"135g"</str>
<str name="querystring">name_texts:"135g"</str>
<str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
<str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
Your processing looks correct, and it finds the result in my instance. At first I thought there was something extra in your definition, but it does not seem to cause a problem in my local instance. The best way to track down issues like this is to use the admin Analysis page and debug queries, which you are already doing. I cannot think of anything else, as I am unable to reproduce the problem. Do yourself a favor: take a clean instance of Solr, change only schema.xml to add your field definition, and index just this document through the admin panel (Documents) => {"id":"100","name_texts":"Foobar 135g"}. Then run this query: http://localhost:8983/solr/collection1/select?q=name_texts%3A%22135g%22&wt=xml&indent=true&debugQuery=true
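If XML is more convenient than JSON for the reproduction, the same test document can be written as a Solr XML update message. This is only a sketch: the field names and values are taken from the JSON document above, while posting it to the collection1 core's /update handler with a commit (or choosing the XML document type in the admin Documents screen) is an assumption about the setup, not part of the original answer.

```xml
<!-- Sketch only: XML equivalent of {"id":"100","name_texts":"Foobar 135g"}. -->
<!-- Assumption: sent to http://localhost:8983/solr/collection1/update?commit=true -->
<add>
  <doc>
    <field name="id">100</field>
    <field name="name_texts">Foobar 135g</field>
  </doc>
</add>
```

After the commit, the debug query for name_texts:"135g" shown above can be run against this single document to compare the parsed query with the one from the failing index.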