Solr: Can't search for numbers mixed with characters

Problem description

I have some items in my index (Solr 4.4) with names like Foobar 135g, where 135g refers to a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.

I analysed the query on the "Analysis" page of the Solr admin panel. Everything looks good there: the field is indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens).

But there has to be an issue with the way I process the strings at index and/or query time. This is the field definition I'm using:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I'm using the two ReverseStringFilterFactory filters together with the two EdgeNGramFilterFactory filters so that I can search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option. But that option doesn't do anything with numbers (am I right?).

After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) I found generateNumberParts, which defaults to 1. This leads to 135g being split into 135 and g. But as long as the preserveOriginal option is enabled, 135g is also indexed as a whole string. This is also shown in the Analysis panel of the admin interface.
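For reference, here is a sketch of the question's WordDelimiterFilterFactory line with the unset options written out at the defaults documented on the Solr wiki. It is an illustration, not part of the original schema; the explicit attributes should behave the same as leaving them out.

```xml
<!-- Same filter as in the field definition above. The first three attributes
     are the ones the question sets explicitly; the remaining ones are the
     documented defaults, spelled out for readability. -->
<filter class="solr.WordDelimiterFilterFactory"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"
        generateWordParts="1"
        generateNumberParts="1"
        catenateNumbers="0"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        stemEnglishPossessive="1"/>
```

With these settings, 135g should yield the preserved original 135g, the parts 135 and g, and the catenated form 135g. That matches the four terms in the parsed query shown in the debug output further down.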

Does anybody know which filter or tokenizer is causing this issue?

Update

I've found out something interesting. When I debug the query for the search 135g, I get the following debug output:

<lst name="debug">
  <str name="rawquerystring">name_texts:135g</str>
  <str name="querystring">name_texts:135g</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  ...
</lst>

I understand that, because of the solr.WordDelimiterFilterFactory mentioned earlier, the string gets split into these parts. But why is Solr converting it into a MultiPhraseQuery? I'm a little bit confused right now; I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least an OR between the tokens).

Please, could someone clear this up for me? I'm kind of confused ;) How can I avoid this?

Recommended answer

It is the WordDelimiterFilterFactory. You should be able to see it in your admin panel under Analysis. To prevent the split, add splitOnNumerics="0" as an attribute.
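As a sketch of what that change might look like, here is the WordDelimiterFilterFactory line from the question with splitOnNumerics="0" added. Applying it to both the index and the query analyzer, and reindexing afterwards, is an assumption here rather than something the answer states.

```xml
<!-- Hypothetical edit: the question's filter line plus splitOnNumerics="0",
     so that letter/digit transitions such as 135|g no longer start a new part. -->
<filter class="solr.WordDelimiterFilterFactory"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"
        splitOnNumerics="0"/>
```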

Update:

Read more about it here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

solr.WordDelimiterFilterFactory

Creates solr.analysis.WordDelimiterFilter.

Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:

splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]: "j2se" => "j" "2" "se". Default is true ("1"); set to 0 to turn off.

Update 2

Based on your latest comment, I now understand what you meant. I took your field type definition, indexed your sentence on Solr 4.5.1, and was able to search for test_mytext:"foobar 135g", test_mytext:foobar 135g, test_mytext:foobar, test_mytext:135g and test_mytext:135, where test_mytext is of the type you defined in your question above. So I do not know why you are unable to find it in your own index. Make sure your field is defined something like this: <field name="text" type="mytext" indexed="true" stored="true"/>
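A minimal sketch of how such a field declaration might pair with the field type, assuming the type is registered as mytext and the test field is called test_mytext, as in the queries above. The analyzer chains are elided; they would be the ones from the question.

```xml
<!-- Assumed names: "mytext" and "test_mytext" are taken from this answer's
     queries; the question itself calls the type "text". -->
<fieldType name="mytext" class="solr.TextField" omitNorms="false">
  <!-- index and query analyzers as defined in the question go here -->
</fieldType>

<!-- The type attribute must match the fieldType name above. -->
<field name="test_mytext" type="mytext" indexed="true" stored="true"/>
```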

Update 3

Here is my debug log with your field definition. I'm not sure why you are seeing completely different processing:

Query => test_mytext:135g

"debug": {
  "rawquerystring": "test_mytext:135g",
  "querystring": "test_mytext:135g",
  "parsedquery": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "parsedquery_toString": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "explain": {
    "200": "\n0.8563627 = (MATCH) product of:\n 1.141817 = (MATCH) sum of:\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.4336574 = (MATCH) weight(test_mytext:135 in 1) [DefaultSimilarity], result of:\n 0.4336574 = score(doc=1,freq=3.0 = termFreq=3.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.94313055 = fieldWeight in 1, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.75 = coord(3/4)\n"
  },

I am using Solr 4.5.1.

Update 4

Then I noticed that you are using Solr 4.4.0. I took your exact field definition and phrase and ran a query, and it finds your result.

Query => name_texts:"135g"

Result:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">100</str>
    <str name="name_texts">Foobar 135g</str>
    <long name="_version_">1456487722571005952</long>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">name_texts:"135g"</str>
  <str name="querystring">name_texts:"135g"</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>

Your processing looks correct, and it finds the result in my instance. I first thought you had something extra in the definition, but it does not seem to cause an issue in my local instance. The best way to track down issues like this is the admin Analysis page and debug queries, which you are already using. I cannot think of anything else, as I am unable to reproduce the problem. Do yourself a favor: take a clean instance of Solr, change only schema.xml to add your field definition, and index just this document through the admin panel (Documents) => {"id":"100","name_texts":"Foobar 135g"}. Then run this query: http://localhost:8983/solr/collection1/select?q=name_texts%3A%22135g%22&wt=xml&indent=true&debugQuery=true
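One setting that may be worth comparing between the two instances is autoGeneratePhraseQueries on the field type. It is not raised anywhere in the thread, so treat it as an assumption. When a single unquoted query chunk such as 135g is analyzed into several tokens, this attribute controls whether Solr builds a phrase query from them (consistent with the MultiPhraseQuery in the question) or combines them as separate optional terms (consistent with the plain term list in the answerer's own debug log). Its default depends on the schema version, so two otherwise identical field types can parse the same query differently. A sketch of setting it explicitly:

```xml
<!-- Hypothetical change: autoGeneratePhraseQueries is set explicitly so the
     behaviour no longer depends on the schema version's default. -->
<fieldType name="text" class="solr.TextField" omitNorms="false"
           autoGeneratePhraseQueries="false">
  <!-- index and query analyzers unchanged from the question -->
</fieldType>
```

After a change like this, rerunning the query with debugQuery=true should show whether parsedquery is still a MultiPhraseQuery.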
