Search in Solr with special characters


Problem description

I have a problem searching for special characters in Solr. My documents have a field "Title", and sometimes its value looks like "Titanic - 1999" (it contains the character "-"). When I try to search in Solr for "-", I receive a 400 error. I've tried to escape the character, so I tried something like "-" and "\-". With those changes Solr no longer responds with an error, but it returns 0 results.

How can I search in the Solr admin for such a special character (something like "-" or "'")?

Regards

UPDATE: Here you can see my current Solr schema: https://gist.github.com/cpalomaresbazuca/6269375

My search is on the field "Title".

Excerpt from the schema.xml:

 ...
 <!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases.  At query time only, it
     also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>

Solution

You are using the standard text_general field for the title attribute. This might not be a good choice. text_general is meant to be for huge chunks of text (or at least sentences) and not so much for exact matching of names or titles.

The problem here is that text_general uses the StandardTokenizerFactory.

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>

StandardTokenizerFactory does the following:

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types.

This means the '-' character is treated as a token separator: the string is split at it, and the '-' itself is not indexed.

"kong-fu" will be represented as "kong" and "fu". The '-' disappears.

This also explains why select?q=title:\- won't work here: the query goes through the same tokenizer, so the '-' is discarded and nothing is left to match.

Choose a better fitting field type:

Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, which only splits on whitespace and therefore keeps characters such as '-' in the tokens, allowing exact matching of words. So making your own field type for the title attribute would be a solution.
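
As a rough sketch of what such a custom field type might look like (the name text_title is hypothetical, and the lowercase filter is an assumption carried over from text_general to keep matching case-insensitive; adapt both to your own schema):

```xml
<!-- Hypothetical field type name "text_title"; not part of the original schema. -->
<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <!-- Splits on whitespace only, so the "-" in "Titanic - 1999" is kept as its own token. -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Optional assumption: lowercase tokens, as text_general does. -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!-- The existing Title field, switched to the new type. -->
<field name="Title" type="text_title" indexed="true" stored="true"/>
```

Existing documents would need to be reindexed after a change like this, since the tokens already in the index were produced by the old analyzer. After that, an escaped query such as select?q=Title:\- should be able to match titles that contain a standalone '-'.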

Solr also has a minimal field type called text_ws. Depending on your requirements this might be enough.
