Search with various combinations of space, hyphen, casing and punctuations


Problem description

My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Combinations that I want to work:

"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"

Given any one of these strings, I want to find all of the others.

So, there are 25 such combinations as given below:

(First column denotes input text for search, second column denotes expected match)

(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)

Current limitations with my schema:

1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart"  -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart"  -> "Walmart"

Screenshot of the analyzer:

I tried various combinations of filters to resolve these limitations, and I stumbled upon the solution provided at: Solr - case-insensitive search do not work

While it seems to overcome one of my limitations (see #5, WalMart -> Walmart), it is overall worse than what I had earlier. It now fails for cases like:

(Wal Mart,WalMart),
(Wal-Mart,WalMart),
(Wal-mart,WalMart),
(WalMart,Wal Mart)
in addition to cases 1 to 4 mentioned above.

Analyzer after schema change:

Questions:

  1. Why does "WalMart" not match "Walmart" with my initial schema? The Solr analyzer clearly shows that it produced 3 tokens at index time: wal, mart, walmart. At query time it produced 1 token: walmart (it is not clear to me why it would produce just 1 token). I fail to understand why there is no match, given that walmart is present in both the query tokens and the index tokens.

  2. The problem I describe here is just a single use case. There are slightly more complex ones, such as:

    Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"

    Words with different punctuation: "Mc-Donald Engineering Company, Inc."

In general, what is the best way to go about modeling the schema for this kind of requirement? NGrams? Indexing the same data in different fields (in different formats) with the copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields)? What are the performance implications of this?
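
For readers unfamiliar with the copyField option mentioned above, here is a rough sketch of what it could look like. It is not taken from the question or its answer: the field names `name` and `name_collapsed` and the field type `text_collapsed` are hypothetical, and the idea of collapsing the whole value into a single normalized token is only one possible reading of "index the same data in different formats".

```xml
<!-- Hypothetical field type: reduces the whole value to one normalized token,
     so "Wal-Mart", "Wal Mart" and "WalMart" would all become "walmart". -->
<fieldType name="text_collapsed" class="solr.TextField">
  <analyzer>
    <!-- Keep the entire input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Drop everything that is not an ASCII letter or digit
         (spaces, hyphens, apostrophes, commas, periods) -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^A-Za-z0-9]" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: the original text field plus a collapsed copy -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_collapsed" type="text_collapsed" indexed="true" stored="false"/>
<copyField source="name" dest="name_collapsed"/>
```

Two caveats on this sketch: it only matches when the query covers the entire field value, and a multi-word query such as Wal Mart would likely need to be sent as a quoted phrase so that the query parser hands the whole string to the analyzer instead of splitting it on whitespace first. The extra field also increases index size, which bears on the performance question.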

EDIT: The default operator in my Solr schema is AND. I cannot change it to OR.

Solution

Upgrading the Lucene version (4.4 to 4.10) in solrconfig.xml fixed the problem magically! I no longer have any of the limitations, and my query analyzer behaves as expected too.
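
The answer does not name the setting, but the Lucene version in solrconfig.xml is normally controlled by the `luceneMatchVersion` element, so the change was presumably along these lines. Treat the exact value as an assumption: the accepted spelling (for example `4.10`, `4.10.0` or `LUCENE_4_10_0`) depends on the Solr release in use.

```xml
<config>
  <!-- Assumed previous value: 4.4 (older configs may spell it LUCENE_44) -->
  <luceneMatchVersion>4.10</luceneMatchVersion>
  <!-- rest of solrconfig.xml unchanged -->
</config>
```

Because this setting can change how analyzers tokenize text, it is usually advisable to reindex after changing it so that index-time and query-time tokens stay consistent.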
