Search with various combinations of space, hyphen, casing and punctuation


Question

My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Combinations that I want to work:

"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"

Given any one of these strings as the search input, I want to find all of the others.

So, there are 25 such combinations, as given below:

(The first column denotes the input text for the search, the second column denotes the expected match.)

(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)

Current limitations with my schema (these searches do not match):

1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart"  -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart"  -> "Walmart"

[Screenshot of the analyzer]

I tried various combinations of filters to resolve these limitations, and stumbled upon the solution provided at: Solr - case-insensitive search do not work

While it seems to overcome one of the limitations that I have (see #5, WalMart -> Walmart), it is overall worse than what I had earlier. Now it does not work for cases like:

(Wal Mart,WalMart), 
(Wal-Mart,WalMart), 
(Wal-mart,WalMart), 
(WalMart,Wal Mart)
besides cases 1 to 4 as mentioned above

[Screenshot of the analyzer after the schema change]

Questions:

  1. Why does "WalMart" not match "Walmart" with my initial schema? The Solr analyzer clearly shows that it produced 3 tokens at index time: wal, mart, walmart. At query time it produced 1 token: walmart (it is not clear why it would produce just 1 token). I fail to understand why it does not match, given that walmart is contained in both the query tokens and the index tokens.

The problem that I mentioned here is just a single use case. There are slightly more complex ones, like:

Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"

Words with different punctuation: "Mc-Donald Engineering Company, Inc."

In general, what is the best way to model the schema for this kind of requirement? NGrams? Indexing the same data in different fields (in different formats) using the copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields)? What are the performance implications of this?
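For illustration, the copyField idea mentioned above could look roughly like the following. This is only a sketch under assumptions: the field names `name` and `name_compact` and the field type `text_compact` are hypothetical and do not come from the original schema. The second field lowercases the whole value and strips everything that is not a letter or digit, so "Wal-Mart", "Wal Mart" and "WalMart" would all be indexed as the single token walmart.

```xml
<!-- Hypothetical field and type names, not part of the original schema -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_compact" type="text_compact" indexed="true" stored="false"/>

<!-- Index the same source value a second time in a different format -->
<copyField source="name" dest="name_compact"/>

<fieldType name="text_compact" class="solr.TextField">
  <analyzer>
    <!-- Keep the whole value as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drop spaces, hyphens, apostrophes and other punctuation -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-z0-9]" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

Two caveats apply to this sketch. Because the whole value collapses into one token, it only matches when the query covers the entire value, so it suits short names better than long ones such as "Mc-Donald Engineering Company, Inc.". Also, a multi-word input such as Wal Mart would probably need to be sent as a quoted phrase so that the query parser hands it to the analyzer in one piece instead of splitting it on whitespace first.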

The default operator in my Solr schema is AND. I cannot change it to OR.

Recommended answer

Upgrading the Lucene version (4.4 to 4.10) in solrconfig.xml fixed the problem magically! I no longer have any of the limitations, and my query analyzer behaves as expected too.
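The answer does not show the exact line that was changed. Assuming it refers to the `luceneMatchVersion` element, which is the solrconfig.xml setting that selects version-dependent analysis behavior, the change might look roughly like this. The version strings below are an assumption, since the answer only says "4.4 to 4.10".

```xml
<!-- solrconfig.xml: assumed before/after values; the answer does not give them -->
<!-- before: <luceneMatchVersion>4.4</luceneMatchVersion> -->
<luceneMatchVersion>4.10.0</luceneMatchVersion>
```

This presumes the installed Solr release is itself 4.10 or later. Since the setting can change how text is tokenized, reindexing after the change is usually advisable so that index-time and query-time tokens stay consistent.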
