Search with various combinations of space, hyphen, casing and punctuations

Views: 23 | Published: 2021/12/30 8:33:12 | Tags: solr lucene string-matching solrj textmatching

Problem description

My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Combinations that I want to work:

"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"

Given any one of these strings as the search input, I want to match all five variants.

So, there are 25 such combinations as given below:

(First column denotes input text for search, second column denotes expected match)

(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)

Current limitations with my schema:

1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart"  -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart"  -> "Walmart"

[Screenshot of the analyzer]

I tried various combinations of filters to resolve these limitations, and stumbled upon the solution provided at: Solr - case-insensitive search do not work

While it seems to overcome one of the limitations that I have (see #5 WalMart -> Walmart), it is overall worse than what I had earlier. Now it does not work for cases like:

(Wal Mart,WalMart)
(Wal-Mart,WalMart)
(Wal-mart,WalMart)
(WalMart,Wal Mart)

in addition to cases 1 to 4 listed above.

[Screenshot of the analyzer after the schema change]

Questions:

1. Why does "WalMart" not match "Walmart" with my initial schema? The Solr analyzer clearly shows that three tokens were produced at index time: wal, mart, walmart. At query time only one token was produced: walmart (it is not clear to me why there is just one). Since walmart appears among both the query tokens and the index tokens, I fail to understand why there is no match.

2. The problem described here is just a single use case. There are slightly more complex ones, such as:

   Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"

   Words with different punctuation: "Mc-Donald Engineering Company, Inc."

3. In general, what is the best way to go about modeling the schema for this kind of requirement? NGrams? Indexing the same data in different fields (in different formats) with the copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields)? What are the performance implications of this?

EDIT: The default operator in my Solr schema is AND. I cannot change it to OR.
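
One way to picture the copyField option raised in the questions above is a second, aggressively normalized field next to the existing `text` field. The fragment below is only a sketch under stated assumptions, not the fix that was eventually adopted: the field names `name` and `name_norm` and the type name `text_norm` are hypothetical, and it assumes the whole value (for example a company name) should match as a single unit. Lowercasing and then stripping every non-alphanumeric character should reduce all five spellings to the single token `walmart`, which also sidesteps the AND default operator because only one term is left to match.

```xml
<!-- Hypothetical field type: the whole value becomes one normalized token -->
<fieldType name="text_norm" class="solr.TextField">
  <analyzer>
    <!-- Keep the entire input as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove spaces, hyphens, apostrophes and other punctuation -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="[^a-z0-9]" replacement="" replace="all"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: "name" keeps the original analysis, "name_norm" the normalized copy -->
<field name="name" type="text" indexed="true" stored="true"/>
<field name="name_norm" type="text_norm" indexed="true" stored="false"/>
<copyField source="name" dest="name_norm"/>
```

In the Solr 4.x line the standard query parser splits on whitespace before analysis, so a query such as Wal Mart would probably need to be sent as a quoted phrase (`name_norm:"Wal Mart"`) for the analyzer to see both words together. The extra field costs additional index space and indexing time, and it would not match a name embedded in longer text.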

Solution

Upgrading the Lucene version in solrconfig.xml (from 4.4 to 4.10) fixed the problem. None of the limitations listed above remain, and the query analyzer now behaves as expected too.
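
The answer does not name the setting, but the Lucene version that solrconfig.xml exposes is the `luceneMatchVersion` element, so the change was presumably along the following lines. The exact value strings are assumptions; check the format accepted by the installed Solr release.

```xml
<!-- solrconfig.xml (fragment) -->
<!-- Before (assumed): <luceneMatchVersion>LUCENE_44</luceneMatchVersion> -->
<luceneMatchVersion>4.10.0</luceneMatchVersion>
```

Because this setting can change how analyzers tokenize text, reindexing after the change is generally advisable so that index-time and query-time analysis agree.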
