Search stem and exact words in Lucene 4.4.0


Question

I've stored a Lucene document with a single TextField that contains unstemmed words.

I need to implement a search program that lets users search for both stemmed words and exact words, but because I stored the words without stemming, a stem search cannot be done. Is there a way to search for exact words and/or stemmed words in documents without storing two fields?

Thanks.

Recommended answer

Indexing two separate fields seems like the right approach to me.

Stemmed and unstemmed text require different analysis strategies, so you need to give the QueryParser a different Analyzer for each. Lucene doesn't really support indexing text in the same field with different analyzers; that is by design. Furthermore, duplicating the text in the same field could have some fairly strange scoring effects (in particular, terms that the stemmer leaves untouched would be scored more heavily).

There is no need to store the text in each of these fields, but it only makes sense to index it in separate fields.

By the way, you can apply a different analyzer to each field by using a PerFieldAnalyzerWrapper, like this:

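// Map each field name to the analyzer that should handle it.
Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));    // stems tokens
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44)); // keeps exact tokens
// Fields not listed in the map fall back to the default StandardAnalyzer.
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);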
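
To make the two-field idea concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It only models two inverted indexes named after the fields above. The class name, the `stem` helper, and its suffix rules are hypothetical stand-ins for what the analyzers would do, not anything from Lucene.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "index the same text into two fields"; NOT Lucene code.
class TwoFieldIndexSketch {
    // term -> ids of the documents containing it, one map per "field"
    private final Map<String, Set<Integer>> unstemmedText = new HashMap<>();
    private final Map<String, Set<Integer>> stemmedText = new HashMap<>();

    // Hypothetical, deliberately naive stemmer (assumption for illustration only).
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Index every token twice: as-is in one field, stemmed in the other.
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            unstemmedText.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            stemmedText.computeIfAbsent(stem(token), k -> new TreeSet<>()).add(docId);
        }
    }

    // Exact search looks only at the unstemmed field.
    Set<Integer> searchExact(String word) {
        return unstemmedText.getOrDefault(word.toLowerCase(), Collections.<Integer>emptySet());
    }

    // Stem search stems the query term and looks only at the stemmed field.
    Set<Integer> searchStemmed(String word) {
        return stemmedText.getOrDefault(stem(word.toLowerCase()), Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        TwoFieldIndexSketch index = new TwoFieldIndexSketch();
        index.add(1, "searching words");
        index.add(2, "search word");
        System.out.println("exact 'word':    " + index.searchExact("word"));
        System.out.println("stemmed 'words': " + index.searchStemmed("words"));
    }
}
```

The point of the sketch is that each kind of query simply targets its own field, so neither search has to compromise on analysis.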


I can see a couple of ways to do it in a single field, though, if you really want to.

One would be to create your own stem filter, based on (or possibly extending) the one you already wish to use, and add the ability to keep the original tokens after stemming. Mind your position increments in this case. Phrase queries and the like may be problematic.
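
As a rough illustration of what "mind your position increments" means, the following plain-Java sketch (again not Lucene code) emits each original token with a position increment of 1 and stacks its stem on the same position with an increment of 0. The `Token` class, the `analyze` method, and the naive `stem` rules are all hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a stem filter that keeps the original token; NOT Lucene code.
class KeepOriginalStemSketch {
    // A term plus its position increment (0 = same position as the previous token).
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical, deliberately naive stemmer (assumption for illustration only).
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.add(new Token(word, 1));          // the original advances the position
            String stemmed = stem(word);
            if (!stemmed.equals(word)) {
                out.add(new Token(stemmed, 0));   // the stem shares the original's position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [searching(+1), search(+0), exact(+1), words(+1), word(+0)]
        System.out.println(analyze("Searching exact words"));
    }
}
```

If the stem were emitted with an increment of 1 instead, every later token would shift by one position, which is the kind of thing that breaks phrase queries.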

The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, this time after stemming it manually. Two fields added with the same name are effectively concatenated. You'd want to store the text in a separate field in this case. Expect wonky scoring.
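
To see where the wonky scoring could come from, this small plain-Java sketch (not Lucene code) counts term frequencies after the text has been added to one field twice, once as-is and once stemmed. The class, the method names, and the naive `stem` rules are hypothetical and only serve the illustration. Real Lucene scoring involves more than raw term frequency.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of adding original and stemmed text to the SAME field; NOT Lucene code.
class SameFieldDuplicationSketch {
    // Hypothetical, deliberately naive stemmer (assumption for illustration only).
    static String stem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Term frequencies of one field after both passes have been concatenated.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new TreeMap<>();
        String[] words = text.toLowerCase().split("\\s+");
        for (String word : words) {               // first pass: original text
            if (!word.isEmpty()) tf.merge(word, 1, Integer::sum);
        }
        for (String word : words) {               // second pass: manually stemmed copy
            if (!word.isEmpty()) tf.merge(stem(word), 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // prints {lucene=2, word=1, words=1}
        System.out.println(termFrequencies("lucene words"));
    }
}
```

A term the stemmer leaves alone ("lucene") is counted twice, while a term it changes ("words") is split across two entries, so unchanged terms end up weighted more heavily.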

Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.
