Using a Combination of Wildcards and Stemming


Problem description

I'm using a snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks.

Example:

A search for "valv", "valve", or "valves" returns the same number of results. This makes sense since the snowball analyzer reduces everything down to "valv".

I run into problems when using a wildcard. A search for "valve*" or "valves*" does not return any results. Searching for "valv*" works as expected.

I understand why this is happening, but I don't know how to fix it.
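The likely cause can be sketched with a toy model: the index holds only stemmed terms, while a wildcard term is typically matched against those terms without being stemmed itself. The class below uses plain Java collections, not the Lucene API, and `toyStem` is a hypothetical stand-in for the Snowball stemmer that only handles the endings in this example.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class WildcardMismatchDemo {
    // Toy stand-in for a stemmer (NOT the Snowball algorithm): strips "es", "s" or "e".
    static String toyStem(String term) {
        String t = term.toLowerCase();
        if (t.endsWith("es")) return t.substring(0, t.length() - 2);
        if (t.endsWith("s") || t.endsWith("e")) return t.substring(0, t.length() - 1);
        return t;
    }

    // Build the set of indexed terms: every word of every title, stemmed.
    static Set<String> indexTerms(List<String> titles) {
        Set<String> terms = new TreeSet<>();
        for (String title : titles) {
            for (String word : title.split("\\s+")) {
                if (!word.isEmpty()) terms.add(toyStem(word));
            }
        }
        return terms;
    }

    // Trailing-wildcard match: the prefix is lowercased but deliberately not stemmed.
    static boolean prefixMatches(Set<String> terms, String wildcardQuery) {
        String prefix = wildcardQuery.replace("*", "").toLowerCase();
        for (String term : terms) {
            if (term.startsWith(prefix)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> terms = indexTerms(List.of("Ball valves", "Gate valve"));
        System.out.println("indexed terms: " + terms);
        for (String q : List.of("valv*", "valve*", "valves*")) {
            System.out.println(q + " matches: " + prefixMatches(terms, q));
        }
    }
}
```

Because no indexed term starts with "valve", only the "valv*" query can match in this model.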

I thought about writing an analyzer that stores the stemmed and non-stemmed tokens. Basically applying two analyzers and combining the two token streams. But I'm not sure if this is a practical solution.

I also thought about using the AnalyzingQueryParser, but I don't know how to apply it to a multifield query. Also, using the AnalyzingQueryParser would return results for "valve" when searching for "valves*", and that's not the expected behavior.

Is there a "preferred" way of utilizing both wildcards and stemming algorithms?

Solution

I have used two different approaches to solve this before:

  1. Use two fields: one that contains stemmed terms, and another containing terms generated by, say, the StandardAnalyzer. When you parse the search query, search the "standard" field if it is a wildcard search; otherwise use the field with the stemmed terms. This may be harder to do if your users type their queries directly into Lucene's QueryParser.

  2. Write a custom analyzer that indexes overlapping tokens. It basically consists of indexing the original term and its stem at the same position in the index, using the PositionIncrementAttribute. You can look at SynonymFilter for examples of how to use the PositionIncrementAttribute correctly.
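As a rough illustration of the first approach, the routing step might look like the sketch below. It is plain Java rather than Lucene code: the field names `title_stemmed` and `title_exact` are hypothetical, `toyStem` is a toy stand-in for a real stemmer, and the output is just a field-qualified query string.

```java
import java.util.ArrayList;
import java.util.List;

final class FieldRouter {
    // Hypothetical field names: one indexed with a stemmer, one without.
    static final String STEMMED_FIELD = "title_stemmed";
    static final String EXACT_FIELD = "title_exact";

    // Toy stand-in for a stemmer (NOT the Snowball algorithm).
    static String toyStem(String term) {
        if (term.endsWith("es")) return term.substring(0, term.length() - 2);
        if (term.endsWith("s") || term.endsWith("e")) return term.substring(0, term.length() - 1);
        return term;
    }

    static boolean isWildcard(String term) {
        return term.indexOf('*') >= 0 || term.indexOf('?') >= 0;
    }

    // Wildcard terms go to the unstemmed field as typed; other terms are stemmed.
    static String route(String term) {
        String t = term.toLowerCase();
        return isWildcard(t) ? EXACT_FIELD + ":" + t : STEMMED_FIELD + ":" + toyStem(t);
    }

    // Route each whitespace-separated term of a user query independently.
    static String rewrite(String userQuery) {
        List<String> parts = new ArrayList<>();
        for (String term : userQuery.trim().split("\\s+")) {
            if (!term.isEmpty()) parts.add(route(term));
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("ball valve*"));
        System.out.println(rewrite("Valves"));
    }
}
```

This only works if the application sees each term before it is parsed, which is why it gets harder when users write raw query syntax themselves.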

I prefer solution #2.
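The overlapping-token idea behind solution #2 can be sketched without Lucene as follows. The `Token` class here is a hypothetical simplification that carries only a term and a position increment, and `toyStem` is a toy stand-in for the Snowball stemmer. In a real analyzer the same effect would come from a token filter that sets the PositionIncrementAttribute to 0 on the extra token.

```java
import java.util.ArrayList;
import java.util.List;

final class OverlappingTokens {
    // Simplified token: a term plus its position increment (0 = same position as the previous token).
    static final class Token {
        final String term;
        final int positionIncrement;
        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Toy stand-in for a stemmer (NOT the Snowball algorithm).
    static String toyStem(String term) {
        if (term.endsWith("es")) return term.substring(0, term.length() - 2);
        if (term.endsWith("s") || term.endsWith("e")) return term.substring(0, term.length() - 1);
        return term;
    }

    // Emit the original word, then its stem at the same position when the two differ.
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) continue;
            out.add(new Token(word, 1));
            String stem = toyStem(word);
            if (!stem.equals(word)) out.add(new Token(stem, 0));
        }
        return out;
    }

    // Trailing-wildcard query: the prefix is matched as typed, without stemming.
    static boolean matchesPrefix(List<Token> tokens, String wildcardQuery) {
        String prefix = wildcardQuery.replace("*", "").toLowerCase();
        for (Token t : tokens) {
            if (t.term.startsWith(prefix)) return true;
        }
        return false;
    }

    // Plain term query: the query term is stemmed, then matched exactly.
    static boolean matchesTerm(List<Token> tokens, String queryTerm) {
        String stem = toyStem(queryTerm.toLowerCase());
        for (Token t : tokens) {
            if (t.term.equals(stem)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        for (Token t : analyze("Gate valves")) {
            System.out.println(t.term + " (+" + t.positionIncrement + ")");
        }
    }
}
```

With both forms indexed, a "valve*" or "valves*" prefix can match the original word while a plain "valve" query still matches through the stem. The cost is a larger index, since most words are stored twice.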
