Lucene Indexing/Query strategy for hyphenated words


Question

Many words are hyphenated or separated by whitespace but are often also written as a single word. For example, "Basket Ball" or "basket-ball" can be written as "basketball".

Suppose I index a sentence such as "Hey dude, I played basket ball yesterday" and then query for basketball (without double quotes).

In this case, and in the reverse case (index "basketball" and query "basket ball"), I get no results. Is there any way to solve this problem, directly or indirectly?

Edit:
I gave the example just to demonstrate the problem. In my actual application I will be indexing and searching IDs. If I index 011 12345, I should be able to query it using 01112345.

Thanks.

Recommended answer

Hyphens are not the issue here. Assuming you are using something like the StandardTokenizer, which splits on characters such as hyphens, users searching for "basket ball" will match the original text "Basket-Ball" (and vice versa), so there is no problem there.
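To make the point about splitting concrete, here is a minimal sketch in plain Java. It does not use Lucene; `SimpleTokenizerSketch` and its `tokenize` method are hypothetical stand-ins that only approximate a tokenizer followed by lowercasing, assuming tokens are split on every non-alphanumeric character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for "tokenizer + lowercase filter"; NOT Lucene's StandardTokenizer.
final class SimpleTokenizerSketch {

    // Splits on any run of characters that is neither a letter nor a digit,
    // then lowercases each piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) { // skip the empty piece produced by a leading delimiter
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Basket-Ball")); // [basket, ball]
        System.out.println(tokenize("basket ball")); // [basket, ball]
        System.out.println(tokenize("basketball"));  // [basketball]
    }
}
```

Under this assumption the hyphenated and the whitespace-separated forms produce the same two tokens, while the one-word form produces a single, different token, which is why only the one-word case fails to match.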

The issue is going between two-word and one-word equivalents, e.g. "basket ball" and "basketball". You basically need to handle synonyms (e.g. jacket/coat or, in your case, basketball/"basket ball").

You can overcome this by creating a list of equivalent words yourself, or by using a dictionary like WordNet, and supplementing either the index or the search with the synonyms for each term. Solr has a SynonymFilter you can probably leverage.
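As a rough illustration of supplementing the search side from a hand-made list, the sketch below expands a query term into all of its listed equivalents. It is plain Java rather than Lucene or Solr code; the class name `SynonymExpansionSketch`, the `GROUPS` table and the `expand` method are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: expands one query term using hand-maintained equivalence groups.
final class SynonymExpansionSketch {

    // Each inner list holds terms that should be treated as equivalent (illustrative data).
    private static final List<List<String>> GROUPS = Arrays.asList(
            Arrays.asList("basketball", "basket ball"),
            Arrays.asList("jacket", "coat"));

    // Returns the term itself plus every equivalent from any group that contains it.
    static Set<String> expand(String term) {
        Set<String> result = new LinkedHashSet<>();
        result.add(term);
        for (List<String> group : GROUPS) {
            if (group.contains(term)) {
                result.addAll(group);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(expand("basketball")); // [basketball, basket ball]
        System.out.println(expand("tennis"));     // [tennis]
    }
}
```

The expanded alternatives would then be combined into a single query, for example as optional clauses. Doing the expansion at index time instead trades a larger index for simpler queries.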

Here's the code for a very basic synonym filter I wrote a while ago. The synonyms are not externalized, but you can easily add that yourself.

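import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.log4j.Logger; // assumption: log4j 1.x, based on Logger.getLogger(Class)
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
// CharArrayMap must also be imported; its package depends on the library version
// in use (assumption: org.apache.solr.util.CharArrayMap in older Solr releases).

// Written against the older Token-based TokenStream API (next(Token)).
public class SynonymFilter extends TokenFilter {
    private static final Logger log = Logger.getLogger(SynonymFilter.class);

    // Synonyms waiting to be emitted after the token that triggered them
    private Stack<Token> synStack = new Stack<Token>();

    // Case-insensitive lookup from a term to its synonyms
    static CharArrayMap<String[]> synLookup = new CharArrayMap<String[]>(5, true);
    static {
        // Note: "basket ball" is emitted as one token that contains a space
        synLookup.put("basketball".toCharArray(), new String[]{"basket ball"});
        synLookup.put("trainer".toCharArray(), new String[]{"sneaker"});
        synLookup.put("burger".toCharArray(), new String[]{"hamburger"});
        synLookup.put("bike".toCharArray(), new String[]{"bicycle", "cycle"});
    }

    // TODO reverse map all the syns to each other e.g. sneaker to trainer

    protected SynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next(Token reusableToken) throws IOException {
        // Emit any queued synonyms before reading further input
        if (synStack.size() > 0)
            return synStack.pop();

        Token nextToken = input.next(reusableToken);
        if (nextToken != null) {
            addSynonyms(nextToken);
        }

        return nextToken;
    }

    private void addSynonyms(Token nextToken) {
        // Copy only the valid part of the term buffer
        char[] word = Arrays.copyOf(nextToken.termBuffer(), nextToken.termLength());
        String[] synonyms = synLookup.get(word);
        if (synonyms != null) {
            for (String s : synonyms) {
                if (!equals(word, s)) {
                    char[] chars = s.toCharArray();
                    Token synToken = new Token(chars, 0, chars.length, nextToken.startOffset(), nextToken.endOffset());
                    // Position increment 0 places the synonym at the same position as the original token
                    synToken.setPositionIncrement(0);
                    synStack.push(synToken);
                    log.info("Found synonym: " + s + " for: " + nextToken.term());
                }
            }
        }
    }

    public static boolean equals(char[] word, String subString) {
        return equals(word, word.length, subString);
    }

    // Case-sensitive comparison of the first len chars of word with subString
    public static boolean equals(char[] word, int len, String subString) {
        if (len != subString.length())
            return false;

        for (int i = 0; i < subString.length(); i++) {
            if (word[len - i - 1] != subString.charAt(subString.length() - i - 1))
                return false;
        }

        return true;
    }
}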
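
The TODO in the filter mentions mapping all synonyms back to each other (e.g. sneaker to trainer). One possible way to derive such a table from the one-directional entries is sketched below in plain Java. `SymmetricSynonymsSketch` and `symmetric` are hypothetical names, and the sketch uses ordinary `String` keys instead of the `CharArrayMap` used by the filter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: turns a one-directional synonym table into a symmetric one.
final class SymmetricSynonymsSketch {

    // For each entry, every word in {key + synonyms} is mapped to all the other words.
    static Map<String, Set<String>> symmetric(Map<String, String[]> oneWay) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (Map.Entry<String, String[]> entry : oneWay.entrySet()) {
            Set<String> group = new TreeSet<>(Arrays.asList(entry.getValue()));
            group.add(entry.getKey());
            for (String word : group) {
                Set<String> others = result.computeIfAbsent(word, k -> new TreeSet<>());
                others.addAll(group);
                others.remove(word); // a word is not its own synonym
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String[]> oneWay = new HashMap<>();
        oneWay.put("trainer", new String[]{"sneaker"});
        oneWay.put("bike", new String[]{"bicycle", "cycle"});
        System.out.println(symmetric(oneWay));
    }
}
```

This only links words that appear together in the same entry; it does not merge groups transitively across separate entries.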
