Solr Custom Filter for Concatenating Tokens


Question

I need to write a custom filter for the Solr analyzer phase. The idea is to first tokenize the input business name on whitespace, then apply a set of filters for lowercasing, pattern replacement, and stop-word removal. After these filters, I want to merge (concatenate) all the tokens into a single token and then apply NGramFilterFactory to generate n-grams from that token.
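The intended chain can be sketched in plain Java without any Solr or Lucene classes. This is only an illustration of the order of steps described above, not the actual analyzer configuration: the stop-word set, the punctuation-stripping pattern, and the names `AnalysisChainSketch` and `normalize` are all assumptions made up for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AnalysisChainSketch {
    // Hypothetical stop words; a real chain would load its own list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "inc", "llc"));

    // Whitespace tokenize -> lowercase -> pattern replace -> stop-word removal -> concatenate.
    static String normalize(String businessName) {
        StringBuilder merged = new StringBuilder();
        for (String token : businessName.trim().split("\\s+")) {
            String t = token.toLowerCase(Locale.ROOT);
            // Hypothetical pattern replacement: drop everything except a-z and 0-9.
            t = t.replaceAll("[^a-z0-9]", "");
            if (t.isEmpty() || STOP_WORDS.contains(t)) {
                continue; // token removed by the filters
            }
            merged.append(t); // concatenate the surviving tokens
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("The Apple Computers, Inc.")); // prints "applecomputers"
    }
}
```

The single merged string is what would then be handed to the n-gram step.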

The reason I want to combine all the tokens (generated initially from the business name) is twofold: tokens shorter than N (the gram size in NGramFilter) would otherwise be dropped and never indexed in Solr, and users might not insert the proper spaces when entering a business name. Please let me know if more clarification is needed.
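The short-token concern can be made concrete with a small sketch. It uses a single fixed gram size `n` as a simplification (NGramFilterFactory takes a minimum and a maximum gram size), and the tokens `j`, `k`, `toys` are a made-up example; the class and method names are hypothetical as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NGramSketch {
    // All character n-grams of exactly n chars; a term shorter than n yields none.
    static List<String> ngrams(String term, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= term.length(); i++) {
            out.add(term.substring(i, i + n));
        }
        return out;
    }

    // N-grams computed token by token: short tokens contribute nothing.
    static List<String> ngramsPerToken(List<String> tokens, int n) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.addAll(ngrams(t, n));
        }
        return out;
    }

    // N-grams computed over the concatenation of all tokens.
    static List<String> ngramsOfConcatenation(List<String> tokens, int n) {
        return ngrams(String.join("", tokens), n);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("j", "k", "toys");
        System.out.println(ngramsPerToken(tokens, 3));        // [toy, oys]
        System.out.println(ngramsOfConcatenation(tokens, 3)); // [jkt, kto, toy, oys]
    }
}
```

With per-token n-grams the one-letter tokens leave no trace in the index, while the concatenated form keeps them as part of the grams that span the former token boundaries.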

I made an attempt to write a custom filter for this, but it is not working properly and I am not able to understand its behavior.

When I query the name "apple", it returns n1 results.

When I query the name "computers", it returns n2 results.

When I query the name "apple computers", it returns n3 results.

When I query the name "computers apple", it returns n4 results.

Here n3 is smaller than both n1 and n2, and n3 != n4.

Here is the code. I am using Solr 4.10.2 and have included the matching solr-core jars.

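import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Declared final because Lucene expects TokenStream implementations
// (or at least their incrementToken()) to be final.
public final class ConcatFilter extends TokenFilter {

    private final CharTermAttribute charTermAtt;
    // Accumulates the text of every token seen so far in the current stream.
    private final StringBuilder builder = new StringBuilder();

    public ConcatFilter(TokenStream input) {
        super(input);
        charTermAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Append the current term to what has been collected so far.
            int len = charTermAtt.length();
            char[] buffer = charTermAtt.buffer();
            builder.append(buffer, 0, len);

            // Overwrite the current term with the accumulated text.
            // Note: this emits one (growing) token per input token.
            char[] newBuffer = builder.toString().toCharArray();
            int newLength = builder.length();
            charTermAtt.setEmpty();
            charTermAtt.copyBuffer(newBuffer, 0, newLength);
            charTermAtt.setLength(newLength);
            return true;
        } else {
            // End of stream: clear the accumulator for the next use.
            builder.delete(0, builder.length());
            return false;
        }
    }
}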

Recommended answer

I have written a filter that concatenates words and submitted it as a patch to the Solr community. Anyone facing the same problem can find it here: ConcatenateWordsFilter
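For readers trying to understand the behavior reported in the question, the following Lucene-free sketch models what the posted `incrementToken()` does and contrasts it with draining the whole input before emitting anything. It is not the code of the ConcatenateWordsFilter patch; the class and method names are hypothetical, and lists of strings stand in for token streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConcatBehaviorSketch {
    // Mirrors the posted filter: one output per input token,
    // each output being everything accumulated so far.
    static List<String> runningConcat(List<String> tokens) {
        StringBuilder builder = new StringBuilder();
        List<String> emitted = new ArrayList<>();
        for (String t : tokens) {
            builder.append(t);
            emitted.add(builder.toString());
        }
        return emitted;
    }

    // Alternative: consume all input tokens first, then emit a single token.
    static List<String> concatOnce(List<String> tokens) {
        StringBuilder builder = new StringBuilder();
        for (String t : tokens) {
            builder.append(t);
        }
        List<String> emitted = new ArrayList<>();
        if (builder.length() > 0) {
            emitted.add(builder.toString());
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(runningConcat(Arrays.asList("apple", "computers"))); // [apple, applecomputers]
        System.out.println(runningConcat(Arrays.asList("computers", "apple"))); // [computers, computersapple]
        System.out.println(concatOnce(Arrays.asList("apple", "computers")));    // [applecomputers]
    }
}
```

The posted filter therefore produces a growing token for each input token rather than one merged token, and the emitted terms depend on word order, which is one plausible contributor to n3 != n4. In a real TokenFilter, the drain-then-emit variant would loop over `input.incrementToken()` inside a single call to `incrementToken()`, return the merged term once, and reset its state in `reset()`.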
