Solr Custom Filter for Concatenating Tokens

Question

I need to write a custom filter for the Solr analyzer phase. The idea is to first tokenize the input business name on whitespace, then apply a set of filters for lowercasing, pattern replacement, and stop-word removal. After these filters, I want to merge (concatenate) all the tokens into one token and then apply NGramFilterFactory to generate N-grams from that token.

I want to combine all the tokens (generated initially from the business name) for two reasons: tokens shorter than N would otherwise be dropped by the NGramFilter and never indexed in Solr, and users might not type the proper spaces when entering a business name. Please let me know if more clarification is needed.
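To make the short-token concern concrete, here is a minimal plain-Java sketch of character n-gram generation. It is not Lucene's NGramTokenFilter; the `NGramDemo` class, the `ngrams` helper, and the sample name "hp labs" are assumptions made up for illustration, and the sketch assumes that tokens shorter than the minimum gram size produce no grams.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramDemo {

    /** Returns every substring of token whose length is between minGram and maxGram. */
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        int n = 3;
        // A token shorter than n yields no grams, so nothing from it would be indexed.
        System.out.println(ngrams("hp", n, n));          // []
        // After concatenation, the short token contributes to grams of the merged string.
        System.out.println(ngrams("hp" + "labs", n, n)); // [hpl, pla, lab, abs]
    }
}
```

With N = 3, the two-character token disappears on its own but survives inside the grams of the concatenated string, which is the effect the merged token is meant to achieve.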

I attempted to write a custom filter for this, but it does not work properly and I am not able to understand its behavior.

When I query the name "apple", it returns n1 results.

When I query the name "computers", it returns n2 results.

When I query the name "apple computers", it returns n3 results.

When I query the name "computers apple", it returns n4 results.

Here n3 is smaller than both n1 and n2, and n3 != n4.

Here is the code. I am using Solr 4.10.2 and have included the matching solr-core JARs.

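import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ConcatFilter extends TokenFilter {

    private CharTermAttribute charTermAtt;
    // Accumulates the text of every token seen so far in the current stream.
    private StringBuilder builder = new StringBuilder();

    public ConcatFilter(TokenStream input) {
        super(input);
        charTermAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            // Append the current term to the running concatenation.
            int len = charTermAtt.length();
            char[] buffer = charTermAtt.buffer();
            builder.append(buffer, 0, len);
            char[] newBuffer = builder.toString().toCharArray();
            int newLength = builder.length();
            // Overwrite the current term with everything accumulated so far.
            // This runs once per input token, so the stream emits growing
            // prefixes ("apple", then "applecomputers"), not one merged token.
            charTermAtt.setEmpty();
            charTermAtt.copyBuffer(newBuffer, 0, newLength);
            charTermAtt.setLength(newLength);
            return true;
        } else {
            // Upstream is exhausted: clear the accumulator for the next stream.
            builder.delete(0, builder.length());
            return false;
        }
    }
}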

Recommended answer

I have written a filter that concatenates words and submitted it as a patch to the Solr community. Anyone facing the same problem can find it here: ConcatenateWordsFilter
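The patch itself is not reproduced here. As a rough illustration of the idea, the following plain-Java model contrasts a filter that emits one output per input token, as the code in the question does, with one that drains every input token before emitting a single merged token. It is only a sketch: `ConcatModel`, `emitPerToken`, and `emitOnce` are hypothetical names, it does not use the Lucene API, and it is not the code of ConcatenateWordsFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ConcatModel {

    /** Mimics the filter in the question: one output per input token, each a growing prefix. */
    static List<String> emitPerToken(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        for (String token : tokens) {
            builder.append(token);
            out.add(builder.toString());
        }
        return out;
    }

    /** Drains all input tokens first, then emits a single concatenated token (or none). */
    static List<String> emitOnce(List<String> tokens) {
        StringBuilder builder = new StringBuilder();
        for (String token : tokens) {
            builder.append(token);
        }
        List<String> out = new ArrayList<>();
        if (builder.length() > 0) {
            out.add(builder.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("apple", "computers");
        System.out.println(emitPerToken(tokens)); // [apple, applecomputers]
        System.out.println(emitOnce(tokens));     // [applecomputers]
        System.out.println(emitOnce(Arrays.asList("computers", "apple"))); // [computersapple]
    }
}
```

In an actual TokenFilter, the draining loop would presumably sit inside incrementToken(), guarded by a flag that reset() clears so the filter can be reused across streams. Note also that the merged token depends on word order ("applecomputers" versus "computersapple"), which may be one reason the two-word queries in the question return different counts.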
