Lucene.Net Underscores causing token split

Question

I've scripted the tables, views, and stored procedures of an MS SQL Server database into a directory structure that I am then indexing with Lucene.Net. Most of my table, view, and procedure names contain underscores.

I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I receive hits for tIr and for InvoiceBtnWtn01, rather than only for tIr_InvoiceBtnWtn01.

I think the issue is that the tokenizer splits on _ (underscore) because it treats it as punctuation.

Is there a (simple) way to remove the underscore from the punctuation list, or is there another analyzer that I should be using for SQL and programming languages?

Recommended answer

Yes, the StandardAnalyzer splits on underscores. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to apply a different analyzer to each field: you might want to keep some of the standard analyzer's functionality for everything except the table/column name field.
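As a rough illustration of the PerFieldAnalyzerWrapper idea, the sketch below assumes the Lucene.Net 2.9 API and a hypothetical field called "objectName" that holds the table, view, or procedure name. The class name AnalyzerFactory is also made up for the example. Every other field falls back to StandardAnalyzer.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Hypothetical helper; "objectName" is an assumed field name, not from the question.
static class AnalyzerFactory
{
    public static Analyzer Create()
    {
        // StandardAnalyzer is the default for any field not registered below.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));

        // The name field is split on whitespace only, so underscores stay inside the token.
        wrapper.AddAnalyzer("objectName", new WhitespaceAnalyzer());
        return wrapper;
    }
}
```

Whichever analyzer you choose, pass the same one to both the IndexWriter and the QueryParser. Otherwise the query terms are tokenized differently from the indexed terms and will not match.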

WhitespaceAnalyzer only splits on whitespace, though. It won't lowercase your tokens, for example. So you might want to write your own analyzer that combines WhitespaceTokenizer and LowerCaseFilter, or look into LowerCaseTokenizer (which divides text at non-letters, so it also splits on underscores and digits).

Simple custom analyzer (in C#, but you can translate it to Java pretty easily):

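using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

// Chains together the standard tokenizer, standard filter, and lowercase filter.
// As written it still uses StandardTokenizer, so it illustrates the chaining
// pattern rather than a fix for the underscore split.
class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Tokenize with the Lucene 2.9 StandardTokenizer grammar.
        StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
        // Apply the standard token normalization.
        StandardFilter standardFilter = new StandardFilter(baseTokenizer);
        // Lowercase every token so that matching is case-insensitive.
        LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
        return lcFilter;
    }
}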
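
The analyzer above keeps StandardTokenizer at the head of the chain. A variant that follows the WhitespaceTokenizer plus LowerCaseFilter suggestion might look like the sketch below. It assumes the Lucene.Net 2.9 API, and the class name WhitespaceLowerCaseAnalyzer is hypothetical.

```csharp
using System.IO;
using Lucene.Net.Analysis;

// Hypothetical analyzer: splits on whitespace only, then lowercases each token.
class WhitespaceLowerCaseAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // WhitespaceTokenizer does not treat '_' as a separator, so a name such as
        // tIr_InvoiceBtnWtn01 stays one token; LowerCaseFilter then lowercases it.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```

One trade-off to keep in mind: because only whitespace separates tokens, punctuation that touches a name in the script (brackets, parentheses, a trailing comma) stays attached to the token. If your scripts quote names as [dbo].[tIr_InvoiceBtnWtn01], you would probably need extra filtering before this matches a plain query for the name.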
