Using Lucene to search for email addresses


Question

I want to use Lucene (in particular, Lucene.NET) to search for email address domains.

For example, I want to search for "@gmail.com" to find all emails sent to a Gmail address.

Running a Lucene query for "*@gmail.com" results in an error, because a query cannot start with an asterisk. Running a query for "@gmail.com" returns no matches, because "foo@gmail.com" is seen as a single word, and you cannot search for just part of a word.

How can I do this?

Recommended answer

No one gave a satisfactory answer, so we started poking around the Lucene documentation and discovered that we can accomplish this with a custom Analyzer and Tokenizer.

The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all Gmail addresses, because the tokenizer we just created splits each address at the @ symbol, so the domain is indexed as a separate word.

Here's the source code; it's actually very simple:

using System.IO;            // TextReader
using Lucene.Net.Analysis;  // CharTokenizer, Analyzer, TokenStream

class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}


internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}
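
To make the effect of the IsTokenChar rule concrete, here is a minimal sketch in plain C# with no Lucene.NET dependency. The class AtSymbolSplitSketch and its Split helper are hypothetical names made up for this illustration; they only mimic how a character-based tokenizer groups characters, assuming that characters rejected by IsTokenChar end the current token and are themselves discarded.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class AtSymbolSplitSketch
{
    // Same rule as IsTokenChar above: whitespace and '@' are not token characters.
    static bool IsTokenChar(char c) => !(char.IsWhiteSpace(c) || c == '@');

    // Hypothetical helper (not a Lucene.NET API): collects runs of token characters.
    public static List<string> Split(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A separator ends the current token; the separator itself is dropped.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }

    static void Main()
    {
        // prints "foo | gmail.com | bar | example.org"
        Console.WriteLine(string.Join(" | ", Split("foo@gmail.com bar@example.org")));
        // prints "gmail.com"
        Console.WriteLine(string.Join(" | ", Split("@gmail.com")));
    }
}
```

Under that assumption, the query text "@gmail.com" reduces to the single token "gmail.com", which is the same token produced for the domain of every Gmail address at index time. The comparison is case-sensitive, since nothing here lowercases the text.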

That's it! Now you just need to rebuild your index and run all searches using this new Analyzer. For example, to write documents to your index:

IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);

Searches should use the same analyzer as well:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
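
Why the same analyzer matters on both sides can be sketched with a toy inverted index in plain C#. Nothing below is Lucene.NET code: SameAnalyzerSketch, Analyze, BuildIndex and Search are hypothetical names, Analyze only approximates the custom analyzer by splitting on common whitespace characters and '@', and a document is treated as a hit if it contains any query token.

```csharp
using System;
using System.Collections.Generic;

static class SameAnalyzerSketch
{
    // Stand-in for the custom analyzer: split on common whitespace and '@'.
    static string[] Analyze(string text) =>
        text.Split(new[] { ' ', '\t', '\r', '\n', '@' }, StringSplitOptions.RemoveEmptyEntries);

    // Toy inverted index: token -> ids of the documents that contain it.
    public static Dictionary<string, SortedSet<int>> BuildIndex(IList<string> docs)
    {
        var index = new Dictionary<string, SortedSet<int>>();
        for (int id = 0; id < docs.Count; id++)
        {
            foreach (string token in Analyze(docs[id]))
            {
                if (!index.TryGetValue(token, out var ids))
                {
                    ids = new SortedSet<int>();
                    index[token] = ids;
                }
                ids.Add(id);
            }
        }
        return index;
    }

    // The query text goes through the same Analyze step as the documents did.
    public static List<int> Search(Dictionary<string, SortedSet<int>> index, string queryText)
    {
        var hits = new SortedSet<int>();
        foreach (string token in Analyze(queryText))
        {
            if (index.TryGetValue(token, out var ids))
            {
                hits.UnionWith(ids);
            }
        }
        return new List<int>(hits);
    }

    static void Main()
    {
        var docs = new List<string> { "foo@gmail.com", "bar@example.org", "baz@gmail.com" };
        var index = BuildIndex(docs);
        // prints "0, 2"
        Console.WriteLine(string.Join(", ", Search(index, "@gmail.com")));
    }
}
```

If the query text were not split the same way, the lookup key would be the literal string "@gmail.com", which never appears in the index, so the search would find nothing.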
