Using Lucene to search for email addresses

Question

I want to use Lucene (in particular, Lucene.NET) to search for email address domains.

For example, I want to search for "@gmail.com" to find all emails sent to a Gmail address.

Running a Lucene query for "*@gmail.com" results in an error, because an asterisk cannot be at the start of a query. Running a query for "@gmail.com" doesn't return any matches, because "foo@gmail.com" is indexed as a single word, and you cannot search for just part of a word.

How can I do this?

Recommended answer

No one gave a satisfactory answer, so we started poking around Lucene documentation and discovered we can accomplish this using custom Analyzers and Tokenizers.

The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. Once you do this, a search for "@gmail.com" will return all Gmail addresses, because the tokenizer splits each address at the @ symbol, so the domain is indexed as a separate word.

Here's the source code; it's actually quite simple:

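using System.IO;
using Lucene.Net.Analysis;

// Tokenizer that splits text on whitespace and on the @ symbol.
class WhitespaceAndAtSymbolTokenizer : CharTokenizer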
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}


internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}

That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:

IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);

Performing searches should use the analyzer as well:

IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
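
To see why the search matches, it can help to look at what the splitting rule does to an address. The following is only a standalone sketch that mimics the IsTokenChar rule from the tokenizer above without using Lucene at all; the class name EmailTokenSketch and its Tokenize method are hypothetical helpers introduced for illustration, not part of Lucene.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not a Lucene.NET type): reproduces the splitting rule
// of WhitespaceAndAtSymbolTokenizer so the resulting words can be inspected.
public static class EmailTokenSketch
{
    // Same rule as the tokenizer above: whitespace and '@' separate words.
    private static bool IsTokenChar(char c)
    {
        return !(char.IsWhiteSpace(c) || c == '@');
    }

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A separator ends the current word; the separator itself is dropped.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }

        // Flush the last word if the text did not end with a separator.
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }

        return tokens;
    }
}
```

Under this rule, "foo@gmail.com" becomes the two words "foo" and "gmail.com", and the query text "@gmail.com" becomes the single word "gmail.com", which is why the index and the query need to go through the same analyzer.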
