Using Lucene to search for email addresses
Problem description
I want to use Lucene (specifically Lucene.NET) to search by email address domain.
For example, I want to search for "@gmail.com" to find all emails sent to a Gmail address.
Running a Lucene query for "*@gmail.com" results in an error, because an asterisk cannot appear at the start of a query. Running a query for "@gmail.com" returns no matches, because "foo@gmail.com" is treated as one whole word, and you cannot search for just part of a word.
How can I do this?
Recommended answer
No one gave a satisfactory answer, so we dug into the Lucene documentation and found that this can be done with a custom Analyzer and Tokenizer.
The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then rebuild your index using this analyzer. Once you do, a search for "@gmail.com" returns all Gmail addresses, because the new tokenizer splits each address at the @ symbol, so the domain is indexed as a separate word.
Here's the source code; it's actually very simple:
using System.IO;
using Lucene.Net.Analysis;

// Splits text into tokens at whitespace and at the @ symbol.
class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}

// Analyzer that applies the tokenizer above to every field.
internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}
That's it! Now you just need to rebuild your index and run all searches with this new analyzer. For example, to write documents to your index:
IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);
Searches should use the same analyzer:
IndexSearcher searcher = new IndexSearcher(indexDirectory);
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
Hits hits = searcher.Search(query);
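To see why the query matches, it can help to look at the tokens this splitting rule produces. The following is a minimal sketch in plain C# that does not use Lucene at all: TokenizerSketch and Tokenize are hypothetical names introduced only for illustration, and the loop merely imitates how a CharTokenizer collects runs of characters accepted by IsTokenChar.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; it is not part of Lucene.NET.
public static class TokenizerSketch
{
    // Same rule as IsTokenChar in the tokenizer: whitespace and '@' end a token.
    private static bool IsTokenChar(char c)
    {
        return !(char.IsWhiteSpace(c) || c == '@');
    }

    // Collects maximal runs of token characters; separators are dropped.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                // A separator closes the token collected so far.
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }
}
```

Under this rule, Tokenize("foo@gmail.com") yields the two tokens foo and gmail.com, and Tokenize("@gmail.com") yields the single token gmail.com. The @ itself is discarded in both cases, so the query term and the indexed domain token are identical, which is why the search finds every Gmail address. One consequence of this design is that a search for plain gmail.com matches the same documents.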