Using Lucene to search for email addresses
Question
I want to use Lucene (in particular, Lucene.NET) to search for email address domains.
For example, I want to search for "@gmail.com" to find all emails sent to a Gmail address.
Running a Lucene query for "*@gmail.com" results in an error, because an asterisk cannot appear at the start of a query. Running a query for "@gmail.com" doesn't return any matches, because "foo@gmail.com" is indexed as a single word, and you cannot search for just part of a word.
How can I do this?
Recommended answer
No one gave a satisfactory answer, so we started poking around the Lucene documentation and discovered that we can accomplish this with a custom Analyzer and Tokenizer.
The answer is this: create a WhitespaceAndAtSymbolTokenizer and a WhitespaceAndAtSymbolAnalyzer, then recreate your index using this analyzer. The tokenizer treats the @ symbol as a word boundary, just like whitespace, so "foo@gmail.com" is indexed as the two words "foo" and "gmail.com". Once you do this, a search for "@gmail.com" (parsed with the same analyzer) will return all Gmail addresses, because the domain is now a separate word.
Here's the source code. It's actually very simple:
using System.IO;
using Lucene.Net.Analysis;

// Splits text into tokens at whitespace and at the @ symbol.
internal class WhitespaceAndAtSymbolTokenizer : CharTokenizer
{
    public WhitespaceAndAtSymbolTokenizer(TextReader input)
        : base(input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        // Make whitespace characters and the @ symbol be indicators of new words.
        return !(char.IsWhiteSpace(c) || c == '@');
    }
}

// Analyzer that applies only the tokenizer above, with no further filters.
internal class WhitespaceAndAtSymbolAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceAndAtSymbolTokenizer(reader);
    }
}
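To see which words this rule produces, here is a minimal library-free sketch. It does not use Lucene.NET: the `TokenizerSketch` class and its `Tokenize` helper are hypothetical names made up for illustration, and they only mimic the character rule from `IsTokenChar` above, ignoring anything else the real `CharTokenizer` does.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TokenizerSketch
{
    // Same rule as IsTokenChar in the answer: whitespace and '@' end a token.
    static bool IsTokenChar(char c)
    {
        return !(char.IsWhiteSpace(c) || c == '@');
    }

    // Hypothetical helper (not a Lucene.NET API): collects runs of token
    // characters and drops the delimiter characters between them.
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in text)
        {
            if (IsTokenChar(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }

    static void Main()
    {
        // An indexed address is split into the local part and the domain.
        Console.WriteLine(string.Join(" | ", Tokenize("foo@gmail.com"))); // foo | gmail.com
        // The query text reduces to the domain alone, which matches the indexed word.
        Console.WriteLine(string.Join(" | ", Tokenize("@gmail.com")));    // gmail.com
    }
}
```

Note that the @ character itself is discarded, so the query effectively looks for the word "gmail.com".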
That's it! Now you just need to rebuild your index and do all searches using this new Analyzer. For example, to write documents to your index:
IndexWriter index = new IndexWriter(indexDirectory, new WhitespaceAndAtSymbolAnalyzer());
index.AddDocument(myDocument);
Searches should use the same analyzer as well:
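IndexSearcher searcher = new IndexSearcher(indexDirectory);
// Parse the query text with the same analyzer that was used to build the index.
Query query = new QueryParser("TheFieldNameToSearch", new WhitespaceAndAtSymbolAnalyzer()).Parse("@gmail.com");
// Run the search on the searcher, not on the query object.
Hits hits = searcher.Search(query);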