Searching for Social Security numbers using Lucene 4 regexp
Problem description
I'm trying to use a Lucene 4 Regexp query to find Social Security numbers. If the field is analyzed using the StandardAnalyzer or the EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444?
As far as I can see, these analyzers tokenize the components of the SSN, so there is no way to catch consecutive matches for the three components. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/", but it doesn't seem to, perhaps because phrase queries do not work with regexps (is that right?). Any suggestions?
Recommended answer
If you simply have a field of identifiers, or some such, use a StringField or some other untokenized field, in which case a RegexpQuery is simple enough to define.
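As a rough illustration of the untokenized case: the whole field value is a single term, so one pattern can cover the digits and the separators together. The sketch below uses only java.util.regex, not Lucene; the class name SsnWholeValueSketch and the pattern are assumptions for illustration. As far as I know, Lucene regexps are matched against the entire term, which is why matches() is used here rather than find().

```java
import java.util.regex.Pattern;

// Plain-Java illustration only: no Lucene classes are used here.
class SsnWholeValueSketch {
    // Three digits, a hyphen or space, two digits, a hyphen or space, four digits.
    private static final Pattern SSN =
        Pattern.compile("[0-9]{3}[- ][0-9]{2}[- ][0-9]{4}");

    // True only if the entire field value is an SSN-shaped string.
    static boolean isSsn(String fieldValue) {
        return SSN.matcher(fieldValue).matches();
    }
}
```

A similar pattern string could presumably be passed to a RegexpQuery on the StringField, but check it against Lucene's own regexp syntax first, since that syntax differs from java.util.regex in places.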
If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:
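import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;
import org.apache.lucene.search.spans.*;
// Each clause matches one whole token of the tokenized "text" field.
SpanQuery span1 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{3}")));
SpanQuery span2 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{2}")));
SpanQuery span3 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{4}")));
// Slop 0 with inOrder = true: the three tokens must be adjacent and in this order.
Query query = new SpanNearQuery(new SpanQuery[] {span1, span2, span3}, 0, true);
searcher.search(query, maxResults);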
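To make the matching logic concrete, here is a minimal plain-Java sketch of what a span-near query with slop 0 and in-order matching amounts to: three adjacent tokens, each matching its own pattern in full. It does not use Lucene at all. The class SsnSpanSketch and its tokenize method are hypothetical, and the tokenizer is only a crude stand-in for what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Plain-Java illustration only: no Lucene classes are used here.
class SsnSpanSketch {
    // One pattern per SSN component, mirroring the three regexp clauses.
    private static final Pattern[] PARTS = {
        Pattern.compile("[0-9]{3}"),
        Pattern.compile("[0-9]{2}"),
        Pattern.compile("[0-9]{4}")
    };

    // Crude stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if three adjacent tokens, in order, each fully match their pattern
    // (the analogue of slop 0 with inOrder = true).
    static boolean containsSsn(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i + PARTS.length <= tokens.size(); i++) {
            boolean all = true;
            for (int j = 0; j < PARTS.length; j++) {
                if (!PARTS[j].matcher(tokens.get(i + j)).matches()) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsSsn("SSN: 222-33-4444 on file")); // true
        System.out.println(containsSsn("call 222 33 4444 now"));     // true
        System.out.println(containsSsn("222 and 33 and 4444"));      // false
    }
}
```

Because the separators are discarded during tokenization, both the hyphenated and the space-separated forms match, while the same three numbers with other words between them do not.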