Searching for social security numbers using Lucene 4 regexp

Problem description

I'm trying to use a Lucene 4 regexp query to find social security numbers. If the field is analyzed with StandardAnalyzer or EnglishAnalyzer, is there still some way to match strings like 222-33-4444 or 222 33 4444?

As far as I can see, these analyzers split the SSN into its three components as separate tokens, so a single regexp cannot match all three in sequence. Ideally, I'd like 222 33 4444 to match something like "/[0-9]{3}/ /[0-9]{2}/ /[0-9]{4}/", but it doesn't, perhaps because phrase queries do not work with regexps. Is that right? Any suggestions?

Recommended answer

If you simply have a field of identifiers or the like, use a StringField or some other untokenized field. In that case the whole value is indexed as a single term, and a plain RegexpQuery is easy to define.
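For the untokenized case, one pattern has to cover all three digit groups plus the separator. The sketch below checks such a pattern using java.util.regex as a stand-in for Lucene; the class name `SsnPattern` and the pattern itself are assumptions for illustration, not part of the original answer. Lucene's regexp dialect is not the same as Java's, but this pattern sticks to basic character classes and repetition counts, which should behave the same in both. `matches()` requires the whole input to match, which mirrors how a regexp query is tested against an entire term.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: checks a whole field value against an SSN pattern.
class SsnPattern {
    // Three digits, a hyphen or space, two digits, a hyphen or space, four digits.
    static final String SSN_REGEX = "[0-9]{3}[- ][0-9]{2}[- ][0-9]{4}";
    private static final Pattern SSN = Pattern.compile(SSN_REGEX);

    // matches() succeeds only if the entire value fits the pattern.
    static boolean isSsn(String fieldValue) {
        return SSN.matcher(fieldValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSsn("222-33-4444"));
        System.out.println(isSsn("222 33 4444"));
        System.out.println(isSsn("SSN: 222-33-4444"));
    }
}
```

With Lucene, the same string could be passed as `new RegexpQuery(new Term("ssn", SsnPattern.SSN_REGEX))`, where the field name `ssn` is a placeholder.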

If you are trying to pull them out of a full-text field, which must be tokenized (and I assume this is the case), you can use the SpanQuery API to construct the appropriate query:

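import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.spans.*;

// Each regexp matches one analyzed token: the 3-, 2-, and 4-digit parts of the SSN.
SpanQuery span1 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{3}")));
SpanQuery span2 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{2}")));
SpanQuery span3 = new SpanMultiTermQueryWrapper<RegexpQuery>(new RegexpQuery(new Term("text", "[0-9]{4}")));

// Slop 0 with inOrder = true: the three tokens must be adjacent and in this order.
Query query = new SpanNearQuery(new SpanQuery[] {span1, span2, span3}, 0, true);
// searcher is an existing IndexSearcher; maxResults is the number of hits to return.
searcher.search(query, maxResults);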
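To make the matching rule behind the span query concrete, here is a Lucene-free sketch of the same idea: split the text into tokens, then look for three adjacent tokens that fully match the three patterns in order. The tokenizer (lowercase, split on anything that is not a letter or digit) is only a rough approximation of what StandardAnalyzer produces, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real index would use Lucene's analyzers and span queries.
class AdjacentTokenMatch {
    // Crude stand-in for an analyzer: lowercase, then split on non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if some run of adjacent tokens fully matches the patterns, in order.
    static boolean containsSequence(List<String> tokens, String... patterns) {
        for (int i = 0; i + patterns.length <= tokens.size(); i++) {
            boolean all = true;
            for (int j = 0; j < patterns.length; j++) {
                if (!tokens.get(i + j).matches(patterns[j])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    // The three per-token patterns from the span query above.
    static boolean containsSsn(String text) {
        return containsSequence(tokenize(text), "[0-9]{3}", "[0-9]{2}", "[0-9]{4}");
    }

    public static void main(String[] args) {
        System.out.println(containsSsn("My SSN is 222-33-4444."));
        System.out.println(containsSsn("call 222 then 33 and 4444"));
    }
}
```

Note that this rule, like the span query, accepts any three adjacent tokens of three, two, and four digits, so hits may need a second check against the original text if stricter formatting matters.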
