Word boundary in Lucene regex

Question

I'd like to make a regex query in Elasticsearch with word boundaries, but it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?

Recommended answer

In the Elasticsearch regex flavor, there is no direct equivalent of a word boundary. A leading \b behaves roughly like (^|[^A-Za-z0-9_]) when the word starts with a word character, and a trailing \b behaves roughly like ($|[^A-Za-z0-9_]) when the word ends with a word character.

Thus, we need to make sure that the word is preceded by either a non-word character or the start of the string, and followed by either a non-word character or the end of the string. Since the regex is anchored by default, all we need to do to make [^A-Za-z0-9_] optional at the start/end of the string is add .* beside it and wrap the pair in an optional group:

(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?

Details

  • (.*[^A-Za-z0-9_])? - either the start of the string, or any 0+ chars (other than line break chars; use (.|\n)* instead to allow those) followed by any char that is not a word char (in other words, the start of the string followed by 1 or 0 occurrences of the pattern inside the group)
  • word - the word itself
  • ([^A-Za-z0-9_].*)? - an optional sequence of one non-word char followed by any 0+ chars, and then the end of the string (implicit in Lucene regex).
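As a rough illustration, the pattern above can be built programmatically and wrapped in a regexp query body. This is only a sketch under stated assumptions: the helper names (escape_lucene, word_boundary_pattern, regexp_query, matches_locally) and the field name "title" are hypothetical, the reserved-character list is my reading of the Lucene regex syntax, and Python's re.fullmatch is used only as a local stand-in for Lucene's implicit anchoring, not as an exact emulation of the Lucene engine.

```python
import re

# Characters assumed to be reserved in Lucene regex syntax (including the
# optional operators); each is backslash-escaped so the word matches literally.
LUCENE_RESERVED = set('.?+*|{}[]()"\\#@&<>~')

# Stand-in for "not a word character", as used in the answer.
NON_WORD = "[^A-Za-z0-9_]"


def escape_lucene(text: str) -> str:
    """Backslash-escape characters that would otherwise act as operators."""
    return "".join("\\" + ch if ch in LUCENE_RESERVED else ch for ch in text)


def word_boundary_pattern(word: str) -> str:
    """Emulate \\bword\\b for a word that starts and ends with a word char."""
    return f"(.*{NON_WORD})?{escape_lucene(word)}({NON_WORD}.*)?"


def regexp_query(field: str, word: str) -> dict:
    """Build a regexp query body; `field` is a hypothetical field name."""
    return {"query": {"regexp": {field: {"value": word_boundary_pattern(word)}}}}


def matches_locally(word: str, text: str) -> bool:
    """Approximate check: fullmatch mimics matching the entire string."""
    return re.fullmatch(word_boundary_pattern(word), text) is not None


if __name__ == "__main__":
    print(word_boundary_pattern("word"))
    for sample in ["word", "a word here", "swordfish", "words"]:
        print(repr(sample), matches_locally("word", sample))
```

Note that the pattern is matched against whole indexed terms, so whether it sees the full string or individual tokens depends on how the field is mapped and analyzed; the local check assumes the whole string is a single term.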
