Word boundary in Lucene regex
Question
I'd like to make a regex query in Elasticsearch with word boundaries; however, it looks like the Lucene regex engine doesn't support \b. What workarounds can I use?
Recommended answer
In the Elasticsearch regex flavor, there is no direct equivalent of a word boundary. A leading \b corresponds to something like
(^|[^A-Za-z0-9_])
if the word starts with a word char, and a trailing \b corresponds to something like
($|[^A-Za-z0-9_])
if the word ends with a word char.
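As an illustration only: Python's re module is not the Lucene engine, but it supports both \b and explicit anchors, so it can be used to check that the two spellings agree for a word that starts and ends with a word char. The sample word and the sample strings below are made up for this sketch.

```python
import re

WORD = "cat"  # hypothetical sample word; starts and ends with a word char

# Reference: the real word-boundary assertion.
# re.ASCII makes \w (and therefore \b) use [A-Za-z0-9_] only.
with_boundary = re.compile(r"\b" + WORD + r"\b", re.ASCII)

# Emulation: \b spelled out with anchors and a negated character class.
emulated = re.compile(r"(^|[^A-Za-z0-9_])" + WORD + r"($|[^A-Za-z0-9_])")

SAMPLES = ["cat", "a cat sat", "concat", "cats", "cat.", "(cat)", "_cat", "cat9"]


def agree(text: str) -> bool:
    """True when both patterns give the same found / not found answer."""
    return bool(with_boundary.search(text)) == bool(emulated.search(text))


if __name__ == "__main__":
    for text in SAMPLES:
        print(repr(text), bool(emulated.search(text)), agree(text))
```

Note that the emulated form consumes the neighbouring non-word char, whereas \b is zero-width, so the two are only compared here on whether a match exists, not on the matched span.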
Thus, we need to make sure that word is preceded by either a non-word char or the start of the string, and followed by either a non-word char or the end of the string. Since a Lucene regex is anchored by default (it has to match the whole string), all we need to do to make [^A-Za-z0-9_] optional at the start/end of the string is to add .* beside it and wrap the pair in an optional grouping construct:
(.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)?
Details
(.*[^A-Za-z0-9_])? - either the start of the string, or any 0+ chars (other than line break chars; use (.|\n)* instead of .* to allow those) followed by any char but a word char. In other words, it is the start of the string followed by 1 or 0 occurrences of the pattern inside the group.
word - the word itself.
([^A-Za-z0-9_].*)? - an optional sequence of any char but a word char followed by any 0+ chars, followed by the end of string position (implicit in a Lucene regex).
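The pattern above can be generated for an arbitrary word. The following is a minimal sketch, not a definitive implementation: the helper names (lucene_word_pattern, regexp_query, matches_whole_value) and the field name in the usage note are hypothetical, the word is assumed to consist of word chars only so that no escaping is needed, and Python's re.fullmatch is used merely to imitate Lucene's implicit anchoring for a quick local check.

```python
import re

WORD_CHARS = "A-Za-z0-9_"


def lucene_word_pattern(word: str) -> str:
    """Build (.*[^A-Za-z0-9_])?word([^A-Za-z0-9_].*)? for the given word."""
    # Assumption: the word contains word chars only, so nothing needs escaping.
    if not re.fullmatch(f"[{WORD_CHARS}]+", word):
        raise ValueError("this sketch only handles words made of [A-Za-z0-9_]")
    return f"(.*[^{WORD_CHARS}])?{word}([^{WORD_CHARS}].*)?"


def regexp_query(field: str, word: str) -> dict:
    """Wrap the pattern in an Elasticsearch regexp query body."""
    return {"query": {"regexp": {field: {"value": lucene_word_pattern(word)}}}}


def matches_whole_value(word: str, text: str) -> bool:
    """Local approximation: fullmatch stands in for Lucene's anchored matching."""
    return re.fullmatch(lucene_word_pattern(word), text) is not None


if __name__ == "__main__":
    print(regexp_query("title.keyword", "word"))  # field name is hypothetical
    for text in ["word", "a word here", "swordfish", "words", "pass-word!"]:
        print(repr(text), matches_whole_value("word", text))
```

The dictionary returned by regexp_query has the shape of a regexp query request body; how it is sent (client library, HTTP call) is left out on purpose. Keep in mind that a regexp query is evaluated against indexed terms, so the pattern is mainly useful on fields whose whole value is stored as a single term.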