Match a whole word in a string using dynamic regex


Question

I want to check whether a word occurs in a sentence using regex. Words are separated by spaces, but may have punctuation on either side. If the word is in the middle of the string, the following pattern works (it prevents partial-word matches and allows punctuation on either side of the word):

match_middle_words = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d ]{0,} "

However, this won't match the first or last word, since there is no leading/trailing space. So, for these cases, I have also been using:

match_starting_word = "^[^a-zA-Z\d]{0,}" + word + "[^a-zA-Z\d ]{0,} "
match_end_word = " [^a-zA-Z\d ]{0,}" + word + "[^a-zA-Z\d]{0,}$"

I then combine them:

match_string = match_middle_words + "|" + match_starting_word + "|" + match_end_word

Is there a simple way to avoid the need for three match terms? Specifically, is there a way of specifying "either a space or the start of the file" (i.e. "^") and, similarly, "either a space or the end of the file" (i.e. "$")?

Recommended answer

Why not use a word boundary?

match_string = r'\b' + word + r'\b'
match_string = r'\b{}\b'.format(word)
match_string = rf'\b{word}\b'          # f-string, requires Python 3.6+

If you have a list of words (say, in a words variable) that should each be matched as a whole word, use

match_string = r'\b(?:{})\b'.format('|'.join(words))
match_string = rf'\b(?:{"|".join(words)})\b'         # f-string, requires Python 3.6+

In this case, the word is matched only when it is surrounded by non-word characters. Also note that \b matches at the start and end of the string (when the first or last character is a word character), so there is no need for the three alternatives.

Sample code:

import re
strn = "word hereword word, there word"
search = "word"
print(re.findall(r"\b" + search + r"\b", strn))

And we find our 3 matches:

['word', 'word', 'word']
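As a quick check of the list form and of the claim that \b also matches at the start and end of the string, here is a minimal sketch. The helper name `whole_word_pattern` and the sample sentence are illustrative assumptions, not part of the original answer.

```python
import re

def whole_word_pattern(words):
    """Build a pattern matching any of `words` as a whole word (hypothetical helper)."""
    # Assumes every word consists of word characters only, so no escaping is done here.
    return re.compile(r'\b(?:{})\b'.format('|'.join(words)))

pattern = whole_word_pattern(['word', 'there'])
sentence = "word, hereword (there) sword word"

# "hereword" and "sword" are skipped because there is no boundary inside them;
# the first and last "word" match even though they touch the string edges.
matches = pattern.findall(sentence)
print(matches)  # ['word', 'there', 'word']
```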

A note on "word" boundaries

When the "words" are in fact arbitrary chunks of characters, you should re.escape them before inserting them into the regex pattern:

match_string = r'\b{}\b'.format(re.escape(word)) # a single escaped "word" string passed
match_string = r'\b(?:{})\b'.format("|".join(map(re.escape, words))) # words list is escaped
match_string = rf'\b(?:{"|".join(map(re.escape, words))})\b' # Same as above as an f-string (Python 3.6+)
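To see why escaping matters, the sketch below compares an unescaped and an escaped pattern for a "word" that contains a regex metacharacter. The sample word `a.b` and the test string are made up for illustration.

```python
import re

word = "a.b"  # the dot is a regex metacharacter: unescaped, it matches any character

unescaped = r'\b{}\b'.format(word)
escaped = r'\b{}\b'.format(re.escape(word))

text = "axb a.b"
unescaped_hits = re.findall(unescaped, text)
escaped_hits = re.findall(escaped, text)

print(unescaped_hits)  # ['axb', 'a.b']  (the dot also matched "x")
print(escaped_hits)    # ['a.b']
```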

If the words to be matched as whole words may start or end with special (non-word) characters, \b won't work; use unambiguous word boundaries instead:

match_string = r'(?<!\w){}(?!\w)'.format(re.escape(word))
match_string = r'(?<!\w)(?:{})(?!\w)'.format("|".join(map(re.escape, words))) 
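The following sketch illustrates the difference, using `c++` as an assumed example of a "word" that ends with non-word characters. It records match start positions so the two patterns can be told apart.

```python
import re

word = "c++"
text = "c++ c++11"  # standalone "c++" at index 0, and "c++" inside "c++11" at index 4

with_b = r'\b{}\b'.format(re.escape(word))
with_lookarounds = r'(?<!\w){}(?!\w)'.format(re.escape(word))

# \b needs a word/non-word transition. After the final "+" that only happens
# when a word character follows, so \b picks the occurrence inside "c++11".
b_starts = [m.start() for m in re.finditer(with_b, text)]

# The lookarounds only require that no word character is adjacent,
# so they pick the standalone "c++" and reject "c++11".
lookaround_starts = [m.start() for m in re.finditer(with_lookarounds, text)]

print(b_starts)           # [4]
print(lookaround_starts)  # [0]
```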

If the word boundaries are whitespace characters or the start/end of the string, use whitespace boundaries, (?<!\S)...(?!\S):

match_string = r'(?<!\S){}(?!\S)'.format(re.escape(word))
match_string = r'(?<!\S)(?:{})(?!\S)'.format("|".join(map(re.escape, words)))
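Whitespace boundaries are stricter than \b: a word with punctuation attached is no longer matched. The sketch below compares the two on a made-up sentence; which behavior is wanted depends on the use case (the question as asked allows punctuation next to the word, which corresponds to the \b behavior).

```python
import re

word = "word"
text = "word, (word) word"

word_boundary = r'\b{}\b'.format(re.escape(word))
space_boundary = r'(?<!\S){}(?!\S)'.format(re.escape(word))

# \b accepts punctuation next to the word, so all three occurrences match.
word_starts = [m.start() for m in re.finditer(word_boundary, text)]

# (?<!\S) / (?!\S) require whitespace or a string edge on both sides,
# so only the last, bare "word" matches.
space_starts = [m.start() for m in re.finditer(space_boundary, text)]

print(word_starts)   # [0, 7, 13]
print(space_starts)  # [13]
```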
