How to match c++ with word boundaries


Problem description

I want to match the word "c++" with word boundaries in Python 3. But my guess is that the \b also triggers on the plus sign.

I've simplified down to the following test case for clarity:

\bc\+\+\b

I'm hoping that I can keep the word boundaries but change its settings somehow.

The reason for this is that I want to use the regex as the token_pattern of a TfidfVectorizer, where I have no control over how the pattern is applied.

Link to online regex tool

Solution

The ways to influence the "behaviour" of character classes are very limited. They are called flags:

re.ASCII ... re.VERBOSE

For example, they allow r'.' to match newlines (re.DOTALL), change the behaviour of ^ and $ (re.MULTILINE), or make your regex match case-insensitively (re.IGNORECASE).
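As a quick illustration of the three flags just mentioned, here is a minimal sketch using only the standard re module. The sample text and variable names are made up for the example.

```python
import re

text = "First line\nsecond LINE"

# Without re.DOTALL, '.' stops at the newline; with it, '.' crosses it.
without_dotall = re.match(r".+", text).group()
with_dotall = re.match(r".+", text, re.DOTALL).group()

# re.MULTILINE lets ^ (and $) match at the start (and end) of every line.
single_line_starts = re.findall(r"^\w+", text)
multi_line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# re.IGNORECASE makes literal characters match either case.
any_case = re.findall(r"line", text, re.IGNORECASE)

print(without_dotall)      # First line
print(repr(with_dotall))   # 'First line\nsecond LINE'
print(single_line_starts)  # ['First']
print(multi_line_starts)   # ['First', 'second']
print(any_case)            # ['line', 'LINE']
```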

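None of them makes \b treat '+' as a word character. In \bc\+\+\b the trailing \b sits after a '+', which is a non-word character, so it only matches when a word character follows immediately; 'c++' followed by a space or by the end of the string never matches. If you want to match c++ with word boundaries, you have to mimic the \b behaviour in your pattern: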

\b    Matches the empty string, but only at the beginning or end of a word. 
      A word is defined as a sequence of word characters. Note that formally, 
      \b is defined as the boundary between a \w and a \W character (or vice versa), 
      or between \w and the beginning/end of the string. This means that r'\bfoo\b' 
      matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

Source: https://docs.python.org/3/library/re.html#regular-expression-syntax
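The definition quoted above explains why the pattern from the question fails. The following sketch (the sample strings are invented for illustration) shows that the trailing \b only succeeds when a word character directly follows the second '+':

```python
import re

pattern = re.compile(r"\bc\+\+\b")

# '+' is a non-word character (\W), so the trailing \b needs a word character
# (\w) right after the second '+'. A space or the end of the string does not
# form a boundary there.
followed_by_space = pattern.search("I like c++ a lot")
at_end_of_string = pattern.search("c++")
followed_by_word_char = pattern.search("c++x")

print(followed_by_space)              # None
print(at_end_of_string)               # None
print(followed_by_word_char.group())  # c++
```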

The easiest approach is probably to match 'c++' with a word boundary before it and a whitespace or non-word character after it: r'\bc\+\+[\s\W]'. However, this would also match 'c+++', because the third '+' is itself a non-word character. If you want to match only 'c++' and not 'c+++', put '\s' into the character class and extend it with the other characters you want to allow:

r'\b(c\+\+)[\s.,!?]' 

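Extend the characters in the brackets to cover whatever else may follow c++. Because they sit outside the group (c\+\+), they must be present for a match but are not included in the group. Note that one of these characters is required, so a 'c++' at the very end of the string is not matched.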
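A small sketch of how this pattern behaves with re.findall, which returns the group contents when the pattern has exactly one group. The sample text is invented. The second pattern is not part of the answer above: it is an alternative that swaps in lookarounds, shown here under the assumption that 'c++' should also match at the end of the string and that nothing should be consumed after it. It has not been checked against TfidfVectorizer.

```python
import re

text = "c++ is fun, c+++ is not, abc++ neither. I like c++"

# Pattern from the answer: one character from the class must follow 'c++',
# so the final 'c++' at the end of the string is missed.
answer_pattern = re.compile(r"\b(c\+\+)[\s.,!?]")
answer_hits = answer_pattern.findall(text)

# Alternative (assumption, not from the answer): lookarounds check the
# surroundings without consuming them.
#   (?<!\w)   no word character directly before 'c'
#   (?![\w+]) no word character and no further '+' directly after 'c++'
lookaround_pattern = re.compile(r"(?<!\w)c\+\+(?![\w+])")
lookaround_hits = lookaround_pattern.findall(text)

print(answer_hits)      # ['c++']
print(lookaround_hits)  # ['c++', 'c++']
```

Both patterns reject 'c+++' and 'abc++'; they differ only in how they treat the 'c++' at the end of the string.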

As for regex testing tools, consider switching to https://regex101.com/. It supports Python, and you can save your pattern together with the test text and share a link to it:

https://regex101.com/r/6XtVTS/1
