Python regexp: do not match a sequence


Problem description

I need to wrap some MathJax strings in an HTML tag. How can I exclude \) from the pattern so that it does not match the whole string? With a single character it is easy, e.g. [^)], but what should I do when I need to exclude two characters in a row, such as \)?

search_str = "\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
out = re.sub(r'(\\\([^\\\)]+\\\))', '<span>\1</span>', search_str)

Recommended answer

You are trying to match any text other than the two-character sequence \) with [^\\\)]+. That cannot work, because [^...] is a negated character class: it matches a single character that is not in the set or range defined in the class. It can never match a combination of characters, and the * or + quantifiers only repeat that single-character match.
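To make the point concrete, here is a small sketch of how the negated class from the question behaves. The second test string is a made-up example, not from the question; the capture group is dropped so that findall returns whole matches.

```python
import re

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# [^\\\)] matches ONE character that is neither a backslash nor ")".
# Inside a class, "\\" is a literal backslash and "\)" is a literal ")".
negated = re.compile(r'\\\([^\\\)]+\\\)')

# The MathJax body starts with a backslash (\ce), so the class cannot
# consume even one character after \( and nothing matches.
print(negated.findall(search_str))  # []

# Hypothetical input: the pattern only works when the body contains
# no backslash and no ")" at all.
print(negated.findall(r"\(x+y\) and \(a(b)c\)"))  # ['\\(x+y\\)']
```

The second call finds only the first formula: the inner ")" in a(b)c ends the class run, and the character after it is not the backslash the pattern needs next.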

What you have in mind is called a tempered greedy token: (?:(?!\\\)).)* or, in its lazy form, (?:(?!\\\)).)*?.
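As an illustration, the tempered greedy token can be dropped between the two delimiters like this. This is a minimal sketch using the string from the question; the variable names are only for the example.

```python
import re

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

# (?:(?!\\\)).)* reads: "as long as the next two characters are not \),
# consume one character". The lookahead is re-checked before every character.
tempered = re.compile(r'\\\((?:(?!\\\)).)*\\\)')

matches = tempered.findall(search_str)
for m in matches:
    print(m)
```

Each of the two formulas is matched separately: the plain ")" after H2O does not stop the token, because only the two-character sequence \) is excluded.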

However, the tempered greedy token is not the best practice in this case. See the rexegg.com note on when not to use TGT:

For the task at hand, this technique presents no advantage over the lazy dot-star .*?{END}. Although their logic differs, at each step, before matching a character, both techniques force the engine to look if what follows is {END}.

The comparative performance of these two versions will depend on your engine's internal optimizations. The pcretest utility indicates that PCRE requires far fewer steps for the lazy-dot-star version. On my laptop, when running both expressions a million times against the string {START} Mary {END}, pcretest needs 400 milliseconds per 10,000 runs for the lazy version and 800 milliseconds for the tempered version.

Therefore, if the string that tempers the dot is a delimiter that we intend to match eventually (as with {END} in our example), this technique adds nothing to the lazy dot-star, which is better optimized for this job.

Your strings seem to be well-formed and rather short, so a plain lazy dot pattern is enough: \\\(.*?\\\).
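The figures quoted from rexegg.com are for PCRE, so they do not carry over directly to Python's re module. A rough way to check both patterns on your own machine is sketched below. It assumes the string from the question and 10,000 runs; the timings vary by machine and Python version, so no expected numbers are shown.

```python
import re
import timeit

search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"

lazy = re.compile(r'\\\(.*?\\\)')                 # lazy dot pattern
tempered = re.compile(r'\\\((?:(?!\\\)).)*\\\)')  # tempered greedy token

# On this input both patterns find exactly the same spans.
assert lazy.findall(search_str) == tempered.findall(search_str)

# Time each compiled pattern; only the relative difference is meaningful.
for name, rx in (("lazy", lazy), ("tempered", tempered)):
    seconds = timeit.timeit(lambda: rx.findall(search_str), number=10_000)
    print(f"{name}: {seconds:.3f} s for 10,000 runs")
```

Since both patterns return identical matches here, the shorter lazy form is the easier one to read and maintain.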

Besides, you need the r prefix (a raw string literal) on the replacement pattern; otherwise Python parses \1 as an octal escape, the single character \x01 (start of heading), rather than as a backreference to group 1.
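The difference between the two literals is easy to see in isolation. The short string "abc" below is a made-up input used only for this demonstration.

```python
import re

# In a normal string literal, \1 is an octal escape: one character with code 1.
print(len('\1'), repr('\1'))    # 1 '\x01'
# In a raw string literal it stays as two characters: a backslash and "1".
print(len(r'\1'), repr(r'\1'))  # 2 '\\1'

text = "abc"
# Without the r prefix the control character is inserted literally.
print(repr(re.sub(r'(b)', '<\1>', text)))   # 'a<\x01>c'
# With the r prefix, re.sub sees \1 and substitutes group 1.
print(repr(re.sub(r'(b)', r'<\1>', text)))  # 'a<b>c'
```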

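import re

# Raw string literal, so the backslashes in the MathJax source are kept as-is.
search_str = r"\(\ce{\sigma_{s}^{b}(H2O)}\) bla bla \(\ce{\sigma_{s}^{b}(H2O)}\)"
print(search_str)
# Lazy .*? stops at the first \) after each \(; r'...' keeps \1 as a backreference.
out = re.sub(r'(\\\(.*?\\\))', r'<span>\1</span>', search_str)
print(out)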

See the Python demo.
