Regular expression to replace "escaped" characters with their originals


Question

Note: I'm not parsing lots of HTML or generic HTML with regex. I know that's bad.

TL;DR:

I have strings like:

A sentence with an exclamation\! Next is a \* character

These contain "escaped" characters from the original markup. I wish to replace them with their "originals", to get:

A sentence with an exclamation! Next is a * character


I have a small bit of data that I need to extract from some wiki markup.

I'm only dealing with paragraphs/snippets here, so I don't need a big robust solution. In Python, I tried a test:

s = "test \\* \\! test * !! **"

r = re.compile("""\\.""") # Slash followed by anything

r.sub("-", s)

This should yield:

test - - test * !! **

But it doesn't do anything. Am I missing something here?

Furthermore, I'm not sure how to go about replacing any given escaped character with its original, so I would probably just make a list and substitute with specific regexes like:

\\\*

\\!

There's probably a much cleaner way to do this, so any help is greatly appreciated.

Recommended answer

You are missing something, namely the r prefix:

r = re.compile(r"\\.") # Slash followed by anything

Both Python and re attach meaning to \; your doubled backslash becomes just one backslash by the time the string value reaches re.compile(), so re sees \., meaning a literal full stop:

>>> print("""\\.""")
\.

By using r'' you tell Python not to interpret escape codes, so now re is given a string containing \\., meaning a literal backslash followed by any character:

>>> print(r"""\\.""")
\\.
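
To see what each layer receives, the two spellings can be compared side by side. This is a small sketch; the variable names plain and raw are illustrative and not from the original post.

```python
import re

plain = "\\."   # Python collapses the doubled backslash: 2 characters, \ and .
raw = r"\\."    # the raw literal keeps both backslashes: 3 characters

print(len(plain), len(raw))  # 2 3

s = "test \\* \\! test * !! **"  # the test string from the question

# The non-raw pattern asks re for a literal full stop, which s does not contain,
# so the substitution leaves the string unchanged.
print(re.sub(plain, "-", s) == s)  # True

# The raw pattern asks re for a backslash followed by any character.
print(re.sub(raw, "-", s))  # test - - test * !! **
```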

Demonstration:

>>> import re
>>> s = "test \\* \\! test * !! **"
>>> r = re.compile(r"\\.") # Slash followed by anything
>>> r.sub("-", s)
'test - - test * !! **'

The rule of thumb is: when defining regular expressions, use r'' raw string literals, which saves you from having to double-escape everything that has meaning to both Python and regular expression syntax.
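
As an illustration of what the raw literal saves, the same pattern can be written without the r prefix, at the cost of escaping every backslash twice. The names escaped_twice and raw are made up for this sketch.

```python
import re

escaped_twice = "\\\\."  # non-raw: each pair of backslashes becomes one
raw = r"\\."             # raw: written exactly as re should see it

# Both literals produce the same three-character pattern string.
print(escaped_twice == raw)  # True

s = "test \\* \\! test * !! **"
print(re.sub(escaped_twice, "-", s))  # test - - test * !! **
```

Both forms behave identically; the raw form is simply easier to read and harder to get wrong.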

Next, you want to replace the "escaped" character with itself; use groups for that, since re.sub() lets you reference groups in the replacement value:

r = re.compile(r"\\(.)") # Note the parentheses; that's a capturing group
r.sub(r'\1', s)          # \1 means: replace with value of first capturing group

Now the output is:

>>> r = re.compile(r"\\(.)") # Note the parentheses; that's a capturing group
>>> r.sub(r'\1', s)
'test * ! test * !! **'
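
The same substitution can be wrapped in a small helper and applied to the sentence from the question. This is a minimal sketch: the names ESCAPED and unescape are hypothetical, and only the pattern itself comes from the answer.

```python
import re

# Hypothetical names; the pattern is the capturing-group regex shown above.
ESCAPED = re.compile(r"\\(.)")


def unescape(text):
    """Replace each backslash-escaped character with the character itself."""
    return ESCAPED.sub(r"\1", text)


print(unescape(r"A sentence with an exclamation\! Next is a \* character"))
# A sentence with an exclamation! Next is a * character

# An escaped backslash collapses to a single backslash.
print(unescape(r"a \\ b"))
# a \ b
```

By default . does not match a newline, so a backslash at the very end of a line is left in place. If only certain characters should be unescaped, a character class such as r"\\([*!])" can be used in place of the dot.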
