What is the correct way to use unicode characters in a python regex
Question
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their utf-8 codes. There are two non-ASCII characters used: '\xc2\xad' and '\x0c'. Now, I just need to remove these characters, as well as some spaces and the page numbers.
Elsewhere on SO, I've seen unicode characters used in tandem with regexps, but they're in a strange format that I don't have these characters in, e.g. '\u00ab'. In addition, none of them use ASCII as well as non-ASCII characters. Finally, the python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?
Here is my current usage, which does not do what I want:
re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
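The targeted removal the question is aiming for can be sketched like this. This is a minimal Python 3 sketch, using bytes literals to stand in for Python 2 str values; the sample page text is hypothetical:

```python
import re

# Hypothetical page text containing the separators from the question:
# the UTF-8 soft hyphen bytes '\xc2\xad' around a page number, then a
# form feed '\x0c'. Bytes literals stand in for Python 2 str values.
page = b'first page text \xc2\xad 12 \xc2\xad \x0cnext page text'

# Match the whole separator run: soft-hyphen bytes, optional whitespace,
# the page number, the closing soft hyphen, then the form feed.
cleaned = re.sub(rb'\xc2\xad\s*\d+\s*\xc2\xad\s*\x0c\s*', b'', page)
print(cleaned)  # b'first page text next page text'
```

Note the single backslashes: in a raw pattern, `re` itself interprets `\xc2` as the byte 0xc2, so no doubling is needed.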
Answer
Rather than seek out specific unwanted chars, you could remove everything not wanted:
re.sub(r'[^\s!-~]', '', my_str)
This keeps only:
- whitespace (spaces, tabs, newlines, etc.)
- printable "normal" ASCII characters ('!' is the first printable char and '~' is the last under decimal 128)
You could include more chars if needed - just adjust the character class.