What is the correct way to use Unicode characters in a Python regex?

Question

In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their UTF-8 codes. There are two non-ASCII characters used: '\xc2\xad', and '\x0c'. Now, I just need to remove these characters, as well as some spaces and the page numbers.

Elsewhere on SO, I've seen Unicode characters used in tandem with regexps, but in a strange format that I do not have these characters in, e.g. '\u00ab'. In addition, none of those examples mix ASCII and non-ASCII characters. Finally, the Python docs are very light on the subject of Unicode in regexes... something about flags... I don't know. Can anyone help?

Here is my current usage, which does not do what I want:

re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)

Recommended answer

Rather than seek out specific unwanted chars, you could remove everything not wanted:

re.sub('[^\\s!-~]', '', my_str)

This removes every character that is not:

  • whitespace (spaces, tabs, newlines, etc.)
  • a printable "normal" ASCII character (! is the first printable character and ~ is the last one below decimal 128)
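
As a quick check of what that character class keeps and drops, here is a minimal sketch. It assumes the scraped text has already been decoded to a Unicode string (in Python 2.7 that would be something like my_str.decode('utf-8'), which turns the bytes '\xc2\xad' into the single soft-hyphen character u'\u00ad'). The sample string and the names NOT_WANTED and strip_non_ascii are made up for illustration.

```python
import re

# Matches any character that is neither whitespace nor in the
# printable ASCII range '!' (0x21) through '~' (0x7E).
NOT_WANTED = re.compile(u'[^\\s!-~]')


def strip_non_ascii(text):
    """Remove every character outside whitespace and printable ASCII."""
    return NOT_WANTED.sub(u'', text)


# Hypothetical sample: text, a soft hyphen (U+00AD), a page number,
# another soft hyphen, a form feed (U+000C), then more text.
sample = u'end of page\u00ad 12 \u00ad \x0cnext page'
cleaned = strip_non_ascii(sample)
```

Note that only the soft hyphens disappear here. The form feed '\x0c' counts as whitespace for \s, so it survives, and so does the page number, since digits are printable ASCII.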

You could include more chars if needed - just adjust the character class.
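
One possible adjustment, sketched under the same assumption that the text is already a Unicode string: spell out the whitespace to keep instead of using \s, so that the form feed is removed too, and add a range for accented Latin-1 letters. The range U+00C0 to U+00FF, the name KEEP_MORE, and the sample string are illustrative choices, not part of the original answer.

```python
import re

# Variation on the class above: list the whitespace to keep explicitly
# (space, tab, newline, carriage return) so the form feed \x0c is removed,
# and additionally keep the Latin-1 range U+00C0 through U+00FF.
KEEP_MORE = re.compile(u'[^ \\t\\n\\r!-~\u00c0-\u00ff]')

# Hypothetical sample containing accented letters, soft hyphens (U+00AD),
# a page number and a form feed.
sample = u'caf\u00e9\u00ad 12 \u00ad \x0cna\u00efve'
cleaned = KEEP_MORE.sub(u'', sample)
```

The soft hyphen U+00AD falls outside the added range, so it is still removed, while the accented letters are kept. The page number and surrounding spaces remain and would need a separate pattern.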
