What is the correct way to use unicode characters in a python regex
Question
In the process of scraping some documents using Python 2.7, I've run into some annoying page separators, which I've decided to remove. The separators use some funky characters. I already asked one question here on how to make these characters reveal their utf-8 codes. There are two non-ASCII characters used: '\xc2\xad' and '\x0c'. Now, I just need to remove these characters, as well as some spaces and the page numbers.
Elsewhere on SO, I've seen unicode characters used in tandem with regexps, but they're in a strange format that I don't have these characters in, e.g. '\u00ab'. In addition, none of them use ASCII as well as non-ASCII characters. Finally, the python docs are very light on the subject of unicode in regexes... something about flags... I don't know. Can anyone help?
Here is my current usage, which does not do what I want:
re.sub('\\xc2\\xad\s\d+\s\\xc2\\xad\s\\x0c', '', my_str)
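The targeted removal the question is aiming for can be sketched like this. This is a minimal Python 3 sketch, using bytes literals to stand in for Python 2 str values; the sample page text is hypothetical:

```python
import re

# Hypothetical page text containing the separators from the question:
# the UTF-8 soft hyphen bytes '\xc2\xad' around a page number, then a
# form feed '\x0c'. Bytes literals stand in for Python 2 str values.
page = b'first page text \xc2\xad 12 \xc2\xad \x0cnext page text'

# Match the whole separator run: soft-hyphen bytes, optional whitespace,
# the page number, the closing soft hyphen, then the form feed.
cleaned = re.sub(rb'\xc2\xad\s*\d+\s*\xc2\xad\s*\x0c\s*', b'', page)
print(cleaned)  # b'first page text next page text'
```

Note the single backslashes: in a raw pattern, `re` itself interprets `\xc2` as the byte 0xc2, so no doubling is needed.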
Answer
Rather than seek out specific unwanted chars, you could remove everything not wanted:
re.sub(r'[^\s!-~]', '', my_str)
This keeps only:
- whitespace (spaces, tabs, newlines, etc.)
- printable "normal" ASCII characters ('!' is the first printable char and '~' is the last under decimal 128)
You could include more chars if needed - just adjust the character class.