Python re.sub() and Unicode


Question

I want to replace all emoji with '', but my regex doesn't work.
For example:

content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'

I want to replace every form like \U0001f633 with '', so I wrote this code:

print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)

But it doesn't work.
Thanks a lot.

Recommended answer

You won't be able to recognize properly decoded Unicode code points that way (that is, as strings containing \uXXXX and the like). Once the string has been decoded, each escape is a single character by the time the regex engine sees it (or a surrogate pair of two code units on a 16-bit build), so there is no literal backslash sequence left to match.
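To see why the pattern from the question finds nothing, it can help to compare the decoded character with its escaped source text. This is a minimal sketch, assuming Python 3.3 or later, where every code point is a single character; the names `emoji`, `escaped`, and `literal_pattern` are illustrative and not part of the original post.

```python
import re

# Assumes Python 3.3+, where a str always stores full code points.
emoji = '\U0001f633'      # the decoded character itself
escaped = r'\U0001f633'   # raw string: a backslash, 'U', then 8 hex digits

# The question's pattern: a literal backslash, 'U', and 8 hex digits.
literal_pattern = re.compile(r'\\U[0-9a-fA-F]{8}')

print(len(emoji))                                   # length of the decoded character
print(literal_pattern.search(emoji))                # no match: there is no backslash in it
print(len(escaped))                                 # length of the escape text
print(literal_pattern.search(escaped) is not None)  # the escape text does match
```

The pattern only matches text that still contains the escape sequence as written in source code, which is not what `content` holds after the literal has been parsed.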

Depending on whether your Python was compiled with only 16-bit Unicode code points or not, you'll want a pattern something like one of these:

# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

# 32-bit* codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

Your code would then look like this:

import re

# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')

content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)

stripped = re_strip.sub('', content)
print(stripped)

Both expressions (each on the build it is meant for) reduce the number of characters in the stripped string to 26.

These expressions strip out the emoji you were after, but they may also strip out other characters that you do want to keep, because they match everything outside the Basic Multilingual Plane. It may be worth reviewing a listing of Unicode code point ranges and narrowing the patterns accordingly.
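As one possible way to narrow the pattern, the sketch below restricts the character class to a few Unicode blocks that contain many emoji. It assumes Python 3.3 or later (or a wide Python 2 build); the block selection and the names `re_emoji`, `re_broad`, and `sample` are illustrative assumptions, not an exhaustive emoji definition and not part of the original answer.

```python
import re

# Illustrative selection of blocks that hold many emoji; adjust to your data.
re_emoji = re.compile(
    u'['
    u'\U0001F300-\U0001F5FF'   # Miscellaneous Symbols and Pictographs
    u'\U0001F600-\U0001F64F'   # Emoticons
    u'\U0001F680-\U0001F6FF'   # Transport and Map Symbols
    u'\U0001F900-\U0001F9FF'   # Supplemental Symbols and Pictographs
    u']'
)

# The broad pattern from the answer, for comparison.
re_broad = re.compile(u'[\U00010000-\U0010FFFF]')

# U+20000 is a CJK Extension B ideograph: outside the BMP, but not an emoji.
sample = u'\u86cb\u767d\U0001f633\U00020000'

print(len(re_broad.sub(u'', sample)))   # the broad pattern also removes U+20000
print(len(re_emoji.sub(u'', sample)))   # the narrowed pattern keeps it
```

For text that mixes emoji with rare CJK ideographs or other supplementary-plane characters, a narrowed class like this avoids deleting the characters you want to keep, at the cost of having to maintain the list of ranges.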

You can determine whether your Python install only recognizes 16-bit code points by doing something like:

import sys
print(sys.maxunicode.bit_length())

If this displays 16, you'll need the first regular expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.

Neither expression will work on a Python install whose sys.maxunicode does not match the build it was written for.
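The choice between the two patterns can also be made at runtime rather than by hand. The following is a minimal sketch of that idea; the helper name `build_strip_pattern` is hypothetical, and the shortened `content` string is used only for illustration.

```python
import re
import sys


def build_strip_pattern():
    """Return a compiled pattern matching one non-BMP character on this build.

    Hypothetical helper, not part of the original answer.
    """
    if sys.maxunicode > 0xFFFF:
        # Wide build (and every Python 3.3+): one code point per character.
        return re.compile(u'[\U00010000-\U0010FFFF]')
    # Narrow build: non-BMP characters appear as surrogate pairs.
    return re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


re_strip = build_strip_pattern()

# Shortened version of the sample string, with two emoji to remove.
content = u'[\u86cb\u767d12\U0001f633\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
stripped = re_strip.sub(u'', content)
print(stripped)
print(len(stripped))
```

Checking `sys.maxunicode` directly keeps the same script usable on both kinds of build without commenting patterns in and out.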

See also: this related question.
