Easiest way to remove Unicode representations from a string in Python 3?


Problem description


I have a string in python 3 that has several unicode representations in it, for example:

t = 'R\\u00f3is\\u00edn'

and I want to convert t so that it has the proper representation when I print it, i.e.:

>>> print(t)
Róisín

However I just get the original string back. I've tried re.sub and some others, but I can't seem to find a way that will change these characters without having to iterate over each one. What would be the easiest way to do so?

Solution

You want to use the built-in codec unicode_escape.

If t is already a bytes object (an 8-bit string), it's as simple as this:

>>> print(t.decode('unicode_escape'))
Róisín
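As a minimal sketch of the bytes case, assuming the escapes arrive as literal backslash-u sequences in the data (the sample value and the name raw are illustrative, not from the question):

```python
# Hypothetical sample: each escape is a literal backslash, 'u', and four hex digits.
raw = b'R\\u00f3is\\u00edn'

# unicode_escape interprets the escape sequences and returns a str.
decoded = raw.decode('unicode_escape')

print(len(raw))      # 16 bytes before decoding
print(decoded)       # Róisín
print(len(decoded))  # 6 characters after decoding
```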

If t has already been decoded to Unicode (a str), you can encode it back to bytes and then decode it this way. If you're sure that all of your non-ASCII characters have been escaped, it doesn't much matter which ASCII-compatible codec you use for the encode. Otherwise, you could try to get your original byte string back, but it's simpler, and probably safer, to force any characters that Latin-1 cannot represent to get escaped with the backslashreplace error handler, and then they'll get decoded along with the already-escaped ones. (Encoding with unicode_escape itself does not help here: it also escapes the existing backslashes, so the round trip just returns the original string.)

>>> print(t.encode('latin-1', 'backslashreplace').decode('unicode_escape'))
Róisín
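The str case can be wrapped in a small helper. This is a sketch under stated assumptions, not a definitive implementation: the function name unescape and the sample string mixed are hypothetical, and the encode step uses latin-1 with the backslashreplace error handler so that existing backslashes are left alone.

```python
def unescape(text: str) -> str:
    """Turn literal \\uXXXX-style escapes inside a str into real characters."""
    # latin-1 keeps code points 0-255 as single bytes; backslashreplace turns
    # anything above that range into a backslash escape instead of failing.
    as_bytes = text.encode('latin-1', 'backslashreplace')
    # unicode_escape reads unescaped bytes as Latin-1, so they survive unchanged.
    return as_bytes.decode('unicode_escape')


t = 'R\\u00f3is\\u00edn'
print(unescape(t))      # Róisín

# Hypothetical mixed input: escaped ó, literal í, and a literal non-Latin-1 character.
mixed = 'R\\u00f3is\u00edn \u2222'
print(unescape(mixed))  # Róisín ∢
```

Note that every backslash sequence the codec recognizes gets interpreted, so a literal backslash followed by n in the input would become a newline.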

In case you want to know how to do this kind of thing with regular expressions in the future, note that re.sub lets you pass a function instead of a replacement string for repl. And you can convert any hex string into an integer by calling int(hexstring, 16), and any integer into the corresponding Unicode character with chr (note that this is the one bit that's different in Python 2, where you need unichr instead). So:

>>> import re
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', lambda matchobj: chr(int(matchobj.group(0)[2:], 16)), t)
'Róisín'

Or, making it a bit more clear:

>>> def unescapematch(matchobj):
...     escapesequence = matchobj.group(0)   # e.g. the six characters \u00f3
...     digits = escapesequence[2:]          # drop the leading backslash and 'u'
...     ordinal = int(digits, 16)            # parse the hex digits, not the whole escape
...     char = chr(ordinal)
...     return char
...
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, t)
'Róisín'
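The pattern above uses +, so it would swallow any extra hex digits that happen to follow an escape. A hedged variant that reads exactly 4 digits after a lowercase u and exactly 8 after an uppercase U might look like this (the names ESCAPE_RE, unescape_match, and unescape_with_regex are illustrative):

```python
import re

# Exactly 4 hex digits after \u, exactly 8 after \U.
ESCAPE_RE = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')


def unescape_match(match):
    # Only one of the two groups is set, depending on which alternative matched.
    digits = match.group(1) or match.group(2)
    return chr(int(digits, 16))


def unescape_with_regex(text):
    return ESCAPE_RE.sub(unescape_match, text)


print(unescape_with_regex('R\\u00f3is\\u00edn'))  # Róisín
print(unescape_with_regex('\\u22222'))            # ∢2 (the trailing 2 is left alone)
```

An 8-digit escape above 0010FFFF would make chr raise ValueError, so this sketch assumes well-formed input.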

The unicode_escape codec actually handles \U, \x, octal (\066), and special-character (\n) sequences as well as just \u, and it implements the proper rules for reading only the appropriate max number of digits (4 for \u, 8 for \U, etc., so r'\u22222' decodes to '∢2' rather than '
