Easiest way to remove unicode representations from a string in python 3?

Question

I have a string in Python 3 that contains several escaped unicode representations, for example:

t = 'R\\u00f3is\\u00edn'

and I want to convert t so that it has the proper representation when I print it, i.e.:

>>> print(t)
Róisín

However, I just get the original string back. I've tried re.sub and a few other approaches, but I can't seem to find a way to convert these characters without iterating over each one.

What would be the easiest way to do so?
Answer

You want to use the built-in codec unicode_escape.

If t is already a bytes (an 8-bit string), it's as simple as this:

>>> print(t.decode('unicode_escape'))
Róisín

If t has already been decoded to str, you need to encode it back to bytes and then decode it this way. If you're sure that all of your non-ASCII characters have been escaped, any ASCII-compatible codec (ascii, latin-1, etc.) will do for the encode step. Be careful not to encode with unicode_escape itself, though: that escapes the existing backslashes, so the subsequent decode just hands you the original string back. If the string may also contain literal non-ASCII characters, encode with latin-1 and the backslashreplace error handler, which forces any non-encoded characters into escape sequences so that they get decoded along with the already-escaped ones:

>>> print(t.encode('latin-1', 'backslashreplace').decode('unicode_escape'))
Róisín

In case you want to know how to do this kind of thing with regular expressions in the future, note that sub lets you pass a function instead of a pattern for the repl. You can convert any hex string into an integer by calling int(hexstring, 16), and any integer into the corresponding Unicode character with chr (note that this is the one bit that's different in Python 2, where you need unichr instead). So:

>>> re.sub(r'(\\u[0-9A-Fa-f]+)', lambda matchobj: chr(int(matchobj.group(0)[2:], 16)), t)
'Róisín'

Or, making it a bit more clear:

>>> def unescapematch(matchobj):
...     escapesequence = matchobj.group(0)
...     digits = escapesequence[2:]
...     ordinal = int(digits, 16)
...     char = chr(ordinal)
...     return char
>>> re.sub(r'(\\u[0-9A-Fa-f]+)', unescapematch, t)
'Róisín'
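Putting the two cases together: the sketch below wraps them in a small helper (the name unescape is my own, not from the answer) that accepts either bytes or str. It swaps in latin-1 with the backslashreplace error handler for the encode step, since encoding a str with unicode_escape would escape the existing backslashes as well.

```python
def unescape(value):
    """Decode backslash escape sequences from bytes or str (hypothetical helper)."""
    if isinstance(value, bytes):
        # An 8-bit string can be decoded directly.
        return value.decode('unicode_escape')
    # A str must become bytes first. latin-1 passes characters through
    # byte-for-byte, and backslashreplace turns anything outside Latin-1
    # into an escape sequence, so literal characters survive the round
    # trip alongside the already-escaped ones.
    return value.encode('latin-1', 'backslashreplace').decode('unicode_escape')

print(unescape('R\\u00f3is\\u00edn'))   # Róisín
print(unescape(b'R\\u00f3is\\u00edn'))  # Róisín
```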
The unicode_escape codec actually handles \U, \x, octal (\066), and special-character (\n) sequences as well as just \u, and it implements the proper rules for reading only the appropriate maximum number of digits (4 for \u, 8 for \U, and so on), so r'\u22222' decodes to '∢2' rather than consuming all five digits as a single escape.
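A quick check of those other escape forms, as a sketch of the codec rules just described:

```python
# unicode_escape understands more than \uXXXX:
print(b'caf\\xe9'.decode('unicode_escape'))   # \x hex escape: café
print(b'\\066'.decode('unicode_escape'))      # octal escape: 6
print(b'a\\nb'.decode('unicode_escape'))      # special character: a, newline, b
# \u consumes exactly four hex digits, so the fifth stays literal:
print(b'\\u22222'.decode('unicode_escape'))   # ∢2
```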