Unescape unicode-escapes, but not carriage returns and line feeds, in Python
Problem description
I have an ASCII-encoded JSON file with unicode-escapes (e.g., `\u201cquotes\u201d`) and newlines escaped within strings (e.g., `"foo\r\nbar"`). Is there a simple way in Python to generate a UTF-8 encoded file by un-escaping the unicode-escapes, but leaving the newline escapes intact?
Calling `decode('unicode-escape')` on the string will decode the unicode escapes (which is what I want), but it will also decode the carriage returns and newlines (which I don't want).
Recommended answer
Sure there is: use the right tool for the job and ask the `json` module to decode the data to a Python `unicode` object, then encode the result to UTF-8:
import json
json.loads(input).encode('utf8')
Use `unicode-escape` only for actual Python string literals. JSON strings are not the same as Python strings, even though they may, at first glance, look very similar.
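One place where the two formats differ is characters outside the Basic Multilingual Plane, which JSON writes as a surrogate pair of two `\uXXXX` escapes. The following is a minimal sketch of that difference, assuming Python 3 (the snippets in this answer are Python 2); the variable names are only illustrative.

```python
import json

# JSON-style escape for U+1F600, written as a surrogate pair.
escaped = r'\ud83d\ude00'

# The JSON decoder combines the pair into a single code point.
via_json = json.loads('"' + escaped + '"')

# The unicode_escape codec treats each \uXXXX on its own and
# leaves two separate surrogate code points in the result.
via_codec = escaped.encode('ascii').decode('unicode_escape')

print(len(via_json), len(via_codec))
```

The string produced by the codec contains lone surrogates, so it is not the same text as the one the JSON decoder returns.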
Short demo (bear in mind that the Python interactive interpreter echoes strings as literals):
>>> json.loads(r'"\u201cquotes\u201d"').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> json.loads(r'"foo\r\nbar"').encode('utf8')
'foo\r\nbar'
Note that the JSON decoder decodes `\r` and `\n` just like a Python literal would.
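If the output should still be a JSON file, one option is to serialize the decoded data again with `ensure_ascii=False`. The encoder then writes non-ASCII characters directly but still escapes control characters such as carriage returns and line feeds. This is a minimal sketch assuming Python 3; the helper name `reencode_json_utf8` is hypothetical and not part of the original answer.

```python
import json


def reencode_json_utf8(ascii_json_text):
    """Return UTF-8 encoded JSON with non-ASCII characters written directly."""
    # Decoding resolves every escape, including \r and \n.
    data = json.loads(ascii_json_text)
    # Re-encoding escapes control characters again, but with
    # ensure_ascii=False it leaves non-ASCII characters unescaped.
    return json.dumps(data, ensure_ascii=False).encode('utf-8')


result = reencode_json_utf8(r'"\u201cfoo\r\nbar\u201d"')
print(result.decode('utf-8'))
```

Because the document is parsed and serialized again, whitespace and other formatting of the original file are not preserved; only the data is.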
If you absolutely have to process only the `\uabcd` unicode literals in the JSON input and leave the rest intact, then you need to resort to a regular expression:
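import re

# Match only \uXXXX escapes; other escapes such as \r and \n are left alone.
codepoint = re.compile(r'(\\u[0-9a-fA-F]{4})')

def replace(match):
    # Strip the leading "\u" and convert the four hex digits to a character.
    return unichr(int(match.group(1)[2:], 16))

codepoint.sub(replace, text).encode('utf8')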
This gives:
>>> codepoint.sub(replace, r'\u201cquotes\u201d').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> codepoint.sub(replace, r'"foo\r\nbar"').encode('utf8')
'"foo\\r\\nbar"'
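The simple pattern above has a few edge cases: it would also rewrite an escaped backslash followed by `u201c` (the six characters `\\u201c`), it converts each half of a surrogate pair separately, and it would turn escapes such as `\u000a` or `\u0022` into raw characters that are not valid inside a JSON string. The following is a sketch of a stricter variant, assuming Python 3; the names `ESCAPE` and `unescape_unicode_only` are hypothetical and not part of the original answer.

```python
import re

# Tokenize every backslash escape so that "\\" is consumed as one unit
# and can never be mistaken for the start of a \uXXXX escape.
ESCAPE = re.compile(
    r'\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})'
    r'|\\u([0-9a-fA-F]{4})'
    r'|\\.',
    re.DOTALL,
)


def _replace(match):
    high, low, single = match.groups()
    if high is not None:
        # Combine a high/low surrogate pair into one code point.
        code = 0x10000 + ((int(high, 16) - 0xD800) << 10) + (int(low, 16) - 0xDC00)
        return chr(code)
    if single is not None:
        code = int(single, 16)
        # Keep control characters, the double quote, the backslash and
        # lone surrogates escaped so the text stays valid JSON.
        if code < 0x20 or code in (0x22, 0x5C) or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    # Any other escape (\r, \n, \\, \" and so on) is returned unchanged.
    return match.group(0)


def unescape_unicode_only(text):
    return ESCAPE.sub(_replace, text)


utf8_bytes = unescape_unicode_only(r'"\u201cfoo\r\nbar\u201d"').encode('utf-8')
```

The trade-off is a more involved pattern in exchange for output that should still parse as JSON, assuming the input was valid JSON to begin with.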