Unescape unicode-escapes, but not carriage returns and line feeds, in Python

Question

I have an ASCII-encoded JSON file with unicode-escapes (e.g., \u201cquotes\u201d) and newlines escaped within strings (e.g., "foo\r\nbar"). Is there a simple way in Python to generate a UTF-8 encoded file by un-escaping the unicode-escapes, but leaving the newline escapes intact?

Calling decode('unicode-escape') on the string will decode the unicode escapes (which is what I want) but it will also decode the carriage returns and newlines (which I don't want).
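
To make the problem concrete, here is a minimal sketch of the behaviour described above. It assumes Python 3, where a str has no decode method, so the text is first encoded to ASCII bytes; the variable names are illustrative only.

```python
# Raw JSON text containing both kinds of escape (illustrative sample).
escaped = r'"\u201cfoo\r\nbar\u201d"'

# unicode_escape treats the text like a Python string literal,
# so it decodes every escape it recognizes, not only \uXXXX.
decoded = escaped.encode('ascii').decode('unicode_escape')

print('\u201c' in decoded)  # True: the \u201c escape became a real character
print('\r\n' in decoded)    # True: \r and \n became real CR and LF as well
```

The second result is the unwanted part: the JSON string now contains raw control characters, which are not valid inside a JSON string.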

Recommended answer

Sure there is: use the right tool for the job. Ask the json module to decode the data to Python unicode, then encode the result to UTF-8:

import json

json.loads(input).encode('utf8')

Use unicode-escape only for actual Python string literals. JSON strings are not the same as Python strings, even though they may, at first glance, look very similar.

Short demo (bear in mind that the Python interactive interpreter echoes strings as literals):

>>> json.loads(r'"\u201cquotes\u201d"').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> json.loads(r'"foo\r\nbar"').encode('utf8')
'foo\r\nbar'

Note that the JSON decoder decodes \r and \n just like a Python literal would.
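
If the goal is a UTF-8 JSON file rather than a decoded Python value, one option is to parse the document and serialize it again. This is a sketch only, assuming Python 3 and that the input is a complete JSON document; reencode_as_utf8 is a hypothetical helper name, not something from the answer above. It relies on json.dumps with ensure_ascii=False, which writes non-ASCII characters directly while still escaping control characters.

```python
import json

def reencode_as_utf8(ascii_json_text):
    """Hypothetical helper: ASCII-escaped JSON text in, UTF-8 JSON bytes out."""
    data = json.loads(ascii_json_text)
    # ensure_ascii=False writes non-ASCII characters as themselves; control
    # characters such as CR and LF are still written as \r and \n escapes.
    return json.dumps(data, ensure_ascii=False).encode('utf-8')

sample = r'{"msg": "\u201cfoo\r\nbar\u201d"}'
utf8_bytes = reencode_as_utf8(sample)
print(utf8_bytes)
```

Because the document is re-serialized, whitespace and other formatting details of the original file may change; only the data is preserved.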

If you absolutely have to only process the \uabcd unicode literals in the JSON input but leave the rest intact, then you need to resort to a regular expression:

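import re

# Matches a single \uXXXX escape (four hex digits).
codepoint = re.compile(r'(\\u[0-9a-fA-F]{4})')

def replace(match):
    # Drop the leading \u and convert the hex digits to a character
    # (Python 2; on Python 3 use chr() instead of unichr()).
    return unichr(int(match.group(1)[2:], 16))

# text is the raw JSON input.
codepoint.sub(replace, text).encode('utf8')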

This gives:

>>> codepoint.sub(replace, r'\u201cquotes\u201d').encode('utf8')
'\xe2\x80\x9cquotes\xe2\x80\x9d'
>>> codepoint.sub(replace, r'"foo\r\nbar"').encode('utf8')
'"foo\\r\\nbar"'
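
The regular-expression approach has a few edge cases that the short version does not cover: an escaped backslash followed by u (as in \\u201c), surrogate pairs such as \ud83d\ude00, and escapes like \u0022 that would break the JSON if written out literally. The following is a hedged Python 3 sketch of one way to handle them; the names ESCAPES, _must_stay_escaped, _replace and unescape_unicode_only are hypothetical, and the choice of which characters stay escaped is an assumption made to keep the output valid JSON.

```python
import json
import re

# First alternative: a run of consecutive \uXXXX escapes (kept together so
# surrogate pairs can be joined). Second alternative: any other
# two-character escape such as \\ or \r, consumed so it is left untouched.
ESCAPES = re.compile(r'((?:\\u[0-9a-fA-F]{4})+)|\\.', re.DOTALL)

def _must_stay_escaped(ch):
    # Quotes, backslashes, control characters and lone surrogates cannot
    # appear literally in UTF-8 encoded JSON strings.
    code = ord(ch)
    return ch in '"\\' or code < 0x20 or 0xD800 <= code <= 0xDFFF

def _replace(match):
    run = match.group(1)
    if run is None:
        return match.group(0)  # not a \u escape: keep as written
    # Let the json module decode the run; it joins surrogate pairs.
    decoded = json.loads('"' + run + '"')
    return ''.join(
        '\\u%04x' % ord(ch) if _must_stay_escaped(ch) else ch
        for ch in decoded
    )

def unescape_unicode_only(json_text):
    return ESCAPES.sub(_replace, json_text)

utf8_bytes = unescape_unicode_only(r'"\u201cfoo\r\nbar\u201d"').encode('utf-8')
print(utf8_bytes)
```

Unlike re-serializing through json.loads and json.dumps, this leaves the rest of the file byte-for-byte as it was, at the cost of more code to maintain.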
