Converting to UTF-8 (again)


Problem description

I have this string: Traor\u0102\u0160

Traor\u0102\u0160 should produce Traoré. Then Traoré, decoded as UTF-8, should produce Traorè.

How can I convert it to Traorè?

What kind of characters are in Traor\u0102\u0160? Unicode?

I've already read http://docs.python.org/howto/unicode.html#encodings many times, but I'm still really confused.

I get this data with the following request:

import json
import requests

# Make a request to fetch this JSON
r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json')
print(r.json())  # r.json was an attribute in old requests releases; it is a method now

Solution

#! /usr/bin/env python
# -*- coding: utf-8 -*-
import json
import requests

headers = {'Content-Type': 'application/json'}

r = requests.get('http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json', headers=headers)


print(r.content.decode('utf-8'))  # r.content is raw bytes; decode as UTF-8 to display

# prints:
{"Item":{"FirstName":"Lacina","LastName":"Traoré","CommonName":null,"Height":"203","DateOfBirth":{"Year":"1990","Month":"8","Day":"20"},"PreferredFoot":"Left","ClubId":"100766","LeagueId":"67","NationId":"108","Rating":"78","Attribute1":"79","Attribute2":"71","Attribute3":"45","Attribute4":"69","Attribute5":"50","Attribute6":"72","Rare":"1","ItemType":"PlayerA"}}

Basically, I needed to send the right headers.

Thank you all

Answer

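For me, your site returns "Traor\u00e9" (the last character is é):

# Same URL as in the question
url = 'http://cdn.content.easports.com/fifa/fltOnlineAssets/2013/fut/items/web/199074.json'
r = requests.get(url)
# Parse the raw bytes ourselves instead of relying on the text decoded by requests
print(json.dumps(json.loads(r.content)['Item']['LastName']))
# -> "Traor\u00e9" -> Traoré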
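The "\u00e9" in that output is only JSON escaping, not a different string. As a minimal offline sketch (the `raw` bytes below are a hypothetical stand-in for `r.content`, not the actual server response), this shows that `json.loads` accepts UTF-8 bytes directly and that `json.dumps` escapes non-ASCII characters unless told otherwise:

```python
import json

# Hypothetical stand-in for r.content: the UTF-8 bytes a server would send.
raw = '{"Item": {"LastName": "Traoré"}}'.encode("utf-8")

# json.loads accepts UTF-8 bytes directly (Python 3.6+) and returns str values.
last_name = json.loads(raw)["Item"]["LastName"]

# json.dumps escapes non-ASCII characters by default, so é is written as \u00e9.
escaped = json.dumps(last_name)

# ensure_ascii=False keeps the character itself in the output.
literal = json.dumps(last_name, ensure_ascii=False)

print(escaped)
print(literal)
```

Both forms describe the same five-letter-plus-é string; only the serialized representation differs.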

r.json (and r.text) produce incorrect content here. Either the server, requests, or both use an incorrect encoding, which results in "Traor\u0102\u0160". The encoding of a JSON text is completely determined by its content, so it is always possible to decode it whatever headers the server sends. From the JSON RFC (RFC 4627):

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

       00 00 00 xx  UTF-32BE
       00 xx 00 xx  UTF-16BE
       xx 00 00 00  UTF-32LE
       xx 00 xx 00  UTF-16LE
       xx xx xx xx  UTF-8
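The table above can be turned into a small lookup. This is only a sketch: `detect_json_encoding` is a hypothetical helper name, not part of json or requests, and it assumes the stream has no byte order mark and starts with two ASCII characters as the RFC text describes.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON octet stream from its first four octets."""
    head = data[:4]
    if len(head) < 4:
        # Too short to show a null pattern; fall back to the default encoding.
        return "utf-8"
    # True where the octet is 00, False where it is anything else (xx).
    nulls = tuple(octet == 0 for octet in head)
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern, including no nulls at all, is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


sample = '{"LastName": "Traoré"}'.encode("utf-16-le")
print(detect_json_encoding(sample))  # utf-16-le
```

The returned name can be passed straight to `bytes.decode` before handing the text to `json.loads`.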

In this case there are no zero bytes at the start of r.content, so json.loads works on it as is. Otherwise you would need to convert the content to a Unicode string manually, either because the server sends an incorrect character encoding in the Content-Type header or to work around a bug in requests.
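If all you have is the already-garbled string from the question, the wrong decode can often be reversed. The sketch below assumes the UTF-8 bytes were decoded as ISO-8859-2; that is an inference from the code points (U+0102 and U+0160 are what the bytes C3 A9 map to in that charset), not something the server response confirms.

```python
garbled = "Traor\u0102\u0160"

# Assumption: the text was wrongly decoded as ISO-8859-2.
# Encoding with the same codec recovers the original bytes.
raw = garbled.encode("iso-8859-2")

# Those bytes are the UTF-8 encoding of the intended text.
repaired = raw.decode("utf-8")

# The reverse direction reproduces the garbled string from the question.
reproduced = "Traoré".encode("utf-8").decode("iso-8859-2")

print(raw)
print(repaired)
print(reproduced == garbled)
```

This kind of repair only works when the wrong codec is known and the round trip loses no bytes, so decoding r.content correctly in the first place is the more reliable fix.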
