Why won't Python display this text correctly? (UTF-8 Decoding Issue)


Question

import urllib.request as u

zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)

page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)

For some reason, my code is pulling in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe2\x80\x99 string of characters is Unicode for the ' symbol, but I can't figure out how to get Python to replace that set of characters with the ' symbol. I've tried decoding the string, but it's already in Unicode, and the replace code above doesn't change anything. Any advice as to what I'm doing incorrectly?

Recommended answer

When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string holding the representation of that object, so escape sequences appear instead of the real characters. (That means your string ends up containing \\xe2\\x80\\x99, that is, literal backslashes, as well as other undesired things such as the leading b' and the trailing quote.) bytes is mostly like str in Python 2: it stores no encoding information. str in Python 3 is like unicode in Python 2: it holds decoded text. So, when turning a bytes object into a str object, you need to tell Python which encoding the bytes are actually in. In this case, that's utf-8.
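The difference can be reproduced without any network access. The following is a minimal sketch; the `raw` value is a hand-written stand-in for what `con.read()` would return, not the actual page content.

```python
# Hypothetical stand-in for con.read(): the UTF-8 bytes of "Indiana’s 9th".
raw = b"Indiana\xe2\x80\x99s 9th"

as_repr = str(raw)             # the repr of the bytes object, not decoded text
as_text = raw.decode("utf-8")  # the actual text

print(as_repr)  # b'Indiana\xe2\x80\x99s 9th'
print(as_text)  # Indiana’s 9th

# The repr contains a literal backslash followed by "xe2", so searching for
# the three-character string "\xe2\x80\x99" finds nothing to replace.
unchanged = as_repr.replace("\xe2\x80\x99", "'") == as_repr
print(unchanged)  # True
```

This is why the replace call in the question appears to do nothing: the pattern it searches for never occurs in the string produced by str().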

Instead of calling str() on it, you would be better off using bytes.decode; it does the same thing as str() with an encoding argument, just more neatly.
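As a quick check of that equivalence, here is a small sketch using a hand-written bytes value (an assumption standing in for the downloaded page):

```python
# Hypothetical sample bytes; the real page content is not reproduced here.
raw = b"IN-09: Indiana\xe2\x80\x99s 9th"

via_str = str(raw, "utf-8")       # str() with an explicit encoding decodes
via_decode = raw.decode("utf-8")  # the neater spelling of the same operation

print(via_str == via_decode)  # True
print(via_decode)             # IN-09: Indiana’s 9th
```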

>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'

The only functional change made here is decoding the bytes object as 'utf-8'.
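For reuse, the decode-then-search step could be wrapped in a small helper. This is only a sketch: `extract_title` and the `sample` bytes are hypothetical names and data introduced for illustration, it uses a regular expression instead of the find() arithmetic above, and it does not attempt the site-specific trimming that the `- 15` offset performs.

```python
import re

def extract_title(raw, charset=None):
    """Decode raw page bytes and return the <title> text, or None if absent.

    charset is assumed to be UTF-8 when the caller does not supply one.
    """
    text = raw.decode(charset or "utf-8")
    match = re.search(r"<title>(.*?)</title>", text, re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical page fragment, hand-written for this example.
sample = b"<html><head><title>IN-09: Indiana\xe2\x80\x99s 9th</title></head></html>"
print(extract_title(sample))  # IN-09: Indiana’s 9th
```

With a live response, the charset declared by the server could be passed in as `extract_title(con.read(), con.headers.get_content_charset())`; that method returns None when no charset is declared, in which case the helper falls back to UTF-8.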
