Why won't Python display this text correctly? (UTF-8 Decoding Issue)
Question
import urllib.request as u
zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)
page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)
For some reason, my code pulls in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe2\x80\x99 sequence is the UTF-8 encoding of the ’ character, but I can't figure out how to get Python to replace that set of characters with the ' symbol. I've tried decoding the string, but it's already Unicode, and the replace code above doesn't change anything. Any advice as to what I'm doing incorrectly?
Recommended answer
When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns its representation as a string, so the escapes are used rather than the real characters. (That means your string ends up containing a literal \\xe2\\x80\\x99, as well as the b'...' wrapper and all sorts of other undesired things.) bytes is mostly like str in Python 2: it is raw data with no encoding information stored. str in Python 3 is like unicode in Python 2: it holds decoded text. So, when turning a bytes object into a str object, you need to tell Python what encoding the bytes are actually in. In this case, that's utf-8.
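To make the difference concrete, here is a small offline sketch. It assumes the title arrives as the UTF-8 bytes shown in the question (the raw value is built by hand here rather than fetched from the site) and compares str() with decode(). It also shows why the replace() call in the question has no effect.

```python
# Hand-built stand-in for the bytes the server would send (an assumption;
# nothing is fetched over the network here).
raw = "IN-09: Indiana\u2019s 9th".encode("utf-8")

as_repr = str(raw)             # representation of the bytes object, not a decode
decoded = raw.decode("utf-8")  # the actual text

print(as_repr)   # b'IN-09: Indiana\xe2\x80\x99s 9th'
print(decoded)   # IN-09: Indiana’s 9th

# In as_repr the escapes are literal characters (a backslash, then "xe2", ...),
# so searching for the three characters "\xe2\x80\x99" finds nothing.
replaced = as_repr.replace("\xe2\x80\x99", "'")
print(replaced == as_repr)  # True: nothing was replaced
```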
Instead of calling str() on it, you would be better off using bytes.decode; it does the same thing as passing an encoding to str(), just more neatly.
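As a quick check of that equivalence, the following sketch decodes the same hand-built bytes both ways. The final replace() is an optional extra, assuming you really do want a plain ASCII apostrophe as the question asked.

```python
# Hand-built UTF-8 bytes for illustration (not fetched from the site).
raw = "Indiana\u2019s".encode("utf-8")

via_str = str(raw, "utf-8")       # str() with an explicit encoding decodes
via_decode = raw.decode("utf-8")  # the more direct spelling

print(via_str == via_decode)  # True

# Once the text is decoded, the curly apostrophe is a single character
# (U+2019) and can be replaced directly.
ascii_title = via_decode.replace("\u2019", "'")
print(ascii_title)  # Indiana's
```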
>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'
The only functional change made here is decoding the bytes object as 'utf-8'.
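If you would rather not hard-code the encoding, one possible approach is to pass along the charset the server declares and fall back to UTF-8. The helper below, extract_title, is a hypothetical name introduced for illustration and is not part of the original answer. It returns the whole title text rather than trimming a site-specific suffix, and it is exercised on a hand-built sample instead of a live request.

```python
def extract_title(raw, charset=None):
    """Decode an HTML body (bytes) and return the text inside <title>.

    Hypothetical helper for illustration; falls back to UTF-8 when no
    charset is supplied.
    """
    page = raw.decode(charset or "utf-8")
    start = page.find("<title>")
    end = page.find("</title>")
    if start == -1 or end == -1:
        return None  # no complete <title> element found
    return page[start + len("<title>"):end]


# Hand-built sample body standing in for the real page.
sample = "<html><head><title>IN-09: Indiana\u2019s 9th</title></head></html>".encode("utf-8")
print(extract_title(sample))

# With a live response, the declared charset could be passed along like this:
# con = urllib.request.urlopen(url)
# title = extract_title(con.read(), con.headers.get_content_charset())
```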