Decoding HTML Entities With Python
Question
The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin":
import urllib2
from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
       "?method=librarything.ck.getwork&id=1907912"
       "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

# Parse the XML response and convert all entities to Unicode characters.
soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)

# Locate the canonical title field and print its value.
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string
Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.
Recommended answer
In the source of the web page it looks like this: The Children of H&Atilde;&ordm;rin. So the encoding is already broken somewhere on their side, before it even gets converted to XML.
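The broken form can be reproduced by hand, which helps confirm the diagnosis. The following is a minimal Python 3 sketch (the question's code is Python 2), assuming the service read the UTF-8 bytes of the title as Latin-1 and then escaped the result as named HTML entities. The variable names are illustrative, and the actual cause on the server side is not known.

```python
from html.entities import codepoint2name

title = "The Children of Húrin"

# Assumption: the UTF-8 bytes of the title were misread as Latin-1 upstream.
# "ú" is two bytes in UTF-8 (0xC3 0xBA), which Latin-1 reads as "Ã" and "º".
misread = title.encode("utf-8").decode("latin1")

# Escape each non-ASCII character as a named entity where one exists,
# mirroring what appears in the page source.
escaped = "".join(
    "&%s;" % codepoint2name[ord(ch)]
    if ord(ch) > 127 and ord(ch) in codepoint2name
    else ch
    for ch in misread
)

print(misread)
print(escaped)
```

If this assumption holds, converting the entities alone can never yield 'Húrin', because the entities themselves describe the wrong characters.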
If this is a general issue with all the books and you need to work around it, this seems to work:
unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
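For readers on Python 3, the same round trip can be sketched with the standard library alone. This is a hedged illustration rather than a drop-in replacement: `repair_title` is a hypothetical helper name, the input string is copied from the page source quoted above, and the fallback assumes that text which does not survive the round trip was not damaged in this way.

```python
import html


def repair_title(raw):
    """Hypothetical helper: undo UTF-8 text that was mis-decoded as Latin-1."""
    # Turn entities such as &Atilde; and &ordm; into the characters they name.
    text = html.unescape(raw)
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin1").decode("utf-8")
    except UnicodeError:
        # The round trip failed, so assume the text was not mis-decoded.
        return text


fixed = repair_title("The Children of H&Atilde;&ordm;rin")
print(fixed)
```

The try/except keeps correctly encoded titles untouched, since a string like 'Húrin' does not form valid UTF-8 once encoded as Latin-1. If the upstream decoding was actually Windows-1252 rather than Latin-1, some characters would need the "cp1252" codec instead.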