Decoding HTML Entities With Python

Views: 291 Published: 2017/8/16 20:09:45 Tags: python unicode encoding utf-8 beautifulsoup

Question

The following Python code uses BeautifulStoneSoup to fetch the LibraryThing API information for Tolkien's "The Children of Húrin".

import urllib2

from BeautifulSoup import BeautifulStoneSoup

URL = ("http://www.librarything.com/services/rest/1.0/"
            "?method=librarything.ck.getwork&id=1907912"
            "&apikey=2a2e596b887f554db2bbbf3b07ff812a")

soup = BeautifulStoneSoup(urllib2.urlopen(URL),
                          convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
title_field = soup.find('field', attrs={'name': 'canonicaltitle'})
print title_field.find('fact').string

Unfortunately, instead of 'Húrin', it prints out 'HÃºrin'. This is obviously an encoding issue, but I can't work out what I need to do to get the expected output. Help would be greatly appreciated.

Recommended answer

In the source of the web page it looks like this: The Children of H&Atilde;&ordm;rin. So the encoding is already broken somewhere on their side, before it even gets converted to XML.
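To see why the title comes out wrong, here is a minimal Python 3 sketch (the question's code is Python 2, so this is an illustration rather than the original setup). It assumes the page source holds the entities `&Atilde;` and `&ordm;`, and uses the standard library's `html.unescape` in place of BeautifulStoneSoup's entity conversion. The variable names are hypothetical.

```python
import html

# Assumed raw text from the API response: the entities spell out the two
# Latin-1 characters that the UTF-8 bytes of "ú" turn into when misread.
raw_title = "The Children of H&Atilde;&ordm;rin"

# Decoding the entities faithfully reproduces the already-broken text.
decoded = html.unescape(raw_title)
print(decoded)

# "ú" is two bytes in UTF-8; read one byte at a time as Latin-1,
# those bytes become "Ã" and "º".
utf8_bytes = "ú".encode("utf-8")
print(utf8_bytes, utf8_bytes.decode("latin-1"))
```

So the entity decoding itself behaves correctly; the damage was done before the entities were written.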

If it's a general issue with all the books and you need to work around it, this seems to work:

unicode(title_field.find('fact').string).encode("latin1").decode("utf-8")
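For readers on Python 3, the same round trip might look like the sketch below. The helper name `fix_mojibake` is hypothetical, and the fallback to the unchanged input is an added assumption, so that titles which were never double-encoded pass through untouched.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    Returns the input unchanged when the round trip is not possible,
    which usually means the text was not double-encoded to begin with.
    """
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(fix_mojibake("The Children of HÃºrin"))
print(fix_mojibake("The Children of Húrin"))
```

The fallback is a heuristic: a string that happens to survive the round trip will still be rewritten, so it is worth spot-checking the results across several books.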
