Getting international characters from a web page?
Problem description
I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that a player such as the first chap, ÄÄRITALO, comes out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;
Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy; the parser would do it.
I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:
>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!
(It would be nice if the standard codecs module included a codec for this, so that you could do "some_string".decode('html_entities'), but unfortunately it doesn't!)
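As an aside for readers on current Python: the standard library still has no such codec, but since Python 3.4 the `html` module provides `html.unescape`, which handles decimal, hex and named entities directly:

```python
# Python 3.4+: stdlib entity unescaping, no third-party parser needed.
from html import unescape

print(unescape("&#196;&#196;RITALO!"))  # ÄÄRITALO!
print(unescape("&Auml;"))               # Ä (named entity)
```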
EDIT: Another solution: Python developer Fredrik Lundh (author of elementtree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not handle the hex ones).
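Lundh's exact code is on his site; a minimal sketch of the same idea (this is a hypothetical re-implementation, not his code) is to match `&...;` sequences with a regexp, decode decimal and hex character references numerically, and look named entities up in the standard library's entity table:

```python
# Minimal entity unescaper handling decimal (&#196;), hex (&#xC4;)
# and named (&Auml;) entities. Uses the stdlib entity table
# (html.entities on Python 3; htmlentitydefs on Python 2).
import re
from html.entities import name2codepoint

def unescape_entities(text):
    def replace(match):
        ent = match.group(1)
        if ent.startswith("#x") or ent.startswith("#X"):
            return chr(int(ent[2:], 16))      # hex reference, e.g. &#xC4;
        if ent.startswith("#"):
            return chr(int(ent[1:]))          # decimal reference, e.g. &#196;
        if ent in name2codepoint:
            return chr(name2codepoint[ent])   # named entity, e.g. &Auml;
        return match.group(0)                 # leave unknown entities as-is

    return re.sub(r"&(#?[xX]?\w+);", replace, text)

print(unescape_entities("&#196;&#196;RITALO!"))  # ÄÄRITALO!
```

Unknown entity names are left untouched rather than raising, which is usually what you want when scraping messy HTML.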