Getting international characters from a web page?


Problem description



I want to scrape some information off a football (soccer) web page using simple Python regular expressions. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses numeric character references for the special characters, such as &#196;

Is there a simple way of reading the HTML into a correct Python string? If it were XML/XHTML it would be easy, since the parser would do it.

Solution

I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup    
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
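For readers on a newer interpreter than the Python 2 session above: Python 3.4 and later ship `html.unescape` in the standard library, which covers much of what that wished-for codec would do. The snippet below is a minimal sketch of that route, assuming Python 3; the variable names are illustrative only.

```python
import html

# The same markup as in the BeautifulSoup example above.
raw = "<html>&#196;&#196;RITALO!</html>"

# html.unescape resolves decimal, hexadecimal and named references
# and leaves the surrounding tags untouched.
decoded = html.unescape(raw)
print(decoded)

# Hexadecimal and named forms of the same character also resolve.
print(html.unescape("&#xC4; &Auml;"))
```

Note that this only decodes the character references; it does not parse the HTML, so extracting the player names still needs a parser or a regular expression.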

EDIT: Another solution: Python developer Fredrik Lundh (author of elementtree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
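To give an idea of how a regex-based unescape function of that kind can work, here is a rough sketch. It is not Fredrik Lundh's actual code; the name `unescape_entities` and the pattern are made up for illustration, and it assumes Python 3, where the entity table lives in `html.entities`.

```python
import re
from html.entities import name2codepoint

# Matches hex (&#xC4;), decimal (&#196;) and named (&Auml;) references.
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_entities(text):
    """Replace character references in text (hypothetical helper)."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range code point: leave text as-is.
            return match.group(0)

    return _ENTITY_RE.sub(replace, text)


print(unescape_entities("&#196;&#xC4;&Auml;RITALO!"))
```

Unknown references such as `&bogus;` are passed through unchanged rather than raising, which is usually the safer choice when scraping pages you do not control.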
