Converting HTML source content into a readable format with Python 2.x


Problem description

Python 2.7

I have a program that gets video titles from the source code of a webpage, but the titles are encoded in some HTML format.

This is what I've tried so far:

>>> import urllib2
>>> urllib2.unquote('&pound;')
'&pound;'

So that didn't work... Then I tried:

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> h.unescape('&pound;')
u'\xa3'

As you can see, that doesn't work either, nor does any combination of the two.

I managed to find out that '&pound;' is an HTML character entity name. I wasn't able to find out what '\xa3' is.

Does anyone know how to do this, i.e. how to convert HTML content into a readable format in Python?

Solution

&pound; is the HTML character entity for the POUND SIGN, which is Unicode character U+00A3. You can see this if you print it:

>>> print u'\xa3'
£

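When you use unescape(), you convert the character entity to its native Unicode character, which is what u'\xa3' means: a single U+00A3 Unicode character. So your second attempt did work; the interpreter is just showing you the repr of the string rather than printing it.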

If you want to encode this into another format (e.g. utf-8), you would do so with the encode method of strings:

>>> u'\xa3'.encode('utf-8')
'\xc2\xa3'

You get a two-byte string representing the single "POUND SIGN" character.
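
To make the difference between characters and bytes concrete, here is a small sketch that encodes the same character with two codecs. Latin-1 is only there as a contrasting example and is not something the question mentions:

```python
from __future__ import print_function

pound = u'\xa3'  # one Unicode character, U+00A3

utf8_bytes = pound.encode('utf-8')      # two bytes: 0xC2 0xA3
latin1_bytes = pound.encode('latin-1')  # one byte: 0xA3

# One character, but the byte length depends on the encoding chosen
print(len(pound), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching codec recovers the original character
assert utf8_bytes.decode('utf-8') == pound
assert latin1_bytes.decode('latin-1') == pound
```

The character is the same in both cases. Only its byte representation changes, which is why you need to know the encoding whenever you move between bytes and unicode.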

I suspect that you are a bit unclear about how string encodings work in general. You need to convert your string from bytes to Unicode (see this answer for one way to do that with urllib2), then unescape the HTML, then (possibly) encode the Unicode string into whatever output encoding you need.
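
The three steps could be sketched roughly like this. The helper name readable_title and its parameters are hypothetical, and the source encoding is assumed to be known already (a real page normally declares it, and it may not be UTF-8). The byte string stands in for whatever you read from the response:

```python
from __future__ import print_function

try:
    from HTMLParser import HTMLParser  # Python 2.x
    unescape = HTMLParser().unescape
except ImportError:
    from html import unescape          # Python 3.4+


def readable_title(raw_bytes, source_encoding='utf-8', output_encoding='utf-8'):
    """Hypothetical helper: bytes in, bytes out, entities resolved in between."""
    text = raw_bytes.decode(source_encoding)  # 1. bytes -> unicode
    text = unescape(text)                     # 2. resolve HTML entities
    return text.encode(output_encoding)       # 3. unicode -> output bytes


# Stand-in for bytes read from a response object
page_bytes = b'Price: &pound;5 &amp; up'

result = readable_title(page_bytes)
print(repr(result))
```

If you only need to work with the text inside your program, you can stop after step 2 and keep the unicode string. Encoding is only needed when you write the result to a file, a terminal, or the network.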
