urllib2 read to Unicode


Question


I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.

I have tried something like:

import urllib2

req = urllib2.urlopen('http://lenta.ru')
content = req.read()

The content is a byte stream, so I can't search it for a Unicode string.

I need some way, when I do urlopen and then read, to use the charset from the headers to decode the content and encode it into UTF-8.

Solution

After the operations you performed, you'll see:

>>> req.headers['content-type']
'text/html; charset=windows-1251'

and so:

>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)

ucontent is now a Unicode string (of 140655 characters) -- so, for example, to display part of it if your terminal is UTF-8:

>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>

and you can search, etc, etc.

Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker) but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter (completely unrelated to the original question) to show how, once a Unicode string IS correctly input (I'm doing it by codepoints -- goofy but not tricky;-), search is absolutely a no-brainer (and thus hopefully the original question has been thoroughly answered). Again assuming a UTF-8 terminal:

>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93

Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
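For the meta-tag case mentioned in the note, one common workaround is to sniff the start of the raw bytes for a charset declaration before decoding. This is a regex sketch, not a real HTML parser; the function name, the 1024-byte window, and the `default` fallback are assumptions for illustration:

```python
import re

def charset_from_html(raw_bytes, default='utf-8'):
    # Scan the first 1024 bytes for a declaration such as
    # <meta charset="utf-8"> or
    # <meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
    match = re.search(br'charset=["\']?([A-Za-z0-9_\-]+)', raw_bytes[:1024])
    if match:
        return match.group(1).decode('ascii').lower()
    return default
```

You would try the HTTP header first and fall back to this only when the header carries no charset, mirroring how browsers resolve the encoding.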
