Gibberish from urlopen


Problem Description

I am trying to read some utf-8 files from the addresses in the code below. It works for most of them, but for some files the urllib2 (and urllib) is unable to read.

The obvious answer here is that the second file is corrupt, but the strange thing is that IE reads both of them with no problem at all. The code has been tested on both XP and Linux, with identical results. Any suggestions?

import urllib2
#This works:
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/145/pg145.txt")
line=f.readline()
print "this works: %s" % (line)
line=unicode(line,'utf-8') #... works fine

#This doesn't
f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
line=f.readline()
print "this doesn't: %s" % (line)
line=unicode(line,'utf-8')#...causes an exception:

Recommended Answer

>>> f=urllib2.urlopen("http://www.gutenberg.org/cache/epub/144/pg144.txt")
>>> f.headers.dict
{'content-length': '304513', ..., 'content-location': 'pg144.txt.utf8.gzip', 'content-encoding': 'gzip', ..., 'content-type': 'text/plain; charset=utf-8'}

The `content-encoding: gzip` header shows the server is returning a gzip-compressed body, which `urllib2` does not decompress for you. Either set a request header that prevents the site from sending a gzip-encoded response, or decompress the body yourself before decoding it as UTF-8.
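Both fixes can be sketched as follows. This is a Python 3 version (the question uses Python 2's `urllib2`; in Python 3 the equivalent module is `urllib.request`), and the function names `read_text` and `decode_body` are illustrative, not from the original post. Sending `Accept-Encoding: identity` asks the server for an uncompressed body; checking `Content-Encoding` and calling `gzip.decompress` covers the case where it compresses anyway.

```python
import gzip
import urllib.request

def decode_body(body, content_encoding):
    # Decompress the raw bytes first if the server gzip-encoded them,
    # then decode the UTF-8 text.
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return body.decode("utf-8")

def read_text(url):
    # Fix 1: ask the server not to compress the response at all.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(req) as f:
        # Fix 2: if a gzip body arrives anyway, decompress it before decoding.
        encoding = f.headers.get("Content-Encoding", "")
        return decode_body(f.read(), encoding)
```

With this in place, `read_text("http://www.gutenberg.org/cache/epub/144/pg144.txt")` returns the same readable text that IE shows, because IE performs the same gzip decompression transparently.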
