Urllib2 - fetch and show any language page, encoding problem
I'm using Python on Google App Engine to fetch HTML pages and display them. My aim is to be able to fetch any page in any language. Now I have a problem with encoding:
A simple
result = urllib2.urlopen(url).read()
leaves artifacts in place of special letters, and
urllib2.urlopen(url).read().decode('utf8')
raises an error:
'utf8' codec can't decode bytes in position 3544-3546: invalid data
So how can I solve it? Is there any library that would detect a page's encoding and convert the content so that it is readable?
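For context, a 'utf8' codec error like the one above typically means the page's bytes were written in some other encoding. The following is a minimal sketch with a made-up byte string (not a real page) showing the same failure and how the right codec fixes it; the variable names are purely illustrative.

```python
# A made-up example: the word "cafe" with an accented e, encoded as Latin-1.
raw = u"caf\u00e9".encode("latin-1")   # the bytes 'caf\xe9'

error = None
try:
    # The lone byte 0xE9 is not a valid UTF-8 sequence, so this raises.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
```

The bytes themselves carry no label saying which encoding produced them, which is why the encoding has to be declared by the server or guessed from the content.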
rajax suggested, in "How to download any(!) webpage with correct charset in python?", using the chardet library from http://chardet.feedparser.org/
This code seems to work now:
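import urllib2
import chardet

def fetch(url):
    try:
        result = urllib2.urlopen(url)
        rawdata = result.read()
        # chardet.detect() returns a dict with 'encoding' and 'confidence' keys
        detected = chardet.detect(rawdata)
        # 'encoding' can be None when detection fails, so fall back to UTF-8
        return rawdata.decode(detected['encoding'] or 'utf-8')
    except urllib2.URLError as e:
        # handleError is the application's own error handler (not shown here)
        handleError(e)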
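Detection is a guess, so it may be worth trying the charset the server declares in its Content-Type header first. The sketch below is not part of the answer above: it is written in Python 3 syntax, uses only the standard library, and the names charset_from_header and decode_body are hypothetical. A detector such as chardet could be slotted in before the final fallback.

```python
from email.message import Message


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter, if any.
    return msg.get_content_charset()


def decode_body(raw, content_type=None):
    """Decode raw bytes, trying the declared charset, then UTF-8, then Latin-1."""
    candidates = []
    declared = charset_from_header(content_type)
    if declared:
        candidates.append(declared)
    candidates.append("utf-8")
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this never raises,
    # although the resulting text may be wrong for other encodings.
    return raw.decode("latin-1")
```

With urllib2, the header value would come from the response's headers (for example result.info().getheader('Content-Type') in Python 2); pages that declare no charset, or declare the wrong one, still need a detector or a sensible fallback.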