Urllib2 - fetch and show any language page, encoding problem

Question

I'm using Python on Google App Engine to fetch HTML pages and display them. My aim is to be able to fetch any page in any language. I have run into an encoding problem:

A plain

result = urllib2.urlopen(url).read()

leaves artifacts in place of special characters, and

urllib2.urlopen(url).read().decode('utf8')

throws an error:

'utf8' codec can't decode bytes in position 3544-3546: invalid data

So how can I solve this? Is there a library that detects which encoding a page uses and converts it so that it is readable?

Solution

rajax suggested, in "How to download any(!) webpage with correct charset in python?", using the chardet library from http://chardet.feedparser.org/

This code seems to work now:

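import urllib2
import chardet

def fetch(url):
    try:
        result = urllib2.urlopen(url)
        rawdata = result.read()
        # chardet.detect() returns a dict; its 'encoding' key holds the guess
        encoding = chardet.detect(rawdata)
        return rawdata.decode(encoding['encoding'])
    except urllib2.URLError as e:
        # handleError is the poster's own error handler (not shown here)
        handleError(e)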
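
One gap in the code above is that chardet.detect() can report None as the encoding when it cannot make a guess, and decode(None) then fails. The following is a minimal sketch, not part of the original answer, of one way to make the decoding step more forgiving: try the charset declared in the Content-Type header first, then a list of fallbacks, and only then a last-resort decoding. The names charset_from_content_type, decode_page and the fallbacks parameter are hypothetical helpers introduced for illustration.

```python
import re

# Matches "charset=..." inside a Content-Type header value.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = _CHARSET_RE.search(content_type)
    return match.group(1) if match else None


def decode_page(rawdata, content_type=None, fallbacks=('utf-8',)):
    """Decode raw page bytes, trying the declared charset before the fallbacks.

    Hypothetical helper for illustration; not part of urllib2 or chardet.
    """
    candidates = []
    declared = charset_from_content_type(content_type)
    if declared:
        candidates.append(declared)
    candidates.extend(fallbacks)

    for name in candidates:
        try:
            return rawdata.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess, or an encoding name Python does not know.
            continue

    # Last resort: latin-1 maps every byte to a character, so it never
    # fails, although the text may be wrong for other encodings.
    return rawdata.decode('latin-1')
```

Inside fetch(), the header value should be available as result.info().getheader('Content-Type') under Python 2's urllib2, and a non-None chardet guess could be added to the fallbacks tuple. Declared charsets are sometimes wrong, so treat the result as a best effort.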
