How to deal with unknown encoding when scraping webpages?


Problem Description




I'm scraping news articles from various sites, using GAE and Python.

The code where I scrape one article url at a time leads to the following error:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)
    

Here's my code in its simplest form:

    from google.appengine.api import urlfetch
    
    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            return result.content
    

Here is another variant I have tried, with the same result:

    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            s = result.content
            s = s.decode('utf-8')
            s = s.encode('utf-8')
            s = unicode(s,'utf-8')
            return s
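
Even when the first decode succeeds, this round trip is a no-op: decode, encode, and unicode() together are equivalent to a single s.decode('utf-8'). And that decode raises as soon as a page isn't actually UTF-8, as this one-line illustration (mine, using nothing beyond stock Python 2) shows:

    '\xe9t\xe9'.decode('utf-8')  # UnicodeDecodeError: these are the latin-1 bytes
                                 # of 'été', not a valid UTF-8 sequence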
    

Here's the ugly, brittle one, which also doesn't work:

    def fetch(url):
        headers = {'User-Agent' : "Chrome/11.0.696.16"}
        result = urlfetch.fetch(url,headers)
        if result.status_code == 200:
            s = result.content
    
            try:
                s = s.decode('iso-8859-1')
            except:
                pass
            try:
                s = s.decode('ascii')
            except: 
                pass
            try:
                s = s.decode('GB2312')
            except:
                pass
            try:
                s = s.decode('Windows-1251')
            except:
                pass
            try:
                s = s.decode('Windows-1252')
            except:
                s = "did not work"
    
            s = s.encode('utf-8')
            s = unicode(s,'utf-8')
            return s
    

The last variant returns s as the string "did not work" from the last except.
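
Worth noting why this cascade can never work: iso-8859-1 assigns a character to every one of the 256 byte values, so the very first decode always "succeeds" and s becomes a (possibly garbled) unicode object. Every later .decode() call on it then implicitly encodes with the ASCII codec, raises, and is swallowed by a bare except, until the last one assigns "did not work". A minimal Python 2 demonstration of that failure mode:

    s = '\xe2\x80\x99'          # the UTF-8 bytes of a right single quote (U+2019)
    u = s.decode('iso-8859-1')  # never raises: latin-1 maps all 256 byte values,
                                # yielding mojibake u'\xe2\x80\x99' instead of an error
    u.decode('ascii')           # decoding a unicode object first ASCII-encodes it,
                                # raising UnicodeEncodeError -- swallowed by the bare excepts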

So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to fetch the entire html rather than parse it with BeautifulSoup right away? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

Solution

I had the same problem some time ago and there is nothing 100% accurate. What I did was (see the sketch after this list):

• Get encoding from Content-Type
• Get encoding from meta tags
• Detect encoding with chardet Python module
• Decode text from the most common encoding to Unicode
• Process the text/html
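
A minimal sketch of that cascade, written for the question's GAE/Python 2 setup. This is my illustration, not the answerer's actual code, and it assumes the third-party chardet module is installed:

    import re
    import chardet                       # third-party detector, assumed installed
    from google.appengine.api import urlfetch

    def guess_encoding(result):
        # 1. charset parameter of the Content-Type response header
        #    (header-key casing normalized, since it may vary by server)
        headers = dict((k.lower(), v) for k, v in result.headers.items())
        m = re.search(r'charset=([\w-]+)', headers.get('content-type', ''), re.I)
        if m:
            return m.group(1)
        # 2. <meta ... charset=...> declaration in the raw bytes
        m = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', result.content, re.I)
        if m:
            return m.group(1)
        # 3. statistical guess; chardet may return None on short or ambiguous input
        guess = chardet.detect(result.content)
        # 4. fall back to the most common encoding on the web
        return guess['encoding'] or 'utf-8'

    def fetch(url):
        result = urlfetch.fetch(url, headers={'User-Agent': 'Chrome/11.0.696.16'})
        if result.status_code == 200:
            # 'replace' substitutes U+FFFD for stray bad bytes instead of raising,
            # so one malformed byte cannot kill the whole page
            return result.content.decode(guess_encoding(result), 'replace')

A page can still declare a charset Python has no codec for, which makes decode raise LookupError, so a production version would validate the name with codecs.lookup() before trusting it and fall back to chardet's guess.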
