Scrapy spider: dealing with pages that have incorrectly-defined character encoding


Question

Update: this error can be reproduced simply by running this from the command line:

scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future

I'm using Scrapy to crawl a website. Every page I scrape claims to be encoded UTF-8:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

But occasionally, the pages contain bytes that fall outside of UTF-8, and I get Scrapy errors like:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
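The failure can be reproduced outside Scrapy with plain Python: a byte like 0xe8 is a legal character in Latin-1, but in UTF-8 it begins a multi-byte sequence, so a following ASCII byte makes the data undecodable. A minimal sketch (the sample bytes are made up for illustration; the real pages may use a different encoding):

```python
# 0xe8 is "è" in Latin-1, but in UTF-8 it starts a multi-byte
# sequence, so the ASCII byte after it is an invalid continuation.
raw = b"tr\xe8s bien"  # hypothetical page fragment; actually Latin-1

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... can't decode byte 0xe8 ...: invalid continuation byte

# Two ways to keep going instead of crashing:
print(raw.decode("utf-8", errors="replace"))  # lossy: "tr\ufffds bien"
print(raw.decode("latin-1"))                  # correct: "très bien"
```

The `errors="replace"` handler substitutes U+FFFD for undecodable bytes, which keeps the crawl alive at the cost of losing those characters; decoding with the encoding the bytes actually use preserves them.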

I still need to scrape these pages, even though they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding, and use another (say, UTF-16) instead?

Here's where the exception is being caught:

2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
        result = method(response=response, result=result, spider=spider)

Answer

There has been some work on encoding in the latest dev scrapy (0.15). It could be worth trying the latest version.

Scrapy lets you access the decoded text via response.body_as_unicode(). This handles encoding detection in much the same way a browser does, and you should nearly always use it instead of the raw body. As of Scrapy 0.15, it relies on w3lib.encoding.html_to_unicode, with a little customization.
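To get a feel for what browser-style detection involves, here is a simplified stdlib sketch (this is not w3lib's actual implementation — just an illustration of the idea): honour a declared charset first, then a charset found in the page's own `<meta>` tag, then fall back, and never crash:

```python
import re

def sniff_and_decode(body, header_charset=None):
    """Simplified sketch of browser-style decoding (not w3lib's code):
    try the charset from the HTTP header, then a <meta> declaration
    found near the start of the body, then UTF-8, then a lossy fallback."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Look for a <meta ... charset=...> declaration in the first bytes.
    m = re.search(rb'charset=["\']?([\w-]+)', body[:1024])
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates.append("utf-8")
    for enc in candidates:
        try:
            return body.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode leniently rather than raise.
    return body.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

For example, a page that declares latin-1 in its meta tag decodes cleanly even though the same bytes would raise under UTF-8, while a body with no usable declaration falls through to the lossy branch.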

Decoding happens lazily, the first time the unicode body is requested. You can create a new response from the one you receive in the spider, specifying the encoding yourself; however, this shouldn't be necessary.

It's not clear from the traceback which bit of code is actually causing the error to happen. Was there any more detail? Another possibility could be that the body is getting truncated somehow.

If these pages are handled correctly by a browser and not by scrapy, then it would be appreciated if you could make a simple test case and report a bug.
