Scrapy spider: dealing with pages that have incorrectly-defined character encoding


Question

Update: this error can be reproduced simply by running this from the command line:

scrapy shell http://www.indiegogo.com/Straight-Talk-About-Your-Future

I'm using Scrapy to crawl a website. Every page I scrape claims to be encoded UTF-8:

<meta content="text/html; charset=utf-8" http-equiv="Content-Type">

But occasionally, the pages contain bytes that fall outside of UTF-8, and I get Scrapy errors like:

exceptions.UnicodeDecodeError: 'utf8' codec can't decode byte 0xe8 in position 131: invalid continuation byte
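The failure can be reproduced outside Scrapy with plain Python: a byte like 0xe8 is a legal character in Latin-1, but in UTF-8 it begins a multi-byte sequence, so a following ASCII byte makes the data undecodable. A minimal sketch (the sample bytes are made up for illustration; the real pages may use a different encoding):

```python
# 0xe8 is "è" in Latin-1, but in UTF-8 it starts a multi-byte
# sequence, so the ASCII byte after it is an invalid continuation.
raw = b"tr\xe8s bien"  # hypothetical page fragment; actually Latin-1

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... can't decode byte 0xe8 ...: invalid continuation byte

# Two ways to keep going instead of crashing:
print(raw.decode("utf-8", errors="replace"))  # lossy: "tr\ufffds bien"
print(raw.decode("latin-1"))                  # correct: "très bien"
```

The `errors="replace"` handler substitutes U+FFFD for undecodable bytes, which keeps the crawl alive at the cost of losing those characters; decoding with the encoding the bytes actually use preserves them.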

I still need to scrape these pages, even though they contain unmappable characters. Is there a way to tell Scrapy to override the page's declared encoding, and use another (say, UTF-16) instead?

Here's where the exception is being caught:

2012-05-30 14:43:20+0200 [igg] ERROR: Spider error processing <GET http://www.site.com/page>
    Traceback (most recent call last):
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1178, in mainLoop
        self.runUntilCurrent()
      File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 368, in callback
        self._startRunCallbacks(result)
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 464, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/twisted/internet/defer.py", line 551, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/Library/Python/2.7/site-packages/scrapy/core/spidermw.py", line 61, in process_spider_output
        result = method(response=response, result=result, spider=spider)

Answer

There has been some work on encoding in the latest dev scrapy (0.15). It could be worth trying the latest version.

Scrapy lets you access the decoded text via response.body_as_unicode(). This handles encoding detection in much the same way a browser does, and you should nearly always use it instead of the raw body. As of Scrapy 0.15, it relies on w3lib.encoding.html_to_unicode, with a little customization.
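To get a feel for what browser-style detection involves, here is a simplified stdlib sketch (this is not w3lib's actual implementation — just an illustration of the idea): honour a declared charset first, then a charset found in the page's own `<meta>` tag, then fall back, and never crash:

```python
import re

def sniff_and_decode(body, header_charset=None):
    """Simplified sketch of browser-style decoding (not w3lib's code):
    try the charset from the HTTP header, then a <meta> declaration
    found near the start of the body, then UTF-8, then a lossy fallback."""
    candidates = []
    if header_charset:
        candidates.append(header_charset)
    # Look for a <meta ... charset=...> declaration in the first bytes.
    m = re.search(rb'charset=["\']?([\w-]+)', body[:1024])
    if m:
        candidates.append(m.group(1).decode("ascii"))
    candidates.append("utf-8")
    for enc in candidates:
        try:
            return body.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: decode leniently rather than raise.
    return body.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

For example, a page that declares latin-1 in its meta tag decodes cleanly even though the same bytes would raise under UTF-8, while a body with no usable declaration falls through to the lossy branch.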

Decoding happens lazily, the first time the unicode body is requested. You can create a new response from the one you receive in the spider, specifying the encoding yourself; however, this shouldn't be necessary.

It's not clear from the traceback which bit of code is actually causing the error to happen. Was there any more detail? Another possibility could be that the body is getting truncated somehow.

If these pages are handled correctly by a browser and not by scrapy, then it would be appreciated if you could make a simple test case and report a bug.
