500 Internal Server Error (Scrapy)


Problem description

I am using Scrapy to crawl a product website with over 4 million products. However, after crawling around 50k products it starts throwing 500 HTTP errors. I have set AutoThrottle to false, because with it enabled the crawl is very slow and would take around 20-25 days to complete. I think the server starts blocking the crawler temporarily after some time. Any solutions for what can be done?

I am using the sitemap crawler. If the server is not responding, I want to extract some information from the URL itself and proceed with the next URL, instead of finishing the crawl and closing the spider. For that I was looking at the errback parameter of Request. However, since I am using the sitemap crawler, I don't explicitly create a Request object. Is there a default errback function that I can override, or where can I define it?

One more way to do it is described here: Scrapy: In a request fails (eg 404,500), how to ask for another alternative request?

Recommended answer

HTTP 500 typically indicates an internal server error. When you are being blocked, it is much more likely you'd see a 403 or 404 (or perhaps a 302 redirect to a "you've been blocked" page). You are probably visiting links that cause something to break server-side. You should store which request caused the error and try visiting it yourself; it could be that the site is simply broken.
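
One hedged way to store which requests fail, before touching the spider's internals, is Scrapy's HTTPERROR_ALLOWED_CODES setting: it lets 500 responses pass through HttpErrorMiddleware to your callback so you can log the offending URL and move on. A minimal sketch; the callback name parse_product is just the one used in the rules further below:

# settings.py: allow HTTP 500 responses to reach the spider callback
# instead of being filtered out by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [500]

# In the spider (illustrative callback name):
def parse_product(self, response):
    if response.status == 500:
        # Record the offending URL so it can be inspected by hand later.
        # (self.logger needs Scrapy >= 1.0; older versions use self.log.)
        self.logger.warning('HTTP 500 on %s', response.url)
        return
    # ... normal parsing of a 2xx product page ...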

OK, I get it, but can you tell me where and how to define the errback function, so that I can handle this error and my spider does not finish?

I took a look at SitemapSpider and, unfortunately, it does not allow you to specify an errback function, so you're going to have to add support for it yourself. I'm basing this on the source for SitemapSpider.

First, you're going to want to change how sitemap_rules works by adding a function to handle errors:

sitemap_rules = [
    ('/product/', 'parse_product'),
    ('/category/', 'parse_category'),
]

will become:

sitemap_rules = [
    ('/product/', 'parse_product', 'error_handler'),
    ('/category/', 'parse_category', 'error_handler'),
]

Next, in __init__, you want to store the new callback in _cbs.

for r, c in self.sitemap_rules:
    if isinstance(c, basestring):
        c = getattr(self, c)
    self._cbs.append((regex(r), c))

will become:

for r, c, e in self.sitemap_rules:
    if isinstance(c, basestring):  # basestring is Python 2; use str on Python 3
        c = getattr(self, c)
    if isinstance(e, basestring):
        e = getattr(self, e)
    self._cbs.append((regex(r), c, e))

Finally, at the end of _parse_sitemap, you can specify your new errback function:

elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c)
                break

will become:

elif s.type == 'urlset':
    for loc in iterloc(s):
        for r, c, e in self._cbs:
            if r.search(loc):
                yield Request(loc, callback=c, errback=e)
                break
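
Putting the three changes together, the subclass might be laid out like this (a sketch only: the class name, spider name, and sitemap URL are placeholders, and the import path varies by Scrapy version):

from scrapy.contrib.spiders import SitemapSpider  # 'scrapy.spiders' in Scrapy >= 1.0

class ProductSitemapSpider(SitemapSpider):  # hypothetical name
    name = 'products'
    sitemap_urls = ['http://www.example.com/sitemap.xml']  # placeholder
    sitemap_rules = [
        ('/product/', 'parse_product', 'error_handler'),
        ('/category/', 'parse_category', 'error_handler'),
    ]

    # __init__ builds the three-tuples into self._cbs, and _parse_sitemap
    # passes errback=e when yielding each Request, as in the snippets above.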

From there, simply implement your errback function (keep in mind that it takes a Twisted Failure as an argument) and you should be good to go.
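
For completeness, a minimal errback sketch. The method name error_handler matches the rules above; the HttpError check follows the errback examples in the Scrapy docs, and the product-id salvaging is purely illustrative. The original Request is available as failure.request:

from scrapy.spidermiddlewares.httperror import HttpError  # scrapy.contrib.spidermiddleware.httperror in old versions

def error_handler(self, failure):
    url = failure.request.url  # the Request that triggered the error
    if failure.check(HttpError):
        # The server responded, but with a non-2xx status such as 500.
        response = failure.value.response
        self.logger.warning('HTTP %d on %s', response.status, url)
    else:
        # DNS lookup, timeout, and connection errors end up here.
        self.logger.warning('Request for %s failed: %s', url, failure.getErrorMessage())
    # Salvage whatever the URL itself encodes, e.g. a trailing product id
    # (adapt the parsing to your URL scheme).
    product_id = url.rstrip('/').rsplit('/', 1)[-1]
    yield {'product_id': product_id, 'url': url, 'failed': True}

Note that in recent Scrapy versions an errback may yield items or follow-up requests just like a callback; on very old versions it is safer to only log here.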
