How to solve a 403 error in Scrapy


Question

I'm new to Scrapy and I created a Scrapy project to scrape data.

I'm trying to scrape data from the website, but I'm getting the following error logs:

2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-29 13:55:03 [scrapy] INFO: Spider opened
2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/Mumbai/small-business> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Ignoring response <403 http://www.justdial.com/Mumbai/small-business>: HTTP status code is not handled or not allowed
2016-08-29 13:55:04 [scrapy] INFO: Closing spider (finished)

When I run the following commands in the browser's web console I get a response, but when I use the same XPath inside my Python script I get the error described above.

Commands on the web console:

$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()')
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/text()')
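
For reference, Scrapy's response.xpath() is the counterpart of the browser console's $x() helper. A parse callback using these same expressions might look like the sketch below (hypothetical; the question doesn't show the actual script):

# Hypothetical parse callback -- the question's actual script is not shown.
# response.xpath() is Scrapy's equivalent of the browser console's $x().
def parse(self, response):
    sel = '//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]'
    names = response.xpath(sel + '/h4/span/a/text()').extract()
    contacts = response.xpath(sel + '/p[@class="contact-info"]/span/a/text()').extract()
    for name, contact in zip(names, contacts):
        yield {'name': name, 'contact': contact}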

Please help me.

Thanks

Answer

As Avihoo Mamka mentioned in the comments, you need to provide some extra request headers so that this website doesn't reject your requests.

In this case it seems to just be the User-Agent header. By default, Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)", and some websites reject this for one reason or another.

To avoid this, just set the headers parameter of your Request with a common user-agent string:

from scrapy import Request

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
yield Request(url, headers=headers)
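
Alternatively (this is standard Scrapy configuration, not part of the original answer), you can set the user agent project-wide with the built-in USER_AGENT setting, so every request sends it without changing any spider code:

# settings.py -- USER_AGENT is a built-in Scrapy setting applied to all requests.
# The string below is just an example browser user agent.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'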

You can find a huge list of user agents here, though you should stick with popular web browsers' strings, like Firefox's or Chrome's, for the best results.

You can also implement it to work with your spider's start_urls:

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'http://scrapy.org',
    )

    def start_requests(self):
        # Override start_requests so every initial request carries a
        # browser-like User-Agent header instead of Scrapy's default.
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)
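
To confirm that the header is what fixes the 403 before running the whole spider, you can test with scrapy shell, overriding the user agent via the -s flag (the URL is the one from the question):

scrapy shell -s USER_AGENT='Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0' 'http://www.justdial.com/Mumbai/small-business'

If the user agent was the problem, response.status inside the shell should now be 200 instead of 403.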
