Scrapy shell returns without a response

Question

I have a small problem crawling a website with Scrapy. I followed the Scrapy tutorial to learn how to crawl a website and wanted to test it on 'https://www.leboncoin.fr', but the spider doesn't work. So I tried:

scrapy shell 'https://www.leboncoin.fr'

But I get no response from the site.

$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS':    'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0,   'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1039fbd30>
[s]   item       {}
[s]   request    <GET https://www.leboncoin.fr>
[s]   settings   <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

If I use:

view(response)

an AttributeError is raised:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)

/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
     67     from scrapy.http import HtmlResponse, TextResponse
     68     # XXX: this implementation is a bit dirty and could be improved
---> 69     body = response.body
     70     if isinstance(response, HtmlResponse):
     71         if b'<base' not in body:

AttributeError: 'NoneType' object has no attribute 'body'

Edit 1:

To rrschmidt: the complete log has been updated, and when I run

fetch('https://www.leboncoin.fr')

I get this:

2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>

So, how can I fix it?

Thanks for your answers,

Chris

Recommended answer

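It looks like the website has restricted scraping via robots.txt. The "Forbidden by robots.txt" line in your log shows that the request was dropped before it was downloaded, which is why response is None and view(response) fails. It's usually polite to respect that wish.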
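To see how a robots.txt rule leads to a request being refused, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not how Scrapy's RobotsTxtMiddleware is implemented, and the robots.txt content, the helper name is_allowed and the example.com URL are all made up for illustration; the real file on the site may say something different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# This is NOT the actual file served by any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines,
    # so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "Disallow: /" for every user agent blocks the site root as well.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Scrapy", "https://example.com/"))
```

A crawler that obeys robots.txt performs a check along these lines before downloading a page, and skips the request when the answer is negative.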

But if you really want to scrape the site, you can stop Scrapy from obeying robots.txt by setting ROBOTSTXT_OBEY to False in your settings.py (your log shows the project currently sets it to True):

ROBOTSTXT_OBEY=False
