Scrapy shell returns without a response
Question

I have a little problem using Scrapy to crawl a website. I followed the Scrapy tutorial to learn how to crawl a site, and I wanted to test it on 'https://www.leboncoin.fr', but the spider doesn't work. So I tried:
scrapy shell 'https://www.leboncoin.fr'
But I don't get a response from the site.
$ scrapy shell 'https://www.leboncoin.fr'
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: all_cote)
2017-05-16 08:31:26 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'all_cote', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'all_cote.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['all_cote.spiders']}
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-16 08:31:27 [scrapy.middleware] INFO: Enabled item pipelines:[]
2017-05-16 08:31:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-16 08:31:27 [scrapy.core.engine] INFO: Spider opened
2017-05-16 08:31:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.leboncoin.fr/robots.txt> (referer: None)
2017-05-16 08:31:27 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
2017-05-16 08:31:28 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1039fbd30>
[s] item {}
[s] request <GET https://www.leboncoin.fr>
[s] settings <scrapy.settings.Settings object at 0x10716b8d0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
If I use:
view(response)
An AttributeError is printed:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-1-2c2544195c90> in <module>()
----> 1 view(response)
/usr/local/lib/python3.6/site-packages/scrapy/utils/response.py in open_in_browser(response, _openfunc)
67 from scrapy.http import HtmlResponse, TextResponse
68 # XXX: this implementation is a bit dirty and could be improved
---> 69 body = response.body
70 if isinstance(response, HtmlResponse):
71 if b'<base' not in body:
AttributeError: 'NoneType' object has no attribute 'body'

Edit 1:

To rrschmidt: the complete log has been updated, and when I run
fetch('https://www.leboncoin.fr')
I get this:
2017-05-16 08:33:15 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://www.leboncoin.fr>
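That "Forbidden by robots.txt" line means Scrapy's RobotsTxtMiddleware dropped the request before it was ever downloaded, which is why the shell's `response` stays `None` and `view(response)` raises the AttributeError above. The check can be reproduced with Python's standard `urllib.robotparser`; note the robots.txt content below is a hypothetical blanket-disallow, not necessarily what leboncoin.fr actually serves:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the real file at
# https://www.leboncoin.fr/robots.txt may differ.
rules = [
    "User-agent: *",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

# Scrapy performs an equivalent check before downloading; a False
# result means the request is discarded and `response` is never set.
allowed = parser.can_fetch("Scrapy", "https://www.leboncoin.fr/")
print(allowed)  # → False
```

With a `Disallow: /` rule for all user agents, every URL on the site is off-limits, which matches the behaviour seen in the shell session.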
So, how can I fix it?
Thanks for your answers,
Chris
Answer
It looks like the website has restricted crawling via its robots.txt. It's usually polite to respect that wish.

But if you really want to scrape the site, you can change Scrapy's default behaviour by setting ROBOTSTXT_OBEY to False in your settings.py:

ROBOTSTXT_OBEY = False
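As a configuration sketch, the override goes into the project's settings.py; for a quick one-off test, Scrapy's `-s` command-line option can set the same value without editing any file:

```python
# settings.py (Scrapy project configuration fragment)

# Disable robots.txt compliance for the whole project.
# Scrapy's default (ROBOTSTXT_OBEY = True) drops any request
# that robots.txt forbids, which is what happened in the shell.
ROBOTSTXT_OBEY = False

# Alternatively, override the setting for a single shell session
# without touching settings.py:
#   scrapy shell -s ROBOTSTXT_OBEY=False 'https://www.leboncoin.fr'
```

After this change, `fetch('https://www.leboncoin.fr')` in the shell should download the page and populate `response`, so `view(response)` no longer fails.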