Scrapy spider finishing scraping process without scraping anything
Question
I have this spider that scrapes Amazon for information.
The spider reads a .txt file in which I write which products it must search for, and then visits the Amazon search page for each product, for example:
https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=laptop
I use the keyword=laptop part to change which product to search for.
The issue I'm having is that the spider just does not work, which is weird because a week ago it did its job just fine.
Also, no errors appear on the console; the spider starts, "crawls" the keyword, and then just stops.
Here is the full spider:
```python
import scrapy
import re
import string
import random
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from genericScraper.items import GenericItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request


class GenericScraperSpider(CrawlSpider):

    name = "generic_spider"

    # Dominio permitido
    allowed_domain = ['www.amazon.com']

    search_url = 'https://www.amazon.com/s?field-keywords={}'

    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'datosGenericos.csv'
    }

    rules = {
        # Gets all the elements in page 1 of the keyword I search
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//*[contains(@class, "s-access-detail-page")]')),
             callback='parse_item', follow=False)
    }

    def start_requests(self):
        txtfile = open('productosGenericosABuscar.txt', 'r')
        keywords = txtfile.readlines()
        txtfile.close()
        for keyword in keywords:
            yield Request(self.search_url.format(keyword))

    def parse_item(self, response):
        genericAmz_item = GenericItem()

        # info de producto
        categoria = response.xpath('normalize-space(//span[contains(@class, "a-list-item")]//a/text())').extract_first()
        genericAmz_item['nombreProducto'] = response.xpath('normalize-space(//span[contains(@id, "productTitle")]/text())').extract()
        genericAmz_item['precioProducto'] = response.xpath('//span[contains(@id, "priceblock")]/text()'.strip()).extract()
        genericAmz_item['opinionesProducto'] = response.xpath('//div[contains(@id, "averageCustomerReviews_feature_div")]//i//span[contains(@class, "a-icon-alt")]/text()'.strip()).extract()
        genericAmz_item['urlProducto'] = response.request.url
        genericAmz_item['categoriaProducto'] = re.sub('Back to search results for |"', '', categoria)

        yield genericAmz_item
```
Other spiders I made with a similar structure work fine; any idea what's going on?
Here's what I get in the console:
2019-01-31 22:49:26 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: genericScraper)
2019-01-31 22:49:26 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2019-01-31 22:49:26 [scrapy.crawler] INFO: Overridden settings: {'AUTOTHROTTLE_ENABLED': True, 'BOT_NAME': 'genericScraper', 'DOWNLOAD_DELAY': 3, 'FEED_FORMAT': 'csv', 'FEED_URI': 'datosGenericos.csv', 'NEWSPIDER_MODULE': 'genericScraper.spiders', 'SPIDER_MODULES': ['genericScraper.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.throttle.AutoThrottle']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-01-31 22:49:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-01-31 22:49:26 [scrapy.core.engine] INFO: Spider opened
2019-01-31 22:49:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-01-31 22:49:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on xxx.x.x.x:xxxx
2019-01-31 22:49:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s?field-keywords=Laptop> (referer: None)
2019-01-31 22:49:27 [scrapy.core.engine] INFO: Closing spider (finished)
2019-01-31 22:49:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 315,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2525,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 1, 1, 49, 27, 375619),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 1, 1, 49, 26, 478037)}
2019-01-31 22:49:27 [scrapy.core.engine] INFO: Spider closed (finished)
Answer
Interesting! This is possibly because the website isn't returning any data. Have you tried debugging with scrapy shell? If not, check whether response.body is returning the data you intend to crawl:
```python
def parse_item(self, response):
    from scrapy.shell import inspect_response
    inspect_response(response, self)
```
For more details, please read the detailed info on scrapy shell.
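Besides poking at the response interactively, a lightweight check can catch this case automatically. The sketch below guesses whether a response is an anti-bot/captcha page rather than real search results; the marker strings are assumptions, so adjust them to whatever you actually see in response.body:

```python
def looks_blocked(body: str) -> bool:
    """Heuristic guess at whether a response is an anti-bot page rather
    than real search results. The marker strings are assumptions; adjust
    them to match what you actually see in response.body."""
    markers = ("robot check", "enter the characters you see below", "captcha")
    lowered = body.lower()
    return any(marker in lowered for marker in markers)


# Inside a spider callback you could then log a warning instead of
# silently yielding nothing:
#     if looks_blocked(response.text):
#         self.logger.warning("Possible bot block: %s", response.url)
```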
After debugging, if you are still not getting the intended data, that means something more about the site is obstructing the crawling process. That could be a dynamic script or a cookie/local-storage/session dependency.
For dynamic/JS scripts, you can use selenium or splash:

selenium-with-scrapy-for-dynamic-page
handling-javascript-in-scrapy-with-splash
For a cookie/local-storage/session dependency, you have to look deeper into the inspect window and find out which values are essential for getting the data.