How to stop a Scrapy spider after a certain number of requests?
Problem description
I am developing a simple scraper to get 9gag posts and their images, but due to some technical difficulties I am unable to stop the scraper: it keeps on scraping, which I don't want. I want to increment a counter and stop after 100 posts. However, the 9gag page is designed so that each response provides only 10 posts, and after each iteration my counter value resets to 10; as a result my loop runs indefinitely and never stops.
# -*- coding: utf-8 -*-
import scrapy
from _9gag.items import GagItem


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = (
        'http://www.9gag.com/',
    )

    last_gag_id = None

    def parse(self, response):
        count = 0
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            count += 1
            if gag_id:
                if (count != 100):
                    last_gag_id = gag_id[0]
                    ninegag_item = GagItem()
                    ninegag_item['entry_id'] = gag_id[0]
                    ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
                    ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
                    ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
                    ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
                    ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
                    yield ninegag_item
                else:
                    break

        next_url = 'http://9gag.com/?id=%s&c=200' % last_gag_id
        yield scrapy.Request(url=next_url, callback=self.parse)
        print(count)
The code for items.py is here:
from scrapy.item import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()
So I want to increment a global count value. I tried this by passing 3 arguments to the parse function, but it gives this error:
TypeError: parse() takes exactly 3 arguments (2 given)
So is there a way to pass a global count value, return it after each iteration, and stop after (say) 100 posts?
The entire project is available on GitHub. Even if I set POST_LIMIT=100 the infinite loop still happens; here is the command I executed:
scrapy crawl first -s POST_LIMIT=10 --output=output.json
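A side note on the command above: POST_LIMIT does not appear to be a built-in Scrapy setting, so passing it with -s would only have an effect if the spider code reads it itself. Scrapy does ship a CloseSpider extension with a built-in CLOSESPIDER_ITEMCOUNT setting, which closes the spider once that many items have been scraped. A minimal sketch of a settings fragment, assuming a limit of 100 items is what is wanted:

```python
# settings.py of the Scrapy project (a fragment, not a complete settings file).
# CLOSESPIDER_ITEMCOUNT is a built-in Scrapy setting handled by the CloseSpider
# extension; the value 100 is only an illustration taken from the question.
CLOSESPIDER_ITEMCOUNT = 100
```

The same value can be supplied for a single run with -s CLOSESPIDER_ITEMCOUNT=100. Requests that are already in progress are still processed, so the final output may contain somewhat more than 100 items.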
Recommended answer
First: use self.count and initialize it outside of parse. Then don't prevent the parsing of the items; instead, stop generating new requests once the limit is reached. See the following code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Item, Field


class GagItem(Item):
    entry_id = Field()
    url = Field()
    votes = Field()
    comments = Field()
    title = Field()
    img_url = Field()


class FirstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["9gag.com"]
    start_urls = ('http://www.9gag.com/', )

    last_gag_id = None
    COUNT_MAX = 30  # stop requesting new pages once this many posts are scraped
    count = 0       # kept on the spider, so it persists across parse() calls

    def parse(self, response):
        for article in response.xpath('//article'):
            gag_id = article.xpath('@data-entry-id').extract()
            ninegag_item = GagItem()
            ninegag_item['entry_id'] = gag_id[0]
            ninegag_item['url'] = article.xpath('@data-entry-url').extract()[0]
            ninegag_item['votes'] = article.xpath('@data-entry-votes').extract()[0]
            ninegag_item['comments'] = article.xpath('@data-entry-comments').extract()[0]
            ninegag_item['title'] = article.xpath('.//h2/a/text()').extract()[0].strip()
            ninegag_item['img_url'] = article.xpath('.//div[1]/a/img/@src').extract()
            self.last_gag_id = gag_id[0]
            self.count = self.count + 1
            yield ninegag_item

        # Every item on the current page is still parsed; only the follow-up
        # request is skipped once the limit has been reached.
        if (self.count < self.COUNT_MAX):
            next_url = 'http://9gag.com/?id=%s&c=10' % self.last_gag_id
            yield scrapy.Request(url=next_url, callback=self.parse)
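To see why the counter has to live on the spider rather than inside parse, here is a small simulation that does not use Scrapy at all. It is only a sketch under stated assumptions: the two classes, the run helper, and PAGE_SIZE are hypothetical names invented for illustration, and the "engine" is reduced to a loop that calls parse once per queued request. Written for Python 3.

```python
PAGE_SIZE = 10  # assumed: each response carries 10 posts, as in the question


class LocalCounterSpider:
    """Counter is a local variable, so it restarts on every parse() call."""
    COUNT_MAX = 30

    def parse(self, first_id):
        count = 0  # reset on each call; can never exceed PAGE_SIZE
        for post_id in range(first_id, first_id + PAGE_SIZE):
            count += 1
            yield ("item", post_id)
        if count < self.COUNT_MAX:  # always true, so paging never ends
            yield ("request", first_id + PAGE_SIZE)


class InstanceCounterSpider:
    """Counter is stored on the instance, so it survives across calls."""
    COUNT_MAX = 30
    count = 0

    def parse(self, first_id):
        for post_id in range(first_id, first_id + PAGE_SIZE):
            self.count += 1
            yield ("item", post_id)
        if self.count < self.COUNT_MAX:  # false once 30 posts were seen
            yield ("request", first_id + PAGE_SIZE)


def run(spider, max_calls=50):
    """Toy stand-in for the engine: one parse() call per queued request.

    max_calls is a safety cap so the non-terminating variant still returns.
    """
    queue, items, calls = [0], [], 0
    while queue and calls < max_calls:
        calls += 1
        for kind, value in spider.parse(queue.pop(0)):
            if kind == "item":
                items.append(value)
            else:
                queue.append(value)
    return items, calls


good_items, good_calls = run(InstanceCounterSpider())
bad_items, bad_calls = run(LocalCounterSpider())
print(len(good_items), good_calls)  # 30 3
print(len(bad_items), bad_calls)    # 500 50 (stopped only by the safety cap)
```

With the instance counter the run ends after three pages. With the local counter the only thing that stops it is the max_calls cap, which mirrors the endless crawl described in the question.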