Unable to make my script stop when some urls are scraped


Problem Description


I've created a script in scrapy to parse the titles of different sites listed in start_urls. The script is doing its job flawlessly.

What I wish to do now is let my script stop after two of the urls are parsed, no matter how many urls there are.

What I've tried so far:

import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]

    def parse(self, response):
        yield {'title':response.css('title::text').get()}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
    })
    c.crawl(TitleSpider)
    c.start()

How can I make my script stop when two of the listed urls are scraped?

Solution

Currently, the only way I see to stop this script immediately is to use the os._exit force-exit function, which terminates the whole Python process without waiting for the Twisted reactor to shut down:

import os
import scrapy
from scrapy.crawler import CrawlerProcess

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/","https://www.yahoo.com/","https://www.bing.com/"]
    item_counter = 0

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.item_counter += 1
        print(self.item_counter)
        if self.item_counter >= 2:
            # Close the stats collection for this spider, then force-exit the whole process
            self.crawler.stats.close_spider(self, "2 items")
            os._exit(0)

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(TitleSpider)
    c.start()

Other things that I tried, but without getting the required result (the script did not stop immediately after 2 scraped items, even with only 3 urls in start_urls):

  1. Transferring the CrawlerProcess instance into the spider settings and calling CrawlerProcess.stop, reactor.stop(), and similar methods from the parse method.
  2. Using the CloseSpider extension (docs, source) with the following CrawlerProcess definition:

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'EXTENSIONS': {
            'scrapy.extensions.closespider.CloseSpider': 500,
        },
        'CLOSESPIDER_ITEMCOUNT': 2
    })

  3. Reducing the CONCURRENT_REQUESTS setting to 1 (together with a raise CloseSpider condition in the parse method), as sketched below.
    When the application has scraped 2 items and reaches the line with raise CloseSpider, the 3rd request has already started in another thread.
    If the spider is stopped the conventional way, the application stays active until it has processed the previously sent requests and their responses, and only after that does it close.

Since your application has a relatively low number of urls in start_urls, it starts processing all of them long before it reaches raise CloseSpider.
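
For reference, a minimal sketch of that third attempt, assuming nothing beyond what is described above (the CONCURRENT_REQUESTS value and the close reason string are illustrative). As explained, raise CloseSpider only asks Scrapy to shut down gracefully, so any request already in flight is still downloaded and processed before the crawl ends:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.exceptions import CloseSpider

class TitleSpider(scrapy.Spider):
    name = "title_bot"
    start_urls = ["https://www.google.com/", "https://www.yahoo.com/", "https://www.bing.com/"]
    # Process one request at a time (the setting reduction tried in point 3)
    custom_settings = {'CONCURRENT_REQUESTS': 1}
    item_counter = 0

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
        self.item_counter += 1
        if self.item_counter >= 2:
            # Graceful shutdown: already-sent requests still get their responses handled
            raise CloseSpider("2 items scraped")

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(TitleSpider)
    c.start()

This closes the spider cleanly, but not instantly, which is exactly why the os._exit approach above is the only way I found to stop the script at once.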
