Calling the same spider programmatically

Problem description

I have a spider which crawls links for the websites passed to it. I want to start the same spider again, with a different set of data, once its execution is finished. How do I restart the same crawler? The websites are passed through a database. I want the crawler to run in an unlimited loop until all the websites are crawled. Currently I have to start the crawler with scrapy crawl first every time. Is there any way to start the crawler once and have it stop when all the websites are crawled?

I searched for the same and found a solution for handling the crawler once it is closed/finished, but I don't know how to call the spider from the closed_handler method programmatically.

Here is my code:

# imports implied by the snippet (Scrapy 1.x, where scrapy.xlib.pydispatch still exists)
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy.signalmanager import SignalManager
from scrapy.spiders import CrawlSpider
from scrapy.xlib.pydispatch import dispatcher
from twisted.internet import reactor


class MySpider(CrawlSpider):

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # run closed_handler when this spider finishes
        SignalManager(dispatcher.Any).connect(
            self.closed_handler, signal=signals.spider_closed)

    def closed_handler(self, spider):
        # try to start the same spider again once the current run has closed
        reactor.stop()
        settings = Settings()
        crawler = Crawler(settings)  # this is the line that raises the error below
        crawler.signals.connect(spider.spider_closing, signal=signals.spider_closed)
        crawler.configure()
        crawler.crawl(MySpider())
        crawler.start()
        reactor.run()

    # code for getting the websites from the database
    name = "first"

    def parse_url(self, response):
        ...

I am getting the error:

Error caught on signal handler: <bound method ?.closed_handler of <MySpider 'first' at 0x40f8c70>>

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "c:\python27\lib\site-packages\scrapy\xlib\pydispatch\robustapply.py", line 57, in robustApply
    return receiver(*arguments, **named)
  File "G:\Scrapy\web_link_crawler\web_link_crawler\spiders\first.py", line 72, in closed_handler
    crawler = Crawler(settings)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 32, in __init__
    self.spidercls.update_settings(self.settings)
AttributeError: 'Settings' object has no attribute 'update_settings'
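
The traceback points at the crawler = Crawler(settings) call: in Scrapy 1.x the Crawler constructor takes a spider class as its first argument, so the Settings object lands where a spider class is expected and Scrapy tries to call update_settings() on it. For comparison, here is a minimal sketch of starting a spider programmatically with that newer API (assuming Scrapy >= 1.0 and that MySpider is importable), using CrawlerProcess, which manages the Twisted reactor itself:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl(MySpider)  # pass the spider class itself, not a Settings object or an instance
process.start()          # starts the reactor and blocks until the crawl is finished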

Is this the right way to get this done? Or is there any other way? Please help!

Thanks

Recommended answer

Another way to do it would be to make a new script where you select the links from the database and save them to a file, and then call the scrapy script this way:

os.system("scrapy crawl first")

and load the links from the file in your spider and work from there.
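
A minimal sketch of the spider side, assuming the outer script writes one URL per line to a file such as links.txt (the filename and output format here are just an example):

import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"

    def start_requests(self):
        # links.txt is written by the outer script, one URL per line
        with open("links.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # collect every link found on the page
        for href in response.css("a::attr(href)").extract():
            yield {"url": response.urljoin(href)}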

If you want to constantly check the database for new links, just have the first script query the database from time to time in an infinite loop and make the scrapy call whenever there are new links!
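
A minimal sketch of that first script, where get_new_links() is a hypothetical helper standing in for your own database query:

# hypothetical outer script: poll the database and trigger the spider
import os
import time

def get_new_links():
    # placeholder for your own database query returning uncrawled URLs
    return []

while True:
    links = get_new_links()
    if links:
        # write one URL per line for the spider to read
        with open("links.txt", "w") as f:
            f.write("\n".join(links) + "\n")
        # blocks until "scrapy crawl first" finishes, then we poll again
        os.system("scrapy crawl first")
    time.sleep(60)  # wait before checking the database again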
