How to setup and launch a Scrapy spider programmatically (urls and settings)

Problem description

I've written a working crawler using scrapy, and now I want to control it through a Django webapp, that is to say (see the sketch after this list):

  • set 1 or more start_urls
  • set 1 or more allowed_domains
  • set the settings values
  • start the spider
  • stop / pause / resume the spider
  • retrieve some stats while it is running
  • retrieve some stats after the spider has finished.
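
For illustration, here is a minimal sketch of this kind of in-process control, assuming a recent Scrapy version where scrapy.crawler.CrawlerProcess is available; the spider, URL, domain and settings below are only placeholders:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import Spider

class ControlledSpider(Spider):
    # A throwaway spider whose urls and domains are supplied at construction time
    name = 'controlled'

    def __init__(self, start_url=None, allowed_domain=None, *args, **kwargs):
        super(ControlledSpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url] if start_url else []
        self.allowed_domains = [allowed_domain] if allowed_domain else []

    def parse(self, response):
        self.logger.info("Crawled %s", response.url)

# Settings come from a plain dict instead of being hardcoded in settings.py
process = CrawlerProcess(settings={'DOWNLOAD_DELAY': 2})
process.crawl(ControlledSpider,
              start_url='http://example.com',
              allowed_domain='example.com')
process.start()  # blocks until the crawl is finished

Stats for a finished run can then be read from the crawler's stats collector.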

At first I thought scrapyd was made for this, but after reading the doc, it seems that it's more a daemon able to manage 'packaged spiders', aka 'scrapy eggs'; and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the 'scrapy egg' itself; so it doesn't look like a solution to my question, unless I missed something.

I also looked at this question: How to give URL to scrapy for crawling? But the best answer, to provide multiple urls, is qualified by the author himself as an 'ugly hack', involving some python subprocess and complex shell handling, so I don't think the solution is to be found here. Also, it may work for start_urls, but it doesn't seem to allow allowed_domains or settings.

Then I had a look at the scrapy webservice: it seems to be a good solution for retrieving stats. However, it still requires a running spider, and gives no clue about how to change settings.

There are several questions on this subject, but none of them seems satisfactory:

  • using-one-scrapy-spider-for-several-websites: this one seems outdated, as scrapy has evolved a lot since 0.7
  • creating-a-generic-scrapy-spider: no accepted answer, still talking about tweaking shell parameters.

I know that scrapy is used in production environments; and a tool like scrapyd shows that there are definitely some ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd is dealing with are generated by hand!)

Thanks a lot for your help.

Recommended answer

At first I thought scrapyd was made for this, but after reading the doc, it seems that it's more a daemon able to manage 'packaged spiders', aka 'scrapy eggs'; and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the 'scrapy egg' itself; so it doesn't look like a solution to my question, unless I missed something.

I don't agree with the above statement: start_urls need not be hard-coded; they can be passed to the class dynamically, and you should be able to pass them as an argument like this:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
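
From a Django view, the same call can be made with the requests library; this is only a sketch, assuming scrapyd is listening on localhost:6800, and myproject, somespider and the url value below are placeholders for your own project:

import requests

# Schedule a run on scrapyd; any extra parameter (here 'url') is passed
# to the spider's constructor as a keyword argument.
response = requests.post('http://localhost:6800/schedule.json', data={
    'project': 'myproject',
    'spider': 'somespider',
    'setting': 'DOWNLOAD_DELAY=2',
    'url': 'http://example.com/page-to-crawl',
})
print response.text  # a JSON reply such as {"status": "ok", "jobid": "..."}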

Or you should be able to retrieve the URLs from a database or a file. I get them from a database like this:

import urllib

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from myproject.items import MovieItem  # your own Item class; the import path is illustrative


class WikipediaSpider(BaseSpider):
    name = 'wikipedia'
    allowed_domains = ['wikipedia.com']
    start_urls = []

    def __init__(self, name=None, url=None, **kwargs):
        item = MovieItem()
        item['spider'] = self.name
        # You can pass a specific url to retrieve
        if url:
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):
                raise ValueError("%s must have a name" % type(self).__name__)
            self.__dict__.update(kwargs)
            self.start_urls = [url]
        else:
            # If there is no specific URL, get the links from the database
            wikiliks = None  # <-- CODE TO RETRIEVE THE LINKS FROM DB -->
            if wikiliks is None:
                print "**************************************"
                print "No Links to Query"
                print "**************************************"
                return

            for link in wikiliks:
                # SOME PROCESSING ON THE LINK GOES HERE
                self.start_urls.append(urllib.unquote_plus(link[0]))

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Remaining parse code goes here
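
With a constructor like this, the url (and any other keyword argument) can be supplied at launch time instead of being hardcoded, for example from the command line with -a, or through scrapyd's schedule.json as shown earlier; the project name and page URL below are placeholders:

scrapy crawl wikipedia -a url=http://en.wikipedia.org/wiki/Scrapy

curl http://localhost:6800/schedule.json -d project=myproject -d spider=wikipedia -d url=http://en.wikipedia.org/wiki/Scrapy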
