Running Multiple Scrapy Spiders (the easy way) Python


Question

Scrapy is pretty cool, however I found the documentation to be very bare bones, and some simple questions were tough to answer. After putting together various techniques from various Stack Overflow answers I have finally come up with an easy and not overly technical way to run multiple Scrapy spiders. I would imagine it's less technical than trying to implement scrapyd, etc.

So here is one spider that works well at doing its one job of scraping some data after a FormRequest:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import FormRequest
from swim.items import SwimItem

class MySpider(BaseSpider):
    name = "swimspider"
    start_urls = ["swimming website"]

    def parse(self, response):
        # Submit the search form with the age range I want and parse the results page.
        return [FormRequest.from_response(response, formname="AForm",
                    formdata={"lowage": "20", "highage": "25"},
                    callback=self.parse1, dont_click=True)]

    def parse1(self, response):
        #open_in_browser(response)
        hxs = Selector(response)
        rows = hxs.xpath(".//tr")
        items = []

        # Rows 4-53 hold the result table; build one item per row.
        for row in rows[4:54]:
            item = SwimItem()
            item["names"] = row.xpath(".//td[2]/text()").extract()
            item["age"] = row.xpath(".//td[3]/text()").extract()
            item["swimtime"] = row.xpath(".//td[4]/text()").extract()
            item["team"] = row.xpath(".//td[6]/text()").extract()
            items.append(item)
        return items

Instead of deliberately writing out the formdata with the form inputs I wanted, i.e. "20" and "25":

formdata={"lowage": "20", "highage": "25}

I used "self." + a variable name:

formdata={"lowage": self.lowage, "highage": self.highage}

This then allows you to call the spider from the command line with the arguments that you want (see below). Use Python's subprocess call() function to run those command lines one after another, easily. It means I can go to my command line, type "python scrapymanager.py" and have all of my spiders do their thing, each with different arguments passed on its command line, and download their data to the correct place:

#scrapymanager

from random import randint
from time import sleep
from subprocess import call

#free
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='10025' -o free.json -t json"], shell=True)
sleep(randint(15,45))

#breast
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='30025' -o breast.json -t json"], shell=True)
sleep(randint(15,45))

#back
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='20025' -o back.json -t json"], shell=True)
sleep(randint(15,45))

#fly
call(["scrapy crawl swimspider -a lowage='20' -a highage='25' -a sex='W' -a StrkDist='40025' -o fly.json -t json"], shell=True)
sleep(randint(15,45))

So rather than spending hours trying to rig up a complicated single spider that crawls each form in succession (in my case different swim strokes), this is a pretty painless way to run many spiders "all at once" (I did include a delay between each scrapy call with the sleep() function).

Hopefully this helps someone.

Answer

Here is the easy way. You need to save this code in the same directory as scrapy.cfg (my Scrapy version is 1.3.3):

from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

setting = get_project_settings()
process = CrawlerProcess(setting)

# Schedule every spider registered in the project.
# (In newer Scrapy versions, process.spiders is spelled process.spider_loader.)
for spider_name in process.spiders.list():
    print("Running spider %s" % spider_name)
    process.crawl(spider_name, query="dvh")  # "query" is a custom argument used in your spiders

process.start()

And run it. That's it!
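
The same mechanism as the -a arguments above applies here: any extra keyword passed to process.crawl() is copied onto the spider instance, so inside each spider it is available as self.query. A minimal, hypothetical sketch of a spider consuming it (the spider name and URL below are made up for illustration):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # self.query holds whatever was passed via process.crawl(..., query="dvh")
        # or via "scrapy crawl example -a query=dvh" on the command line.
        url = "http://www.example.com/search?q=" + self.query
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %s for query %s", response.url, self.query)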
