How can I make Selenium run in parallel with Scrapy?

Question

I'm trying to scrape some urls with Scrapy and Selenium. Some of the urls are processed by Scrapy directly and the others are handled with Selenium first.

The problem is: while Selenium is handling a url, Scrapy is not processing the others in parallel. It waits for the webdriver to finish its work.

I have tried to run multiple spiders with different init parameters in separate processes (using a multiprocessing pool), but I got twisted.internet.error.ReactorNotRestartable. I also tried to spawn another process in the parse method, but it seems that I don't have enough experience to make it right.

In the example below all the urls are printed only when the webdriver is closed. Please advise, is there any way to make it run "in parallel"?

import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
             {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

    def start_requests(self):
        for task in self.tasks:
            yield scrapy.Request(url=task['start_url'], callback=self.parse, meta=task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)

Answer

It seems that I've found a solution.

I decided to use multiprocessing, running one spider in each process and passing a task as its init parameter. In some cases this approach may be inappropriate, but it works for me.

I tried this way before but I was getting the twisted.internet.error.ReactorNotRestartable exception. It was caused by calling the start() method of the CrawlerProcess in each process multiple times, which is incorrect. Here I found a simple and clear example of running a spider in a loop using callbacks.
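
For illustration, the failing pattern looks roughly like this (a rough reconstruction, not the exact code from the attempt described above): once Twisted's reactor has been started and stopped in a process, it cannot be started again, so any second call to CrawlerProcess.start() in that process raises the exception.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]

for task in tasks:
    process = CrawlerProcess(get_project_settings())
    process.crawl('test_spider', task=task)
    process.start()  # the first iteration works; the second one raises
                     # twisted.internet.error.ReactorNotRestartable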

So I split my tasks list between the processes. Then, inside the crawl(tasks) method, I make a chain of callbacks to run my spider multiple times, passing a different task as its init parameter each time.

import multiprocessing

import numpy as np
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

tasks = [{'start_url': 'https://www.theguardian.com/', 'selenium': False},
         {'start_url': 'https://www.nytimes.com/', 'selenium': True}]


def crawl(tasks):
    process = CrawlerProcess(get_project_settings())

    def run_spider(_, index=0):
        if index < len(tasks):
            # Schedule the next crawl and chain this callback onto it, so the
            # spiders in this process run one after another on a single reactor.
            deferred = process.crawl('test_spider', task=tasks[index])
            deferred.addCallback(run_spider, index + 1)
            return deferred

    run_spider(None)
    process.start()  # the reactor is started exactly once per process


def main():
    processes = 2
    with multiprocessing.Pool(processes) as pool:
        # Each worker process receives its own chunk of the task list.
        pool.map(crawl, np.array_split(tasks, processes))


if __name__ == '__main__':
    main()
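
As a side note, np.array_split is only used here to divide the task list into near-equal chunks, one per process. If you would rather avoid the NumPy dependency, a plain-Python helper gives the same chunking (a small sketch, not part of the original answer; split_tasks is a hypothetical name):

def split_tasks(tasks, n):
    # Split tasks into n near-equal chunks, mirroring np.array_split for plain lists.
    k, m = divmod(len(tasks), n)
    return [tasks[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

It would then be used as pool.map(crawl, split_tasks(tasks, processes)) in main().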

The code of TestSpider in my question post must be modified accordingly to accept a task as an init parameter.

def __init__(self, task):
    scrapy.Spider.__init__(self)
    self.task = task

def start_requests(self):
    yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)
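
For reference, the full TestSpider after this change might look like the sketch below (assembled from the original spider in the question plus the two methods above; the Selenium helper is unchanged):

import time

import scrapy
from selenium.webdriver import Firefox


def load_with_selenium(url):
    with Firefox() as driver:
        driver.get(url)
        time.sleep(10)  # Do something
        page = driver.page_source
    return page


class TestSpider(scrapy.Spider):
    name = 'test_spider'

    def __init__(self, task):
        scrapy.Spider.__init__(self)
        self.task = task  # the single task dict passed in via process.crawl(..., task=...)

    def start_requests(self):
        yield scrapy.Request(url=self.task['start_url'], callback=self.parse, meta=self.task)

    def parse(self, response):
        if response.meta['selenium']:
            response = response.replace(body=load_with_selenium(response.meta['start_url']))

        for url in response.xpath('//a/@href').getall():
            print(url)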
