Using phantomjs for dynamic content with scrapy and selenium possible race condition


Problem description

First off, this is a follow up question from here: Change number of running spiders scrapyd

I used phantomjs and selenium to create a downloader middleware for my scrapy project. It works well and hasn't really slowed things down when I run my spiders one at a time locally.

But just recently I put a scrapyd server up on AWS. I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems from two separate issues.

1) Spiders trying to use the phantomjs executable at the same time.

2) Spiders trying to log to phantomjs's ghostdriver log file at the same time.

Guessing here, but the performance issue may be the spiders waiting until the resources become available (this could be related to the fact that I also had a race condition on an sqlite database).
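As an aside, the sqlite contention mentioned above can often be reduced without limiting concurrency, by enabling WAL journaling and a busy timeout on each connection. A minimal sketch (the helper name and its usage are mine, not from the question):

```python
import sqlite3

def open_shared_db(path):
    """Open a sqlite file shared by several spider processes in a way
    that tolerates concurrent writers instead of failing immediately."""
    conn = sqlite3.connect(path, timeout=30)  # wait up to 30s on a locked db
    conn.execute("PRAGMA journal_mode=WAL")   # writers no longer block readers
    return conn
```

With WAL mode, "database is locked" errors become far less likely because readers and a single writer can proceed concurrently, and the timeout makes each process wait rather than raise.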

Here are the errors I'm getting:

exceptions.IOError: [Errno 13] Permission denied: 'ghostdriver.log' (log file race condition?)

selenium.common.exceptions.WebDriverException: Message: 'Can not connect to GhostDriver' (executable race condition?)
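The `Errno 13` error is consistent with several scrapyd processes opening the same `ghostdriver.log`. One way to rule that race out is to give each process its own log file; a minimal sketch (the helper is mine, not part of the original middleware):

```python
import os
import tempfile

def ghostdriver_log_path(base_dir=None):
    """Return a per-process log file name so that concurrent spiders
    never open (and fight over) the same ghostdriver.log."""
    base_dir = base_dir or tempfile.gettempdir()
    return os.path.join(base_dir, "ghostdriver-%d.log" % os.getpid())
```

The result would then be passed as the `service_log_path` argument when constructing `webdriver.PhantomJS`, the same keyword the middleware below already uses.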

My questions are:

Does my analysis of what the problem(s) are seem correct?

Are there any known solutions to this problem other than limiting the number of spiders that can be run at a time?
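For reference, if limiting concurrency does turn out to be necessary, scrapyd exposes this directly in `scrapyd.conf` rather than requiring application-level throttling (the values below are illustrative):

```ini
[scrapyd]
# Hard cap on concurrent spider processes (0 means derive from CPU count)
max_proc = 4
# Per-CPU cap, used when max_proc is 0
max_proc_per_cpu = 1
```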

Is there some other way I should be handling javascript? (if you think I should create an entirely new question to discuss the best way to handle javascript with scrapy let me know and I will)

Here is my downloader middleware:

from sys import platform as _platform

from selenium import webdriver
from scrapy.http import HtmlResponse

# settings and check_spider_middleware come from elsewhere in my project

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        if _platform == "linux" or _platform == "linux2":
            driver = webdriver.PhantomJS(service_log_path='/var/log/scrapyd/ghost.log')
        else:
            driver = webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)
        driver.get(request.url)
        body = driver.page_source.encode('utf-8')
        driver.quit()  # release the phantomjs process instead of leaking it
        return HtmlResponse(request.url, encoding='utf-8', body=body)

note: the _platform code is a temporary workaround until I get this source code deployed into a static environment.

I found solutions on SO for the javascript problem, but they were spider-based. That bothered me because it meant every request had to be made once in the downloader handler and again in the spider. That is why I decided to implement mine as a downloader middleware.

Answer

Try using scrapy-webdriver to interface with phantomjs: https://github.com/brandicted/scrapy-webdriver
