Prevent CSS/other resource download in PhantomJS/Selenium driven by Python

Question

I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need from each page are the src and alt attributes of its img tags. I've found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

Source: "How can I control PhantomJS to skip downloading some kind of resource?"

How and where can I implement this code in Selenium driven by Python? Or is there another, better way to stop CSS/other resources from downloading?

Note: I've already found how to prevent image downloads by editing the service_args variable, via:

"How do I set a proxy for phantomjs/ghostdriver in python webdriver?"

"PhantomJS 1.8 with Selenium on python. How to block images?"

But service_args can't help me with resources like CSS. Thanks!
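For reference, the service_args approach mentioned in the note can be sketched roughly as follows. The helper name phantomjs_service_args is hypothetical; --load-images and --proxy are PhantomJS command-line options, and the commented usage assumes a Selenium release that still ships the PhantomJS driver.

```python
def phantomjs_service_args(load_images=False, proxy=None):
    """Build a PhantomJS command-line argument list (hypothetical helper)."""
    # --load-images=false tells PhantomJS not to fetch inline images.
    args = ["--load-images=%s" % ("true" if load_images else "false")]
    if proxy:
        # Optional proxy in host:port form.
        args.append("--proxy=%s" % proxy)
    return args


# Usage (needs Selenium with PhantomJS support and a phantomjs binary on PATH):
# from selenium import webdriver
# driver = webdriver.PhantomJS(service_args=phantomjs_service_args())
```

This only switches image loading on or off; there is no comparable flag for stylesheets, which is why the question asks about onResourceRequested.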

Recommended answer

A bold young soul by the name of "watsonmw" recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls which require a page object, like the onResourceRequested one you cited.

For a solution at all costs, consider building from source (which the developers note "takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine") and integrating his patch.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Until that patch is part of your PhantomJS build, you'll just get a "Can't find variable: page" exception.
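To make the proof of concept a little more concrete, here is a minimal and equally untested sketch that fills in the handler body to abort requests by file extension. The helper names build_block_script and install_resource_blocker are hypothetical, and the script assumes that Ghostdriver runs the fragment with `this` bound to the current page; if your build behaves differently, the `var page = this;` line will need adjusting.

```python
import json

EXECUTE_PHANTOM_SCRIPT = "executePhantomScript"


def build_block_script(extensions):
    """Return PhantomJS JavaScript that aborts requests for the given file extensions."""
    # JSON-encode the list so it becomes a valid JavaScript array literal.
    blocked = json.dumps([ext.lower().lstrip(".") for ext in extensions])
    return """
var page = this;  // assumption: Ghostdriver binds `this` to the current page
var blocked = %s;
page.onResourceRequested = function(requestData, request) {
    // Ignore the query string and fragment, then compare the file extension.
    var path = requestData.url.split(/[?#]/)[0].toLowerCase();
    var ext = path.slice(path.lastIndexOf('.') + 1);
    if (blocked.indexOf(ext) !== -1) {
        request.abort();
    }
};
""" % blocked


def install_resource_blocker(driver, extensions=("css",)):
    """Register the Ghostdriver command on `driver` and install the handler."""
    # Same hack as in the answer: the Python bindings do not know this command.
    driver.command_executor._commands[EXECUTE_PHANTOM_SCRIPT] = (
        "POST", "/session/$sessionId/phantom/execute")
    return driver.execute(
        EXECUTE_PHANTOM_SCRIPT,
        {"script": build_block_script(extensions), "args": []})


# Usage (needs Selenium with PhantomJS support and a patched PhantomJS build):
# from selenium import webdriver
# driver = webdriver.PhantomJS('phantomjs')
# install_resource_blocker(driver, ["css", "woff", "ttf"])
```

Matching on the URL's extension is only a heuristic: a stylesheet served from a URL without a .css suffix would still be downloaded.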

Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, using a proxy, etc.
