Prevent CSS/other resource download in PhantomJS/Selenium driven by Python
Question
I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need from each page is the src and alt attributes of the img tags. I've found this code:
page.onResourceRequested = function(requestData, request) {
    // Abort any request whose URL looks like a stylesheet.
    if ((/https?:\/\/.+?\.css/i).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};
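via: How can I control PhantomJS to skip download some kind of resource?
How and where can I implement this code in Selenium driven by Python? Or is there another, better way to stop CSS/other resources from downloading?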
Note: I've already found how to prevent image download by editing the service_args variable, via:
How do I set a proxy for phantomjs/ghostdriver in python webdriver?
and
PhantomJS 1.8 with Selenium on python. How to block images?
But service_args can't help me with resources like CSS. Thanks!
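For reference, the service_args approach mentioned in the note can be sketched roughly as follows. `--load-images`, `--proxy` and `--proxy-type` are PhantomJS command-line options; the helper names `build_service_args` and `make_driver` are made up for this illustration, and `webdriver.PhantomJS` is only available in older Selenium releases (it was removed in Selenium 4).

```python
def build_service_args(load_images=False, proxy=None, proxy_type="http"):
    """Return PhantomJS command-line flags as a list of strings.

    Hypothetical helper: it only assembles flags and does not start PhantomJS.
    """
    args = ["--load-images=" + ("true" if load_images else "false")]
    if proxy is not None:
        # proxy is expected as "host:port"
        args.append("--proxy=" + proxy)
        args.append("--proxy-type=" + proxy_type)
    return args


def make_driver(service_args):
    """Start PhantomJS with the given flags (needs an older Selenium and a phantomjs binary)."""
    # Imported lazily so build_service_args can be used without Selenium installed.
    from selenium import webdriver
    return webdriver.PhantomJS("phantomjs", service_args=service_args)


if __name__ == "__main__":
    print(build_service_args(proxy="127.0.0.1:8080"))
```

As the question says, these flags only cover images and proxying; there is no equivalent switch for stylesheets.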
Recommended answer
A bold young soul by the name of "watsonmw" recently added functionality to Ghostdriver (which PhantomJS uses to interface with Selenium) that allows access to PhantomJS API calls which require a page object, like the onResourceRequested one you cited.
For a solution at all costs, consider building from source (which the developers note "takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine") and integrating his patch, linked above.
Then this (untested) Python code should work as a proof of concept:
from selenium import webdriver

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags:
# register the Ghostdriver endpoint that runs a script inside PhantomJS
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
    page.onResourceRequested = function(requestData, request) {
        // ...
    }
''', 'args': []})
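To make the elided handler body concrete, here is one possible sketch that generates the script text in Python and sends it through the same endpoint. It is as untested against a real PhantomJS as the snippet above. The names `build_url_pattern`, `build_block_script` and `install_blocker` are hypothetical, and the `var page = this;` line assumes Ghostdriver runs the script with `this` bound to the current page object; if that does not hold for your build, that line needs adjusting.

```python
import json


def build_url_pattern(extensions=("css",)):
    """Return a regex (valid in both Python and JavaScript) for URLs ending in the given extensions.

    Extensions are assumed to be plain alphanumeric strings such as "css" or "woff".
    """
    alternation = "|".join(extensions)
    # Matches the extension at the end of the path, optionally followed by a query or fragment.
    return r"^https?://[^?#]+\.(?:%s)(?:[?#]|$)" % alternation


def build_block_script(extensions=("css",)):
    """Return JavaScript that aborts every request whose URL matches the pattern."""
    # json.dumps turns the pattern into a correctly escaped JavaScript string literal.
    return (
        "var page = this;\n"  # assumption: Ghostdriver binds `this` to the page
        "var blocked = new RegExp(%s, 'i');\n"
        "page.onResourceRequested = function(requestData, request) {\n"
        "    if (blocked.test(requestData.url)) {\n"
        "        request.abort();\n"
        "    }\n"
        "};\n"
    ) % json.dumps(build_url_pattern(extensions))


def install_blocker(driver, extensions=("css",)):
    """Register the Ghostdriver endpoint on an existing PhantomJS driver and send the script."""
    driver.command_executor._commands["executePhantomScript"] = (
        "POST", "/session/$sessionId/phantom/execute")
    driver.execute("executePhantomScript",
                   {"script": build_block_script(extensions), "args": []})


if __name__ == "__main__":
    print(build_block_script(("css", "woff")))
```

Keeping the script generation separate from the driver call means the pattern can be checked in plain Python before any browser is involved.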
Until then, you'll just get a "Can't find variable: page" exception.
Good luck! There are a lot of great alternatives, like working in a JavaScript environment, driving Gecko, proxies, etc.