Prevent CSS/other resource download in PhantomJS/Selenium driven by Python

Question

I'm trying to speed up a Selenium/PhantomJS web scraper in Python by preventing the download of CSS and other resources. All I need from each page are the src and alt attributes of its img tags. I've found this code:

page.onResourceRequested = function(requestData, request) {
    if ((/http:\/\/.+?\.css/gi).test(requestData['url']) || requestData['Content-Type'] == 'text/css') {
        console.log('The url of the request is matching. Aborting: ' + requestData['url']);
        request.abort();
    }
};

via: How can I control PhantomJS to skip download some kind of resource?

How/where can I implement this code in Selenium driven by Python? Or, is there another better way to stop CSS/other resources from downloading?

Note: I've already found how to prevent image download by editing service_args variable via:

How do I set a proxy for phantomjs/ghostdriver in python webdriver?

and

PhantomJS 1.8 with Selenium on python. How to block images?

But service_args can't help me with resources like CSS. Thanks!
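For reference, the service_args approach from those two links looks roughly like the sketch below. The helper names phantomjs_service_args and make_driver are made up for illustration; --load-images and --proxy are PhantomJS command-line options, and webdriver.PhantomJS(service_args=...) is assumed to be available, which is only true of Selenium releases that still ship the PhantomJS driver.

```python
def phantomjs_service_args(load_images=False, proxy=None):
    """Build the PhantomJS command-line flags (hypothetical helper)."""
    # --load-images=false tells PhantomJS not to fetch inline images.
    args = ['--load-images=%s' % ('true' if load_images else 'false')]
    if proxy:
        # --proxy takes an address:port pair.
        args.append('--proxy=%s' % proxy)
    return args


def make_driver(load_images=False, proxy=None):
    """Start PhantomJS with those flags (hypothetical helper)."""
    # Imported lazily so the flag builder can be used without Selenium.
    # Assumption: a Selenium version that still provides webdriver.PhantomJS.
    from selenium import webdriver
    return webdriver.PhantomJS(
        service_args=phantomjs_service_args(load_images, proxy))
```

These flags only cover whole categories that PhantomJS exposes on its command line, which is why they do nothing for CSS.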

Solution

A bold young soul by the name of "watsonmw" recently added functionality to Ghostdriver (which Phantom.js uses to interface with Selenium) that allows access to Phantom.js API calls which require a page object, like the onResourceRequested one you cited.

For a solution at all costs, consider building from source (which developers note "takes roughly 30 minutes ... with 4 parallel compile jobs on a modern machine") and integrating his patch, linked above.

Then this (untested) Python code should work as a proof of concept:

from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')

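# Workaround while the Python bindings lag behind: register the new
# Ghostdriver phantom/execute endpoint by hand under a custom command name.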
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.execute('executePhantomScript', {'script': '''
page.onResourceRequested = function(requestData, request) {
    // ...
}
''', 'args': []})

Until that patch is part of your build, you'll just get a "Can't find variable: page" exception.
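With a patched build in place, the proof of concept could be filled out along these lines. This is an equally untested sketch: build_block_script and block_resources are hypothetical helpers, and the line "var page = this;" rests on the assumption that Ghostdriver runs the script with this bound to the current page. The URL test is also my own variation on the regex from the question, so that https URLs and URLs with a query string are matched as well.

```python
import json


def build_block_script(extensions=("css",)):
    """Return PhantomJS-side JavaScript that aborts matching requests.

    Hypothetical helper; extensions are assumed to be plain
    alphanumeric strings such as "css" or "woff".
    """
    # Match ".css" (etc.) at the end of the URL or before a query/fragment.
    pattern = r"\.(?:%s)(?:[?#]|$)" % "|".join(extensions)
    return """
var page = this;  // assumption: Ghostdriver binds `this` to the current page
var blocked = new RegExp(%s, 'i');
page.onResourceRequested = function(requestData, request) {
    if (blocked.test(requestData.url)) {
        request.abort();
    }
};
""" % json.dumps(pattern)  # json.dumps yields a valid JS string literal


def block_resources(driver, extensions=("css",)):
    """Install the request filter on a PhantomJS driver (hypothetical helper)."""
    # Same workaround as above: register the endpoint by hand.
    driver.command_executor._commands['executePhantomScript'] = (
        'POST', '/session/$sessionId/phantom/execute')
    return driver.execute(
        'executePhantomScript',
        {'script': build_block_script(extensions), 'args': []})
```

Calling block_resources(driver, ("css", "woff")) right after creating the driver, and before the first driver.get(), should install the filter for the pages loaded afterwards, provided the endpoint exists in your build.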

Good luck! There are a lot of great alternatives, like working in a Javascript environment, driving Gecko, proxies, etc.
