How to list loaded resources with Selenium/PhantomJS?

Question

I want to load a webpage and list all loaded resources (javascript/images/css) for that page. I use this code to load the page:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')

The code above works perfectly and I can do some processing on the HTML page. The question is, how do I list all of the resources loaded by that page? I want something like this:

['http://example.com/img/logo.png',
 'http://example.com/css/style.css',
 'http://example.com/js/jquery.js',
 'http://www.google-analytics.com/ga.js']

I am also open to other solutions, such as using the PySide.QWebView module. I just want to list the resources loaded by the page.

Recommended answer

This is not a Selenium solution, but it works well with Python and PhantomJS.

The idea is to do exactly what the 'Network' tab in Chrome Developer Tools does. To do so, we have to listen to every request made by the webpage.

With PhantomJS, this can be done with the following script; adapt it as needed:

// getResources.js
// Usage: 
// ./phantomjs --ssl-protocol=any --web-security=false getResources.js your_url
// the ssl-protocol and web-security flags are added to dismiss SSL errors

var page = require('webpage').create();
var system = require('system');
var urls = Array();

// function to check if the requested resource is an image
function isImg(url) {
  var acceptedExts = ['jpg', 'jpeg', 'png'];
  var baseUrl = url.split('?')[0];
  var ext = baseUrl.split('.').pop().toLowerCase();
  if (acceptedExts.indexOf(ext) > -1) {
    return true;
  } else {
    return false;
  }
}

// function to check if a URL has a given extension
function isExt(url, ext) {
  var baseUrl = url.split('?')[0];
  var fileExt = baseUrl.split('.').pop().toLowerCase();
  if (ext == fileExt) {
    return true;
  } else {
    return false;
  }
}

// Listen for all requests made by the webpage
// (like the 'Network' tab of Chrome Developer Tools)
// and add them to an array
page.onResourceRequested = function(request, networkRequest) {
  // If the requested URL is that of the webpage itself, do nothing,
  // to allow other resource requests
  if (system.args[1] == request.url) {
    return;
  } else if (isImg(request.url) || isExt(request.url, 'js') || isExt(request.url, 'css')) {
    // The URL is an image, CSS or JS file:
    // add it to the array
    urls.push(request.url);
    // abort the request for a better response time;
    // can be omitted to collect asynchronously loaded files
    networkRequest.abort();
  }
};

// When all requests are made, output the array to the console
page.onLoadFinished = function(status) {
  console.log(JSON.stringify(urls));
  phantom.exit();
};

// If an error occurs, dismiss it
page.onResourceError = function() {
  return false;
};
page.onError = function() {
  return false;
};

// Open the web page
page.open(system.args[1]);
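
On the Python side, the extension test performed by isImg and isExt can be mirrored for post-filtering, for example if the script is changed to report every request. The following is only a sketch and not part of the original answer: the names KINDS and resource_kind are hypothetical, and it uses urlsplit so that both the query string and any fragment are ignored.

from os.path import splitext
from urllib.parse import urlsplit

# Hypothetical mapping; the extensions mirror those checked in getResources.js.
KINDS = {
    'jpg': 'image', 'jpeg': 'image', 'png': 'image',
    'js': 'javascript',
    'css': 'css',
}

def resource_kind(url):
    """Return 'image', 'javascript' or 'css', or None for anything else."""
    path = urlsplit(url).path  # drops the query string and the fragment
    ext = splitext(path)[1].lstrip('.').lower()
    return KINDS.get(ext)

Calling resource_kind on each collected URL makes it possible to group the results by type without touching the PhantomJS script.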

Python part

Now call the script from Python:

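from subprocess import check_output
import json

your_url = 'http://example.com'  # placeholder: the page to inspect

# Run the PhantomJS script and capture what it prints to stdout
out = check_output(['./phantomjs', '--ssl-protocol=any',
    '--web-security=false', 'getResources.js', your_url])
data = json.loads(out)  # list of resource URLs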
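
If the call is needed in more than one place, it may be convenient to wrap it in a function. The sketch below is one possible way to do that, not the answer's own code: list_resources and parse_resources are hypothetical names, the timeout parameter requires Python 3.3 or later, and reading only the last non-empty line of the output is a defensive assumption in case PhantomJS prints anything before the JSON array.

import json
from subprocess import check_output

def parse_resources(output):
    """Return the JSON array found on the last non-empty line of the output."""
    text = output.decode('utf-8') if isinstance(output, bytes) else output
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    return json.loads(lines[-1])

def list_resources(url, phantomjs='./phantomjs', script='getResources.js', timeout=60):
    """Run getResources.js against url and return the list of resource URLs."""
    out = check_output(
        [phantomjs, '--ssl-protocol=any', '--web-security=false', script, url],
        timeout=timeout,  # raises subprocess.TimeoutExpired if PhantomJS hangs
    )
    return parse_resources(out)

With this in place, list_resources('http://example.com') would return the same list as data above, provided the phantomjs binary and getResources.js are in the current directory.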

Hope this helps.
