How can I catch and process the data from the XHR responses using CasperJS?
Question
The data on the webpage is displayed dynamically. Checking for every change in the HTML and extracting the data seems like a very daunting task, and it would also require very unreliable XPaths. So I would like to extract the data from the XHR packets instead.
I hope to be able to extract information from XHR packets as well as generate XHR packets to be sent to the server.
The extraction part is more important to me, because sending information can be handled easily by automatically triggering HTML elements using CasperJS.
I'm attaching a screenshot of what I mean.
The text in the response tab is the data I need to process afterwards. (This XHR response has been received from the server.)
Recommended answer
This is not easily possible, because the resource.received event handler (http://docs.casperjs.org/en/latest/events-filters.html#resource-received) only provides metadata such as url, headers or status, but not the actual data. The underlying PhantomJS event handler behaves the same way.
If the AJAX call is stateless, you may repeat the request:
casper.on("resource.received", function(resource){
    // somehow identify this request, here: if it contains ".json"
    // it also only does something when the stage is "end", otherwise this would be executed twice
    if (resource.url.indexOf(".json") != -1 && resource.stage == "end") {
        var data = casper.evaluate(function(url){
            // synchronous GET request
            return __utils__.sendAJAX(url, "GET");
        }, resource.url);
        // do something with data, you might need to JSON.parse(data)
    }
});
casper.start(url); // your script
You may want to add the event listener to resource.requested (http://docs.casperjs.org/en/latest/events-filters.html#resource-requested) instead. That way you don't need to wait for the call to complete.
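A minimal sketch of such a resource.requested listener follows. One thing to watch for: the repeated request may itself fire the same event again, so the sketch only reacts the first time a URL is seen. The helper makeOnceFilter and the variable shouldRefetch are hypothetical names introduced for illustration, not part of CasperJS, and the ".json" marker is the same assumption as in the snippet above.

```javascript
// Hypothetical helper (not part of CasperJS): returns a predicate that is
// true only the first time a URL containing `marker` is seen.
function makeOnceFilter(marker) {
    var seen = {};
    return function (url) {
        if (url.indexOf(marker) === -1 || seen.hasOwnProperty(url)) {
            return false;
        }
        seen[url] = true;
        return true;
    };
}

var shouldRefetch = makeOnceFilter(".json");

// Only wire up the listener when the script runs under CasperJS,
// where a `casper` instance has already been created.
if (typeof casper !== "undefined") {
    casper.on("resource.requested", function (requestData) {
        if (shouldRefetch(requestData.url)) {
            var data = this.evaluate(function (url) {
                // synchronous GET request, as in the snippet above
                return __utils__.sendAJAX(url, "GET");
            }, requestData.url);
            // do something with data, e.g. JSON.parse(data)
        }
    });
}
```

The guard keeps the repeated request from being picked up by the listener a second time.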
You can also do this right inside the control flow, like this (source: the answer to "CasperJS waitForResource: how to get the resource i've waited for"):
casper.start(url);
var res, resData;
casper.waitForResource(function check(resource){
    res = resource;
    return resource.url.indexOf(".json") != -1;
}, function then(){
    resData = casper.evaluate(function(url){
        // synchronous GET request
        return __utils__.sendAJAX(url, "GET");
    }, res.url);
    // do something with the data here or in a later step
});
casper.run();
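As a variation on the snippet above, the check can record only the resource that actually matched, and waitForResource can be given a timeout handler. This is a sketch under assumptions: makeResourceCatcher and jsonCatcher are hypothetical names, the regular expression is just one way to recognise a ".json" URL, and the 10 second timeout is arbitrary. In a real script the guarded block would sit between casper.start(url) and casper.run().

```javascript
// Hypothetical helper (not part of CasperJS): builds a check function that
// remembers the first-class match instead of every resource it inspects.
function makeResourceCatcher(pattern) {
    var catcher = {
        resource: null,
        check: function (resource) {
            if (pattern.test(resource.url)) {
                catcher.resource = resource;
                return true;
            }
            return false;
        }
    };
    return catcher;
}

// Matches URLs whose path ends in ".json", with or without a query string.
var jsonCatcher = makeResourceCatcher(/\.json(\?|$)/);

// Only runs under CasperJS, after casper.start(url) has been called.
if (typeof casper !== "undefined") {
    casper.waitForResource(jsonCatcher.check, function then() {
        var body = this.evaluate(function (url) {
            // synchronous GET request
            return __utils__.sendAJAX(url, "GET");
        }, jsonCatcher.resource.url);
        // do something with body here or in a later step
    }, function onTimeout() {
        this.echo("No matching resource was seen in time.");
    }, 10000);
}
```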
Stateful AJAX Request
If it is not stateless, you would need to replace the implementation of XMLHttpRequest. You will need to inject your own implementation of the onreadystatechange handler, collect the information in the page's window object, and later retrieve it in another evaluate call.
You may want to look at the XHR faker in sinon.js, or use the following complete proxy for XMLHttpRequest (I modeled it after method 3 from "How can I create a XMLHttpRequest wrapper/proxy?"):
function replaceXHR(){
    (function(window, debug){
        function args(a){
            var s = "";
            for(var i = 0; i < a.length; i++) {
                s += "\t\n[" + i + "] => " + a[i];
            }
            return s;
        }
        var _XMLHttpRequest = window.XMLHttpRequest;
        window.XMLHttpRequest = function() {
            this.xhr = new _XMLHttpRequest();
        }
        // proxy ALL methods/properties
        var methods = [
            "open",
            "abort",
            "setRequestHeader",
            "send",
            "addEventListener",
            "removeEventListener",
            "getResponseHeader",
            "getAllResponseHeaders",
            "dispatchEvent",
            "overrideMimeType"
        ];
        methods.forEach(function(method){
            window.XMLHttpRequest.prototype[method] = function() {
                if (debug) console.log("ARGUMENTS", method, args(arguments));
                if (method == "open") {
                    // remember the URL so the response can be filtered later
                    this._url = arguments[1];
                }
                return this.xhr[method].apply(this.xhr, arguments);
            }
        });
        // proxy change event handler
        Object.defineProperty(window.XMLHttpRequest.prototype, "onreadystatechange", {
            get: function(){
                // this will probably never be called
                return this.xhr.onreadystatechange;
            },
            set: function(onreadystatechange){
                var that = this.xhr;
                var realThis = this;
                that.onreadystatechange = function(){
                    // request is fully loaded
                    if (that.readyState == 4) {
                        if (debug) console.log("RESPONSE RECEIVED:", typeof that.responseText == "string" ? that.responseText.length : "none");
                        // there is a response and filter execution based on url
                        if (that.responseText && realThis._url.indexOf("whatever") != -1) {
                            window.myAwesomeResponse = that.responseText;
                        }
                    }
                    onreadystatechange.call(that);
                };
            }
        });
        var otherscalars = [
            "onabort",
            "onerror",
            "onload",
            "onloadstart",
            "onloadend",
            "onprogress",
            "readyState",
            "responseText",
            "responseType",
            "responseXML",
            "status",
            "statusText",
            "upload",
            "withCredentials",
            "DONE",
            "UNSENT",
            "HEADERS_RECEIVED",
            "LOADING",
            "OPENED"
        ];
        otherscalars.forEach(function(scalar){
            Object.defineProperty(window.XMLHttpRequest.prototype, scalar, {
                get: function(){
                    return this.xhr[scalar];
                },
                set: function(obj){
                    this.xhr[scalar] = obj;
                }
            });
        });
    })(window, false);
}
If you want to capture the AJAX calls from the very beginning, you need to add this to one of the first event handlers:
casper.on("page.initialized", function(resource){
    this.evaluate(replaceXHR);
});
or evaluate(replaceXHR) when you need it.
The control flow would look like this:
function replaceXHR(){ /* from above */ }
casper.start(yourUrl, function(){
    this.evaluate(replaceXHR);
});
function getAwesomeResponse(){
    // expects `this` to be the casper instance
    return this.evaluate(function(){
        return window.myAwesomeResponse;
    });
}
// stops waiting if window.myAwesomeResponse is something that evaluates to true
casper.waitFor(getAwesomeResponse, function then(){
    // pass the casper instance as `this`; a bare call would lose it
    var data = JSON.parse(getAwesomeResponse.call(this));
    // Do something with data
});
casper.run();
As described above, I create a proxy for XMLHttpRequest so that every time it is used on the page, I can do something with it. The page that you scrape uses the xhr.onreadystatechange callback to receive data. The proxying is done by defining a specific setter function which writes the received data to window.myAwesomeResponse in the page context. The only thing you need to do is retrieve this text.
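The setter mechanism can be tried in isolation, without a browser. The sketch below is an illustration only: FakeXHR is a hypothetical stand-in for the real XMLHttpRequest, ProxyXHR is a stripped-down version of the proxy idea, and the captured array plays the role of window.myAwesomeResponse. It shows that assigning to onreadystatechange goes through the setter, which records the response and then still calls the page's own handler.

```javascript
// Hypothetical stand-in for the browser's XMLHttpRequest so the sketch runs anywhere.
function FakeXHR() {
    this.readyState = 0;
    this.responseText = "";
    this.onreadystatechange = null;
}
FakeXHR.prototype.open = function (method, url) {
    this.url = url;
};
FakeXHR.prototype.send = function () {
    // pretend the server answered immediately
    this.readyState = 4;
    this.responseText = '{"ok":true}';
    if (this.onreadystatechange) {
        this.onreadystatechange();
    }
};

// Plays the role of window.myAwesomeResponse.
var captured = [];

// Reduced proxy: forwards open/send and intercepts the handler assignment.
function ProxyXHR() {
    this.xhr = new FakeXHR();
}
ProxyXHR.prototype.open = function (method, url) {
    this._url = url;
    return this.xhr.open(method, url);
};
ProxyXHR.prototype.send = function () {
    return this.xhr.send.apply(this.xhr, arguments);
};
Object.defineProperty(ProxyXHR.prototype, "onreadystatechange", {
    set: function (handler) {
        var real = this.xhr;
        var self = this;
        real.onreadystatechange = function () {
            if (real.readyState === 4) {
                captured.push({ url: self._url, body: real.responseText });
            }
            // the page's own handler still runs afterwards
            handler.call(real);
        };
    }
});

// What the page would do with its (replaced) XMLHttpRequest:
var pageSaw = null;
var req = new ProxyXHR();
req.open("GET", "/data.json");
req.onreadystatechange = function () {
    pageSaw = this.responseText;
};
req.send();
```

Note that a page which receives its data through onload or addEventListener instead of onreadystatechange would not pass through this particular setter.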
Writing a proxy for JSONP is even easier if you know the prefix (the function that is called with the loaded JSON, e.g. insert({"data":["Some", "JSON", "here"],"id":"asdasda"})). You can overwrite insert in the page context
after the page is loaded:
casper.start(url).then(function(){
    this.evaluate(function(){
        var oldInsert = insert;
        insert = function(json){
            window.myAwesomeResponse = json;
            oldInsert.apply(window, arguments);
        };
    });
}).waitFor(getAwesomeResponse, function then(){
    // the JSONP payload is passed to insert as an object, so no JSON.parse is needed
    var data = getAwesomeResponse.call(this);
    // Do something with data
}).run();
or before the request is received (if the function is registered just before the request is invoked):
casper.on("resource.requested", function(resource){
    // filter on the correct call
    if (resource.url.indexOf(".jsonp") != -1) {
        this.evaluate(function(){
            var oldInsert = insert;
            insert = function(json){
                window.myAwesomeResponse = json;
                oldInsert.apply(window, arguments);
            };
        });
    }
});
casper.start(url).waitFor(getAwesomeResponse, function then(){
    // the JSONP payload is passed to insert as an object, so no JSON.parse is needed
    var data = getAwesomeResponse.call(this);
    // Do something with data
}).run();
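The wrapping pattern used in both JSONP variants can also be tried outside the browser. The following sketch is illustrative only: wrapCallback is a hypothetical helper, page stands in for window, and sink stands in for window.myAwesomeResponse. It shows that the wrapper records the payload and still passes it on to the original callback.

```javascript
// Hypothetical helper: replaces scope[name] with a wrapper that records
// the first argument in sink.response and then calls the original function.
function wrapCallback(scope, name, sink) {
    var original = scope[name];
    scope[name] = function (json) {
        sink.response = json;
        return original.apply(scope, arguments);
    };
}

// Stand-ins for the page: `page` plays the role of window,
// `insert` is the JSONP callback the site defines.
var page = {
    rendered: null,
    insert: function (json) {
        page.rendered = json.data.join(" ");
    }
};
var sink = {};
wrapCallback(page, "insert", sink);

// This is what the loaded JSONP script would execute.
page.insert({ "data": ["Some", "JSON", "here"], "id": "asdasda" });
```

Because the payload reaches the callback as an already parsed object, the recorded value can be used directly.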