How can I catch and process the data from the XHR responses using casperjs?
Question
The data on the webpage is displayed dynamically, and checking for every change in the HTML to extract the data seems like a very daunting task that would also force me to use very unreliable XPaths. So I would like to extract the data from the XHR packets instead.
I hope to be able to extract information from XHR packets as well as generate XHR packets to be sent to the server. The extraction part is more important to me, because sending information can be handled easily by automatically triggering HTML elements using casperjs.
I'm attaching a screenshot of what I mean.
The text in the response tab is the data I need to process afterwards. (This XHR response has been received from the server.)
Recommended answer
This is not easily possible, because the resource.received event handler only provides metadata such as url, headers or status, but not the actual data. The underlying phantomjs event handler behaves the same way.
If the ajax call is stateless, you may repeat the request:
casper.on("resource.received", function(resource){
    // somehow identify this request, here: if it contains ".json"
    // it also only does something when the stage is "end", otherwise this would be executed twice
    if (resource.url.indexOf(".json") != -1 && resource.stage == "end") {
        var data = casper.evaluate(function(url){
            // synchronous GET request
            return __utils__.sendAJAX(url, "GET");
        }, resource.url);
        // do something with data, you might need to JSON.parse(data)
    }
});

casper.start(url); // your script
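The filter inside the handler above can be pulled out into a small predicate and tried without CasperJS. This is only a sketch: the name isFinishedJsonResource and the sample event objects are made up for illustration, and the objects carry just the url and stage fields that the handler reads.

```javascript
// Hypothetical helper: true only for the final ("end") event of a ".json" resource.
function isFinishedJsonResource(resource) {
    return resource.stage === "end" && resource.url.indexOf(".json") !== -1;
}

// Made-up events; as noted in the answer, a resource is reported once per stage.
var sampleEvents = [
    { url: "http://example.com/data.json", stage: "start" },
    { url: "http://example.com/data.json", stage: "end" },
    { url: "http://example.com/style.css", stage: "end" }
];

// Keep only the URLs whose request would be repeated.
var urlsToRefetch = sampleEvents
    .filter(isFinishedJsonResource)
    .map(function (resource) { return resource.url; });

console.log(urlsToRefetch); // only the ".json" URL, and only once
```

Keeping the check in a named function also makes it easy to tighten later, for example by matching on a path prefix instead of a substring.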
You may want to add the event listener to resource.requested instead. That way you don't need to wait for the call to complete.
You can also do this right inside the control flow like this (source: A: CasperJS waitForResource: how to get the resource i've waited for):
casper.start(url);

var res, resData;
casper.waitForResource(function check(resource){
    res = resource;
    return resource.url.indexOf(".json") != -1;
}, function then(){
    resData = casper.evaluate(function(url){
        // synchronous GET request
        return __utils__.sendAJAX(url, "GET");
    }, res.url);
    // do something with the data here or in a later step
});

casper.run();
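The "do something with the data" step usually begins with parsing the text that the repeated request returned. The sketch below assumes the endpoint answers with JSON text; parseAjaxResult is a hypothetical helper name, not part of CasperJS, and it returns a small result object instead of throwing so that a step can decide what to do with a bad response.

```javascript
// Hypothetical helper: parse the text returned by the repeated request.
// Returns { ok: true, data } on success and { ok: false, error } otherwise.
function parseAjaxResult(text) {
    if (typeof text !== "string" || text.length === 0) {
        return { ok: false, error: "empty response" };
    }
    try {
        return { ok: true, data: JSON.parse(text) };
    } catch (e) {
        return { ok: false, error: e.message };
    }
}

// Made-up response bodies for illustration.
var good = parseAjaxResult('{"items": [1, 2, 3]}');
var bad = parseAjaxResult("<html>Session expired</html>");

console.log(good.ok, good.data.items.length);
console.log(bad.ok);
```

Inside the then step this could be used as var result = parseAjaxResult(resData); followed by a check of result.ok. A non-JSON body is a common sign that the request was not stateless after all, for example because it depends on a session or a one-time token.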
Stateful AJAX Request
If it is not stateless, you would need to replace the implementation of XMLHttpRequest. You will need to inject your own implementation of the onreadystatechange handler, collect the information in the page's window object, and later retrieve it in another evaluate call.
You may want to look at the XHR faker in sinon.js or use the following complete proxy for XMLHttpRequest (I modeled it after method 3 from How can I create a XMLHttpRequest wrapper/proxy?):
function replaceXHR(){
    (function(window, debug){
        // format an arguments object for debug output
        function args(a){
            var s = "";
            for(var i = 0; i < a.length; i++) {
                s += "\n[" + i + "] => " + a[i];
            }
            return s;
        }
        var _XMLHttpRequest = window.XMLHttpRequest;

        window.XMLHttpRequest = function() {
            this.xhr = new _XMLHttpRequest();
        }

        // proxy ALL methods/properties
        var methods = [
            "open",
            "abort",
            "setRequestHeader",
            "send",
            "addEventListener",
            "removeEventListener",
            "getResponseHeader",
            "getAllResponseHeaders",
            "dispatchEvent",
            "overrideMimeType"
        ];
        methods.forEach(function(method){
            window.XMLHttpRequest.prototype[method] = function() {
                if (debug) console.log("ARGUMENTS", method, args(arguments));
                if (method == "open") {
                    // remember the URL so the response can be filtered later
                    this._url = arguments[1];
                }
                return this.xhr[method].apply(this.xhr, arguments);
            }
        });

        // proxy change event handler
        Object.defineProperty(window.XMLHttpRequest.prototype, "onreadystatechange", {
            get: function(){
                // this will probably never be called
                return this.xhr.onreadystatechange;
            },
            set: function(onreadystatechange){
                var that = this.xhr;
                var realThis = this;
                that.onreadystatechange = function(){
                    // request is fully loaded
                    if (that.readyState == 4) {
                        if (debug) console.log("RESPONSE RECEIVED:", typeof that.responseText == "string" ? that.responseText.length : "none");
                        // there is a response and filter execution based on url
                        if (that.responseText && realThis._url.indexOf("whatever") != -1) {
                            window.myAwesomeResponse = that.responseText;
                        }
                    }
                    onreadystatechange.call(that);
                };
            }
        });

        var otherscalars = [
            "onabort",
            "onerror",
            "onload",
            "onloadstart",
            "onloadend",
            "onprogress",
            "readyState",
            "responseText",
            "responseType",
            "responseXML",
            "status",
            "statusText",
            "upload",
            "withCredentials",
            "DONE",
            "UNSENT",
            "HEADERS_RECEIVED",
            "LOADING",
            "OPENED"
        ];
        otherscalars.forEach(function(scalar){
            Object.defineProperty(window.XMLHttpRequest.prototype, scalar, {
                get: function(){
                    return this.xhr[scalar];
                },
                set: function(obj){
                    this.xhr[scalar] = obj;
                }
            });
        });
    })(window, false);
}
If you want to capture the AJAX calls from the very beginning, you need to add this to one of the first event handlers
casper.on("page.initialized", function(resource){
    this.evaluate(replaceXHR);
});
or call evaluate(replaceXHR) when you need it.
The control flow would look like this:
function replaceXHR(){ /* from above */ }

casper.start(yourUrl, function(){
    this.evaluate(replaceXHR);
});

function getAwesomeResponse(){
    // use casper instead of this, because the function is also called directly below
    return casper.evaluate(function(){
        return window.myAwesomeResponse;
    });
}

// stops waiting if window.myAwesomeResponse is something that evaluates to true
casper.waitFor(getAwesomeResponse, function then(){
    var data = JSON.parse(getAwesomeResponse());
    // Do something with data
});

casper.run();
As described above, I create a proxy for XMLHttpRequest so that every time it is used on the page, I can do something with it. The page that you scrape uses the xhr.onreadystatechange callback to receive data. The proxying is done by defining a specific setter function which writes the received data to window.myAwesomeResponse in the page context. The only thing you need to do is retrieve this text.
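The setter-based interception described above can be tried outside a browser with a stand-in object. The following is a reduced sketch, not the full proxy: FakeXHR, page and captured are hypothetical names used only here, FakeXHR answers immediately with a made-up body, and only open, send and onreadystatechange are proxied.

```javascript
// Stand-in for the browser's XMLHttpRequest (hypothetical, for illustration only).
function FakeXHR() {
    this.readyState = 0;
    this.responseText = "";
    this.onreadystatechange = null;
}
FakeXHR.prototype.open = function (method, url) { this.url = url; };
FakeXHR.prototype.send = function () {
    // Pretend the server answered immediately with a made-up body.
    this.readyState = 4;
    this.responseText = '{"answer": 42}';
    if (this.onreadystatechange) this.onreadystatechange();
};

var page = { XMLHttpRequest: FakeXHR }; // plays the role of `window`

(function (window) {
    var RealXHR = window.XMLHttpRequest;
    function ProxyXHR() { this.xhr = new RealXHR(); }
    ProxyXHR.prototype.open = function (method, url) {
        this._url = url; // remember the URL for later filtering
        return this.xhr.open.apply(this.xhr, arguments);
    };
    ProxyXHR.prototype.send = function () {
        return this.xhr.send.apply(this.xhr, arguments);
    };
    // Assigning a handler goes through this setter, which wraps the handler.
    Object.defineProperty(ProxyXHR.prototype, "onreadystatechange", {
        set: function (handler) {
            var real = this.xhr, self = this;
            real.onreadystatechange = function () {
                if (real.readyState === 4 && real.responseText) {
                    window.captured = { url: self._url, body: real.responseText };
                }
                handler.call(real); // the page's own handler still runs
            };
        }
    });
    window.XMLHttpRequest = ProxyXHR;
})(page);

// "Page" code that is unaware of the proxy.
var seenByPage = null;
var req = new page.XMLHttpRequest();
req.open("GET", "/api/data.json");
req.onreadystatechange = function () { seenByPage = this.responseText; };
req.send();

console.log(page.captured.url, page.captured.body, seenByPage);
```

The page's handler receives the same response as before, while a copy is left on the window-like object for a later evaluate call to pick up.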
Writing a proxy for JSONP is even easier if you know the prefix (the function that is called with the loaded JSON, e.g. insert({"data":["Some", "JSON", "here"],"id":"asdasda"})). You can overwrite insert in the page context
after the page is loaded
casper.start(url).then(function(){
    this.evaluate(function(){
        var oldInsert = insert;
        insert = function(json){
            window.myAwesomeResponse = json;
            oldInsert.apply(window, arguments);
        };
    });
}).waitFor(getAwesomeResponse, function then(){
    // JSONP passes an already parsed object to insert(), so no JSON.parse is needed
    var data = getAwesomeResponse();
    // Do something with data
}).run();
or before the request is received (if the function is registered just before the request is invoked)
casper.on("resource.requested", function(resource){
    // filter on the correct call
    if (resource.url.indexOf(".jsonp") != -1) {
        this.evaluate(function(){
            var oldInsert = insert;
            insert = function(json){
                window.myAwesomeResponse = json;
                oldInsert.apply(window, arguments);
            };
        });
    }
});

casper.start(url).waitFor(getAwesomeResponse, function then(){
    // JSONP passes an already parsed object to insert(), so no JSON.parse is needed
    var data = getAwesomeResponse();
    // Do something with data
}).run();
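The callback wrapping used in both JSONP variants can be checked in isolation. This is a sketch under assumptions: page stands in for the browser window, wrapCallback is a hypothetical helper, and the insert function here is a made-up renderer rather than the real page's code.

```javascript
// Stand-in for the page: insert() is the JSONP callback the site defines.
var page = { rendered: null };
page.insert = function (json) {
    page.rendered = json.data.join(" ");
};

// Hypothetical helper: replace scope[name] with a wrapper that reports the
// first argument to onCall and then forwards the call to the original.
function wrapCallback(scope, name, onCall) {
    var original = scope[name];
    scope[name] = function () {
        onCall(arguments[0]);
        return original.apply(scope, arguments);
    };
}

var captured = null;
wrapCallback(page, "insert", function (json) { captured = json; });

// What the loaded JSONP script would effectively execute:
page.insert({ data: ["Some", "JSON", "here"], id: "asdasda" });

console.log(captured.id, page.rendered);
```

Because the original callback is still invoked with the same arguments, the page keeps working while the payload is recorded as an object, not as text.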