Scraping multiple URLs by looping in PhantomJS
Problem description
I am using PhantomJS to scrape some websites and then extract information with R. I am following this tutorial. Everything works fine for a single page, but I couldn't find a simple tutorial on how to automate this for multiple pages.
My experiments so far:
var countries = ["Albania", "Afghanistan"];
var len = countries.length;
var name1 = ".html";
var add1 = "http://www.kluwerarbitration.com/CommonUI/BITs.aspx?country=";
var country = "";
var name = "";
var add = "";
for (i=1; i <= len; i++){
    country = countries[i]
    name = country.concat(name1)
    add = add1.concat(name1)
    var webPage = require('webpage');
    var page = webPage.create();
    var fs = require('fs');
    var path = name
    page.open(add, function (status) {
        var content = page.content;
        fs.write(path,content,'w')
        phantom.exit();
    });
}
I don't get any errors when running the code, but the script creates an HTML file only for the second country. That file contains all the information on the page except the small table I am interested in.
I tried to gather information from similar questions. However, partly because I couldn't find a simple reproducible example, I don't understand what I am doing wrong.
Solution
The main problem seems to be that you're exiting too early. You're creating multiple page instances in a loop. Since page.open() is asynchronous, the call returns immediately and the next iteration of the for loop is executed.
A for loop is fast, but web requests are slow. This means that your loop has run to completion before even the first page has loaded. It also means that the first page to finish loading will exit PhantomJS, because you're calling phantom.exit() in each of the page.open() callbacks. A likely reason that only the second country is written is that the loop in the question starts at i = 1: arrays are zero-indexed, so countries[0] ("Albania") is skipped and countries[len] is undefined.
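The ordering described above can be reproduced without PhantomJS. The following is a minimal sketch in plain JavaScript; `fakeOpen` and `runPending` are hypothetical stand-ins for `page.open()` and the event loop, not part of any PhantomJS API.

```javascript
// Hypothetical stand-ins: fakeOpen() mimics an asynchronous page.open() by
// queueing its callback; runPending() plays the role of the event loop.
var pending = [];
var log = [];

function fakeOpen(url, callback) {
    log.push('requested ' + url);
    pending.push(function () { callback('success'); });
}

function runPending() {
    while (pending.length > 0) {
        pending.shift()();
    }
}

var urls = ['page-a', 'page-b'];
for (var i = 0; i < urls.length; i++) {
    fakeOpen(urls[i], function (status) {
        // By the time any callback runs, the loop has already finished,
        // so i equals urls.length here.
        log.push('callback ran with i = ' + i);
    });
}
log.push('loop finished');

runPending();
console.log(log.join('\n'));
```

Both requests are issued and the loop ends before either callback runs, which is why an exit call placed inside the first callback would cut off every other page.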
Count the finished callbacks and exit only once all of them have run:
var countFinished = 0,
    maxFinished = len;

// Exit only after every page.open() callback has reported in.
function checkFinish() {
    countFinished++;
    if (countFinished === maxFinished) {
        phantom.exit();
    }
}

var webPage = require('webpage');
var fs = require('fs');

// Arrays are zero-indexed, so iterate from 0 to len - 1.
for (var i = 0; i < len; i++) {
    country = countries[i];
    name = country.concat(name1);
    add = add1.concat(country); // append the country, not ".html", to the base URL
    var page = webPage.create();
    var path = name;

    page.open(add, function (status) {
        // Caution: page and path are shared by all iterations here; see below.
        var content = page.content;
        fs.write(path, content, 'w');
        checkFinish();
    });
}
A second problem is that you're creating a lot of page instances without cleaning up. You should close them when you're done with them:
// webPage, fs and checkFinish() are defined as above.
for (var i = 0; i < len; i++) {
    (function (i) {
        country = countries[i];
        name = country.concat(name1);
        add = add1.concat(country);
        var page = webPage.create(); // local to this IIFE call
        var path = name;             // local to this IIFE call

        page.open(add, function (status) {
            var content = page.content;
            fs.write(path, content, 'w');
            page.close();
            checkFinish();
        });
    })(i);
}
Since variables declared with var in JavaScript have function-level scope, you need an IIFE to retain a reference to the correct page instance (and output path) in the page.open() callback. See this question for more information: Q: JavaScript closure inside loops – simple practical example
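The effect of function-level scope can be seen in isolation with a small sketch in plain JavaScript. The helper names `withoutIife` and `withIife` are made up for this illustration; the callbacks simply return the path they captured instead of writing a file.

```javascript
// Without an IIFE: every callback shares the single function-scoped `path`,
// so all of them see the value from the last iteration.
function withoutIife(names) {
    var callbacks = [];
    for (var i = 0; i < names.length; i++) {
        var path = names[i] + '.html';
        callbacks.push(function () { return path; });
    }
    return callbacks.map(function (cb) { return cb(); });
}

// With an IIFE: each iteration runs in its own function call,
// so each callback captures its own `path`.
function withIife(names) {
    var callbacks = [];
    for (var i = 0; i < names.length; i++) {
        (function (i) {
            var path = names[i] + '.html';
            callbacks.push(function () { return path; });
        })(i);
    }
    return callbacks.map(function (cb) { return cb(); });
}

var names = ['Albania', 'Afghanistan'];
console.log(withoutIife(names)); // both entries are 'Afghanistan.html'
console.log(withIife(names));    // 'Albania.html', then 'Afghanistan.html'
```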
If you don't want to clean up afterwards, then you should use the same page instance for all of those URLs. I already have an answer about doing that here: A: Looping over urls to do the same thing
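One possible shape of the single-instance approach is to start each request only after the previous one has finished. The sketch below is an assumption about how that could look, not necessarily what the linked answer does. The function `scrapeSequentially` and its parameters are hypothetical, and the page opener is passed in as an argument so the control flow can be tried outside PhantomJS.

```javascript
// Hypothetical helper: visits the URLs one after another.
// openPage(url, callback) is injected; onEach runs after every page,
// onDone runs once after the last one.
function scrapeSequentially(urls, openPage, onEach, onDone) {
    function next(index) {
        if (index >= urls.length) {
            onDone();
            return;
        }
        openPage(urls[index], function (status) {
            onEach(urls[index], status);
            next(index + 1); // start the next request only after this one finished
        });
    }
    next(0);
}

// Stand-in opener (an assumption for this demo) that reports success immediately.
var visited = [];
var finished = false;
scrapeSequentially(
    ['Albania', 'Afghanistan'],
    function (url, callback) { callback('success'); },
    function (url, status) { visited.push(url + ':' + status); },
    function () { finished = true; }
);
console.log(visited, finished);
```

In a PhantomJS script, the opener could be `function (url, cb) { page.open(url, cb); }` with a single `page` from `require('webpage').create()`, `onEach` could save `page.content` with `fs.write()`, and `onDone` could call `phantom.exit()`.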