Cheerio doesn't wait for body to load
Question
I wrote a very simple script that scrapes a recipe website to get the title, preparation time, and ingredients. Everything works fine, except that the script doesn't manage to scrape every page in my array. Sometimes I get 4 of them, sometimes 2, sometimes even 0...
It seems that the script doesn't wait for the body to be fully loaded. I'm fully aware that cheerio doesn't run the JavaScript on a website, but as far as I know the information I scrape isn't generated by any script; it is pure HTML.
How can I ask cheerio to wait 1 second when a page is visited, or simply to wait for the HTML to be fully loaded?
Here is my code (it works, so you can try it), and an example of the output:
pools = [
    "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
    "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
    "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
    "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
    "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"]
var request = require("request");
var cheerio = require("cheerio");
var poolsLength = pools.length;
for (var i = 0 ; i < pools.length ; i++) {
    var url = pools[i];
    request(url, function (error, response, body) {
        if (!error) {
            var $ = cheerio.load(body, {
                ignoreWhitespace: true
            });
            var name = [];
            var address = [];
            var website = [];
            $('body').each(function(i, elem){
                name = $(elem).find('.fn').text();
                address = $(elem).find('.preptime').text();
                website = $(elem).find('.m_content_recette_ingredients').text();
                console.log(name+"±"+address+"±"+website);
            });
        }
    });
}
As you can see above, it only worked for 2 of the 5 pages.
Recommended answer
You can try the following code. It uses setTimeout to stagger the requests, so each page is requested 10 seconds after the previous one instead of all at once.
var pools = [
    "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
    "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
    "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
    "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
    "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"];
var request = require("request");
var cheerio = require("cheerio");
var interval = 10 * 1000; // 10 seconds between requests

for (var i = 0; i < pools.length; i++) {
    // Pass the URL as an extra setTimeout argument so each callback gets its
    // own copy; reading a loop-level `var url` later would always see the last URL.
    setTimeout(function (url) {
        request(url, function (error, response, body) {
            if (error) {
                // Report failures instead of skipping them silently.
                console.error("Request failed for " + url + ": " + error.message);
                return;
            }
            var $ = cheerio.load(body, {
                ignoreWhitespace: true
            });
            $('body').each(function (i, elem) {
                var name = $(elem).find('.fn').text();
                var address = $(elem).find('.preptime').text();
                var website = $(elem).find('.m_content_recette_ingredients').text();
                console.log(name + "±" + address + "±" + website);
            });
        });
    }, interval * i, pools[i]); // the i-th URL is requested after i * interval ms
}
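One thing to watch when delaying work inside a for loop: a variable declared with var is shared by every callback created in the loop, so a callback that runs later sees the value from the last iteration. The sketch below illustrates this without timers or network access by storing the callbacks and running them after the loop has finished. The function names collectWithSharedVar and collectWithOwnBinding are hypothetical and exist only for this illustration.

```javascript
// Hypothetical helper: every callback closes over the same `url` variable,
// because `var` is function-scoped rather than scoped to the loop body.
function collectWithSharedVar(urls) {
    var pending = [];
    for (var i = 0; i < urls.length; i++) {
        var url = urls[i];
        pending.push(function () { return url; });
    }
    // The callbacks run only after the loop has finished, as a timer would.
    return pending.map(function (fn) { return fn(); });
}

// Hypothetical helper: forEach calls its function once per element,
// so each callback closes over its own `url` parameter.
function collectWithOwnBinding(urls) {
    var pending = [];
    urls.forEach(function (url) {
        pending.push(function () { return url; });
    });
    return pending.map(function (fn) { return fn(); });
}

console.log(collectWithSharedVar(["a", "b", "c"]));  // [ 'c', 'c', 'c' ]
console.log(collectWithOwnBinding(["a", "b", "c"])); // [ 'a', 'b', 'c' ]
```

The same effect can be had by declaring the loop variable with let, or by passing the value as an extra argument to setTimeout, which hands it to the callback when the timer fires.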