Cheerio doesn't wait for body to load


Problem description

I made a very simple script that scrapes a recipe website to get the title, the preparation time and the ingredients. Everything works fine, except that the script does not manage to scrape every page in my array. Sometimes I get 4 of them, sometimes 2, sometimes even 0.

It seems that the script doesn't wait for the body to be fully loaded. I'm fully aware that cheerio doesn't execute the JavaScript on a website, but as far as I know the information I scrape isn't generated by any script; it is pure HTML.

How can I ask cheerio to wait 1 second when a page is visited, or simply to wait for the HTML to be fully loaded?

Here is my code. It runs, so you can try it, along with an example of the output:

    var pools = [
        "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
        "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
        "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
        "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
        "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"
    ];

    var request = require("request");
    var cheerio = require("cheerio");
    var poolsLength = pools.length;

    for (var i = 0; i < pools.length; i++) {
        var url = pools[i];
        request(url, function (error, response, body) {
            if (!error) {
                var $ = cheerio.load(body, {
                    ignoreWhitespace: true
                });
                var name = [];
                var address = [];
                var website = [];

                $('body').each(function (i, elem) {
                    name = $(elem).find('.fn').text();
                    address = $(elem).find('.preptime').text();
                    website = $(elem).find('.m_content_recette_ingredients').text();
                    console.log(name + "±" + address + "±" + website);
                });
            }
        });
    }

As you can see above, it only worked for 2 of the 5 pages.

Recommended answer

You can try the following code. The setTimeout delays each request, so the pages are fetched 10 seconds apart instead of all at once.

    var pools = [
        "http://www.marmiton.org/recettes/recette_salade-de-betteraves-a-l-orientale_16831.aspx",
        "http://www.marmiton.org/recettes/recette_pain-d-epices-a-la-dijonnaise_16832.aspx",
        "http://www.marmiton.org/recettes/recette_tarte-au-chocolat-et-creme-moka_16834.aspx",
        "http://www.marmiton.org/recettes/recette_poulet-a-la-gaston-gerard_16836.aspx",
        "http://www.marmiton.org/recettes/recette_assiette-paula_16837.aspx"
    ];

    var request = require("request");
    var cheerio = require("cheerio");
    var interval = 10 * 1000; // 10 seconds between requests

    for (var i = 0; i < pools.length; i++) {
        // Pass the URL as an extra setTimeout argument so that each callback
        // receives its own URL. A `var` read from inside the callback would
        // hold the last URL by the time the timer fires.
        setTimeout(function (url) {
            request(url, function (error, response, body) {
                if (!error) {
                    var $ = cheerio.load(body, {
                        ignoreWhitespace: true
                    });
                    var name = [];
                    var address = [];
                    var website = [];

                    $('body').each(function (i, elem) {
                        name = $(elem).find('.fn').text();
                        address = $(elem).find('.preptime').text();
                        website = $(elem).find('.m_content_recette_ingredients').text();
                        console.log(name + "±" + address + "±" + website);
                    });
                }
            });
        }, interval * i, pools[i]); // page i starts after i * 10 seconds
    }
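
Note that request only calls its callback once the whole response body has arrived, so cheerio always parses the complete HTML the server sent. Pages that go missing are therefore more likely requests that failed or were answered with a non-200 status, and the `if (!error)` check skips those without printing anything. The following is a minimal sketch, not a definitive implementation, of fetching the pages strictly one after another and recording why a page failed. The names `sleep`, `scrapeSequentially` and `fetchWithRequest` are hypothetical helpers introduced for illustration, and the `{ statusCode, body }` result shape is an assumption of this sketch.

```javascript
// Hypothetical helper: resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Hypothetical helper: fetch the URLs one at a time, pausing `delayMs`
// between requests. `fetchPage` is assumed to be a function
// (url) => Promise<{ statusCode, body }>.
async function scrapeSequentially(urls, fetchPage, delayMs) {
  var results = [];
  for (var i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs);
    }
    try {
      var res = await fetchPage(urls[i]);
      if (res.statusCode !== 200) {
        // Record the status instead of silently dropping the page.
        results.push({ url: urls[i], ok: false, reason: "HTTP " + res.statusCode });
      } else {
        results.push({ url: urls[i], ok: true, body: res.body });
      }
    } catch (err) {
      // Network-level failure (timeout, connection reset, ...).
      results.push({ url: urls[i], ok: false, reason: err.message });
    }
  }
  return results;
}

// Hypothetical adapter around the `request` package used above. The module
// is loaded lazily, so the rest of this sketch runs without it installed.
function fetchWithRequest(url) {
  return new Promise(function (resolve, reject) {
    require("request")(url, function (error, response, body) {
      if (error) {
        reject(error);
      } else {
        resolve({ statusCode: response.statusCode, body: body });
      }
    });
  });
}
```

To use it with the array above, call `scrapeSequentially(pools, fetchWithRequest, 1000)`, then pass each successful `body` to `cheerio.load` and log the `reason` of every failed entry. If the reasons show timeouts or error statuses, the site is probably rejecting requests that arrive too quickly, which would explain why spacing them out helps.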
