How can I rewrite this with promises?

Problem description

I am building a content scraper for a tshirt website.

The goal is to enter a website through only one hardcoded url: http://shirts4mike.com

I will then find all the product pages for each tshirt, create an object with its details, and add it to an array.

When the array is full of tshirts, I'll work through the array and log it into a CSV file.

Right now, I am having some trouble with the timing of the requests/responses and the function calls.

How can I make sure that I call the NEXT function at the right time? I understand that it's not working because of its async nature.

How can I call secondScrape, lastScraper and convertJson2Csv at the right time so that the variables they're working with are not undefined?

I tried to use something such as response.end(), but this is not working.

I'm assuming I NEED to use promises to make this work properly, and to keep it legible?

Any ideas? My code is below:

//Modules being used:
var cheerio = require('cheerio');
var request = require('request');
var moment = require('moment');

//hardcoded url
var url = 'http://shirts4mike.com/';

//url for tshirt pages
var urlSet = new Set();

var remainder;
var tshirtArray;


// Load front page of shirts4mike
request(url, function(error, response, html) {
    if(!error && response.statusCode == 200){
        var $ = cheerio.load(html);

    //iterate over links with 'shirt'
        $("a[href*=shirt]").each(function(){
            var a = $(this).attr('href');

            //create new link
            var scrapeLink = url + a;

            //for each new link, go in and find out if there is a submit button. 
            //If there, add it to the set
            request(scrapeLink, function(error,response, html){
                if(!error && response.statusCode == 200) {
                    var $ = cheerio.load(html);

                    //if page has a submit it must be a product page
                    if($('[type=submit]').length !== 0){

                        //add page to set
                        urlSet.add(scrapeLink);

                    } else if(remainder === undefined) {
                        //if not a product page, add it to remainder so that another scrape can be performed.
                        remainder = scrapeLink;                     
                    }
                }
            });
        });     
    }
    //call second scrape for remainder
    secondScrape();
});


function secondScrape() {
    request(remainder, function(error, response, html) {
        if(!error && response.statusCode == 200){
            var $ = cheerio.load(html);

            $("a[href*=shirt]").each(function(){
                var a = $(this).attr('href');

                //create new link
                var scrapeLink = url + a;

                request(scrapeLink, function(error,response, html){
                    if(!error && response.statusCode == 200){

                        var $ = cheerio.load(html);

                        //collect remaining product pages and add to set
                        if($('[type=submit]').length !== 0){
                            urlSet.add(scrapeLink);
                        }
                    }
                });
            });     
        }
    });
    console.log(urlSet);
    //call lastScraper so we can grab data from the set (product pages)
    lastScraper();
};



function lastScraper(){
    //scrape set, product pages
    for(var i = 0; i < urlSet.length; i++){
        var url = urlSet[i];

        request(url, function(error, response, html){
            if(!error && response.statusCode == 200){
                var $ = cheerio.load(html);

                //grab data and store as variables
                var price = $('.price').text();
                var img = $('.shirt-picture').find("img").attr("src");
                var title = $('body').find(".shirt-details > h1").text().slice(4);

                var tshirtObject = {};
                //add values into tshirt object

                tshirtObject.price = price;
                tshirtObject.img = img;
                tshirtObject.title = title;
                tshirtObject.url = url;
                tshirtObject.date = moment().format('MMMM Do YYYY, h:mm:ss a');

                //add the object into the array of tshirts
                tshirtArray.push(tshirtObject); 
            }
        });
    }
    //call function to iterate through tshirt objects in array in order to convert to JSON, then into CSV to be logged
    convertJson2Csv();
};

Answer

You correctly identify promises as the way to solve your timing issues.

In order to have promises available, you need to promisify request (or adopt an HTTP library whose methods return promises).
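For illustration, here is a minimal sketch of what promisification does. `fakeRequest` is a hypothetical stand-in for request's `(error, response, body)` callback style; bluebird's `Promise.promisify` automates exactly this kind of wrapping, so you wouldn't write it by hand.

```javascript
// Hedged sketch: `fakeRequest` is a hypothetical stand-in for the
// request module's (error, response, body) callback signature.
function fakeRequest(url, callback) {
    // succeed asynchronously with a dummy response and body
    setTimeout(function() {
        callback(null, { statusCode: 200 }, '<html>' + url + '</html>');
    }, 0);
}

// Hand-rolled promisification: turn the callback API into a promise API.
function promisify(fn) {
    return function(url) {
        return new Promise(function(resolve, reject) {
            fn(url, function(error, response, body) {
                if (error) {
                    reject(error);
                } else {
                    resolve({ response: response, body: body });
                }
            });
        });
    };
}

var fakeRequestAsync = promisify(fakeRequest);

fakeRequestAsync('http://shirts4mike.com/').then(function(result) {
    console.log(result.response.statusCode);
});
```

The point is only the shape of the transformation: once the function returns a promise, ordering work with `.then()` replaces guessing at callback timing.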

You could just fix the timing issues with promises, but you could also take the opportunity to improve the overall paradigm. Instead of discrete functions for virtually identical first/second/third stages, you can write a single function that calls itself recursively. Written correctly, this will ensure that each page on the target site is visited at most once; revisits should be avoided for the sake of overall performance and of load on the target server.

//Modules being used:
var Promise = require('path/to/bluebird');
var cheerio = require('cheerio');
var moment = require('moment');

// Promisify `request` to make `request.getAsync()` available.
// Ref: http://stackoverflow.com/questions/28308131/how-do-you-properly-promisify-request
var request = Promise.promisify(require('request'));
Promise.promisifyAll(request);

//hardcoded url
var url = 'http://shirts4mike.com/';

var urlSet = new Set();
var tshirtArray = [];

var maxLevels = 3; // limit the recursion to this number of levels.

function scrapePage(url_, levelCounter) {
    // Bale out if :
    //   a) the target url_ has been visited already,
    //   b) maxLevels has been reached.
    if(urlSet.has(url_) || levelCounter >= maxLevels) {
        return Promise.resolve();
    }
    urlSet.add(url_);

    // getAsync resolves with the response object only; the body is on response.body.
    return request.getAsync(url_).then(function(response) {
        var html = response.body;
        var $;
        if(response.statusCode !== 200) {
            throw new Error('statusCode was not 200'); // will be caught below
        }
        $ = cheerio.load(html);
        if($('[type=submit]').length > 0) {
            // yay, it's a product page.
            tshirtArray.push({
                price: $('.price').text(),
                img: $('.shirt-picture').find("img").attr("src"),
                title: $('body').find(".shirt-details > h1").text().slice(4),
                url: url_,
                date: moment().format('MMMM Do YYYY, h:mm:ss a')
            });
        }
        // find any shirt links on the page represented by $, visit each in turn, and scrape.
        // cheerio's .map() callback receives (index, element), and the hrefs are relative.
        return Promise.all($("a[href*=shirt]").map(function(i, el) {
            return scrapePage(url + $(el).attr('href'), levelCounter + 1);
        }).get());
    }).catch(function(e) {
        // ensure "success" even if scraping threw an error.
        console.log(e);
        return null;
    });
}

scrapePage(url, 0).then(convertJson2Csv);

As you can see, a recursive solution :

  • avoids code duplication,
  • will drill down as many levels as you require, as determined by the variable maxLevels.

Note: This is still not a good solution. There's an implicit assumption here, as in the original code, that all shirt pages are reachable from the site's home page via "shirt" links alone. If shirts were reachable via e.g. "clothing" > "shirts", then the code above won't find any shirts.
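One way to relax that assumption, sketched under the same structure as the answer above: follow every same-site link instead of only hrefs containing "shirt", and let the `urlSet` / `maxLevels` guards in `scrapePage` bound the crawl. The helper below is hypothetical; it would filter the hrefs collected from a broader `$('a[href]')` selector before the recursive `scrapePage` calls.

```javascript
// Hedged sketch: keep every same-site link (so "clothing" > "shirts"
// style paths are still crawled) instead of only "shirt" links.
function internalLinks(hrefs) {
    return hrefs.filter(function(href) {
        // drop absolute (likely off-site) URLs and fragment-only links;
        // assumes internal links on the site are relative, as in the question
        return !/^(https?:)?\/\//.test(href) && href.charAt(0) !== '#';
    });
}
```

Each surviving href would then be recursed on as `scrapePage(url + href, levelCounter + 1)`, exactly as the "shirt" links are.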
