How to make multiple HTTP requests from a Google Cloud Function (Cheerio, Node.js)


Problem description

My problem:

I'm building a web scraper with Cheerio, Node.js, and Google Cloud Functions.

The problem is that I need to make multiple requests, then write data from each request to a Firestore database before calling response.send() and thereby terminating the function.

My code requires two loops. The first loops over URLs from my db, with each one making a separate request. The second uses Cheerio's .each to scrape multiple rows of table data from the DOM and makes a separate write for each row.

What I've tried:

I've tried pushing each request to an array of promises and then waiting for all the promises to resolve with Promise.all() before calling res.send(), but I'm still a little shaky on promises and not sure that is the right approach. (I have gotten the code to work for smaller datasets that way, but still inconsistently.)

I also tried creating each request as a new promise and using async/await to await each function call from the forEach loop, to allow time for each request and write to fully finish so I could call res.send() afterward, but I found out that forEach doesn't support async/await.
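To illustrate the forEach behavior described above: Array.prototype.forEach ignores the promise returned by an async callback, so the code after the loop runs before the work has finished. The following is only a minimal sketch. fakeRequest, withForEach, and withForOf are hypothetical names, and fakeRequest stands in for a real HTTP call so the example runs without any dependencies.

```javascript
// Hypothetical stand-in for an HTTP request: resolves after a short delay.
function fakeRequest(url) {
  return new Promise((resolve) => setTimeout(() => resolve(`body of ${url}`), 10));
}

// forEach discards the promise returned by each async callback,
// so this function returns before any request has finished.
async function withForEach(urls) {
  const results = [];
  urls.forEach(async (url) => {
    results.push(await fakeRequest(url));
  });
  return results.length; // the requests are still pending at this point
}

// for...of runs inside the async function itself, so each await is honored.
async function withForOf(urls) {
  const results = [];
  for (const url of urls) {
    results.push(await fakeRequest(url));
  }
  return results.length;
}
```

The for...of version runs the requests one after another. Mapping the URLs to promises and awaiting Promise.all() on the resulting array would wait for them all while letting them run concurrently.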

I tried to get around that with the p-iteration module, but because it's not actually the array forEach but rather a method on the query (doc.forEach()), I don't think it works like that.

Here is my code.

Note: As mentioned, this is not everything I tried (I removed my promise attempts), but it should show what I am trying to accomplish.

export const getCurrentLogs = functions.https.onRequest((req, response) => {


//First, I make a query from my db to get the urls 
// that I want the webscrapper to loop through. 

const ref = scheduleRef.get()

.then((snapshot) => {

    snapshot.docs.forEach((doc) => {

        const scheduleGame = doc.data()
        const boxScoreUrl = scheduleGame.boxScoreURL

//Inside the forEach I call the request 
// as a function with the url passed in

        updatePlayerLogs("https://" + boxScoreUrl + "/");


    });

})

.catch(err => {
    console.log('Error getting schedule', err);
});


function updatePlayerLogs (url){


//Here I'm not sure on how to set these options 
// to make sure the request stays open but I have tried 
// lots of different things. 

    const options = {
        uri: url,
        Connection: 'keep-alive',
        transform: function (body) {
            return cheerio.load(body);
        }
    };

   request(options)

        .then(($) => {


//Below I loop through some table data 
// on the dom with cheerio. Every loop 
// in here needs to be written to firebase individually. 


                $('.stats-rows').find('tbody').children('tr').each(function(i, element){


                    const playerPage = $(element).children('td').eq(0).find('a').attr('href');


                    const pts = replaceDash($(element).children('td').eq(1).text());
                    const reb =  replaceDash($(element).children('td').eq(2).text());
                    const ast =  replaceDash($(element).children('td').eq(3).text());
                    const fg =  replaceDash($(element).children('td').eq(4).text());
                    const _3pt =  replaceDash($(element).children('td').eq(5).text());
                    const stl =  replaceDash($(element).children('td').eq(9).text());
                    const blk =  replaceDash($(element).children('td').eq(10).text());
                    const to =  replaceDash($(element).children('td').eq(11).text());


                    const currentLog = {
                        'pts': + pts,
                        'reb': + reb,
                        'ast': + ast,
                        'fg':  fgPer,
                        '3pt': + _3ptMade,
                        'stl': + stl,
                        'blk':  + blk,
                        'to':  + to
                    }

                   //here is the write
                    playersRef.doc(playerPage).update({

                        'currentLog': currentLog

                    }) 
                    .catch(error => 
                        console.error("Error adding document: ", error + " : " + url)
                     );
                });

            })

        .catch((err) => {
            console.log(err); 
        });

    };

//Here I call response.send() to finish the function. 
// I have tried doing this lots of different ways but 
// whatever I try the response is being sent before all 
// docs are written.

   response.send("finished writing logs")

});

Everything I have tried results in either a deadline exceeded error (possibly because of quota limits, which I have looked into, though I don't think I should be exceeding them) or some unexplained error where the code doesn't finish executing but shows me nothing in the logs.

Please help: is there a way to use async/await in this scenario that I am not understanding? Is there a way to use promises to make this elegant?

Many thanks,

Recommended answer

Maybe have a look at something like this. It uses Bluebird promises and the request-promise library.

const Promise = require('bluebird');
const rp = require('request-promise');

const urlList = ['http://www.google.com', 'http://example.com'];

// Fetch every URL, running at most 10 requests at a time.
async function getList() {
  await Promise.map(urlList, (url) => {
    return rp(url)
      .then((response) => {
        console.log(`\n\n\n${url}:\n${response}`);
      })
      .catch((err) => {
        // Log and continue so one failed request does not reject the whole batch.
        console.log(err);
      });
  }, {
    concurrency: 10
  }); // end Promise.map
}

getList();
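Applied to the function in the question, the key point is probably that every request and every write has to return its promise, and the response should be sent only after all of them have settled. The sketch below is not the original code. fetchRows, writeRow, and handler are hypothetical stand-ins for the request, the Firestore update, and the HTTPS handler, and the URLs are placeholders, so it runs without network or database access.

```javascript
// Hypothetical stand-ins so the sketch runs without network or Firestore access.
const written = [];
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Pretends to scrape a page and return its table rows.
async function fetchRows(url) {
  await delay(10);
  return [`${url}#row1`, `${url}#row2`];
}

// Pretends to write one row to the database.
async function writeRow(row) {
  await delay(5);
  written.push(row);
}

// Scrape one page, then wait for all of its row writes to finish.
async function updatePlayerLogs(url) {
  const rows = await fetchRows(url);
  await Promise.all(rows.map((row) => writeRow(row)));
}

// Handler shape assumed to mirror an HTTPS function: (req, response).
async function handler(req, response) {
  try {
    const urls = ['a.example', 'b.example']; // would come from the schedule query
    await Promise.all(urls.map((url) => updatePlayerLogs(url)));
    // Reached only after every request and every write has completed.
    response.send(`finished writing ${written.length} logs`);
  } catch (err) {
    response.send(`failed: ${err.message}`);
  }
}
```

With many URLs, swapping the outer Promise.all() for Bluebird's Promise.map with a concurrency option, as in the answer above, would limit how many requests are in flight at once.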
