How to make multiple HTTP requests from a Google Cloud Function (Cheerio, Node.js)
Question
My problem:
I'm building a web-scraper with Cheerio, Node.js, and Google Cloud Functions.
The problem is I need to make multiple requests, then write data from each request to a Firestore database before calling response.send() and thereby terminating the function.
My code requires two loops: the first loop is with urls from my db, with each one making a separate request. The second loop is with Cheerio using .each to scrape multiple rows of table data from the DOM and make a separate write for each row.
What I've tried:
I've tried pushing each request to an array of promises and then waiting for all the promises to resolve with Promise.all() before calling res.send(), but I'm still a little shaky on promises and not sure that is the right approach. (I have gotten the code to work for smaller datasets that way, but still inconsistently.)
I also tried creating each request as a new promise and using async/await to await each function call from the forEach loop to allow time for each request and write to fully finish so I could call res.send() afterward, but I found out that forEach doesn't support Async/await.
I tried to get around that with the p-iteration module, but because it's not actually forEach but rather a method on the query (doc.forEach()), I don't think it works like that.
Here's my code.
Note:
As mentioned, this is not everything I tried (I removed my promise attempts), but this should show what I am trying to accomplish.
export const getCurrentLogs = functions.https.onRequest((req, response) => {
  //First, I make a query from my db to get the urls
  // that I want the webscraper to loop through.
  const ref = scheduleRef.get()
    .then((snapshot) => {
      snapshot.docs.forEach((doc) => {
        const scheduleGame = doc.data()
        const boxScoreUrl = scheduleGame.boxScoreURL
        //Inside the forEach I call the request
        // as a function with the url passed in
        updatePlayerLogs("https://" + boxScoreUrl + "/");
      });
    })
    .catch(err => {
      console.log('Error getting schedule', err);
    });

  function updatePlayerLogs (url){
    //Here I'm not sure on how to set these options
    // to make sure the request stays open but I have tried
    // lots of different things.
    const options = {
      uri: url,
      Connection: 'keep-alive',
      transform: function (body) {
        return cheerio.load(body);
      }
    };
    request(options)
      .then(($) => {
        //Below I loop through some table data
        // on the dom with cheerio. Every loop
        // in here needs to be written to firebase individually.
        $('.stats-rows').find('tbody').children('tr').each(function(i, element){
          const playerPage = $(element).children('td').eq(0).find('a').attr('href');
          const pts = replaceDash($(element).children('td').eq(1).text());
          const reb = replaceDash($(element).children('td').eq(2).text());
          const ast = replaceDash($(element).children('td').eq(3).text());
          const fg = replaceDash($(element).children('td').eq(4).text());
          const _3pt = replaceDash($(element).children('td').eq(5).text());
          const stl = replaceDash($(element).children('td').eq(9).text());
          const blk = replaceDash($(element).children('td').eq(10).text());
          const to = replaceDash($(element).children('td').eq(11).text());
          const currentLog = {
            'pts': + pts,
            'reb': + reb,
            'ast': + ast,
            'fg': fgPer,
            '3pt': + _3ptMade,
            'stl': + stl,
            'blk': + blk,
            'to': + to
          }
          //here is the write
          playersRef.doc(playerPage).update({
            'currentLog': currentLog
          })
          .catch(error =>
            console.error("Error adding document: ", error + " : " + url)
          );
        });
      })
      .catch((err) => {
        console.log(err);
      });
  };

  //Here I call response.send() to finish the function.
  // I have tried doing this lots of different ways but
  // whatever I try the response is being sent before all
  // docs are written.
  response.send("finished writing logs")
});
Everything I have tried results in either a deadline exceeded error (possibly because of quota limits, which I have looked into, but I don't think I should be exceeding them) or some unexplained error where the code doesn't finish executing but shows me nothing in the logs.
Please help, is there a way to use async/await in this scenario that I am not understanding? Is there a way to use promises to make this elegant?
Thanks very much,
Recommended answer
Maybe have a look at something like this. It uses Bluebird promises and the request-promise library
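const Promise = require('bluebird');
const rp = require('request-promise');
const urlList = ['http://www.google.com', 'http://example.com']

async function getList() {
  // Promise.map resolves only after the promise returned for every url has settled.
  await Promise.map(urlList, (url, index, length) => {
    return rp(url)
      .then((response) => {
        console.log(`${'\n'}${url}:${'\n'}${response}`);
        return;
      }).catch(async (err) => {
        // Log and swallow the error so one failed request does not reject the whole batch.
        console.log(err);
        return;
      })
  }, {
    // Run at most 10 requests at the same time.
    concurrency: 10
  }); //end Promise.map
}

getList();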
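If you'd rather not pull in Bluebird, the same idea can be sketched with native promises: return a promise from every request and every write, collect them with Promise.all, and only send the response after awaiting the outer promise. This is just a rough sketch, not a drop-in fix. The names updateAllLogs, fetchRows, and writeRow are hypothetical stand-ins for the request-plus-Cheerio step and the Firestore update in the question, and the fakes below exist only so the snippet runs on its own. Unlike the Promise.map call above, it starts all requests at once with no concurrency limit.

```javascript
// fetchRows(url): hypothetical helper that resolves to the rows scraped from one page.
// writeRow(row): hypothetical helper that resolves once one row has been written.
async function updateAllLogs(urls, fetchRows, writeRow) {
  // One promise per URL. On success it resolves only after
  // every write for that URL has resolved.
  return Promise.all(urls.map(async (url) => {
    try {
      const rows = await fetchRows(url);
      await Promise.all(rows.map((row) => writeRow(row)));
      return { url, written: rows.length };
    } catch (err) {
      // Report the failure instead of rejecting, so one bad page
      // does not abort the remaining URLs.
      return { url, error: String(err) };
    }
  }));
}

// In-memory fakes (assumptions for the demo, not real Firestore or HTTP calls).
const written = [];
const fakeFetchRows = async (url) => {
  if (url === 'bad') throw new Error('request failed');
  return [`${url}#1`, `${url}#2`];
};
const fakeWriteRow = (row) => new Promise((resolve) => {
  setTimeout(() => { written.push(row); resolve(); }, 10);
});

// In an HTTP function the handler would await this and then call
// response.send(), so the function is not terminated early.
const done = updateAllLogs(['a', 'b', 'bad'], fakeFetchRows, fakeWriteRow);
done.then((results) => console.log(results, written.length));
```

The important part is that nothing is "fire and forget": the inner loop maps rows to promises instead of calling the write inside .each() and dropping the result. With many URLs you would probably still want a concurrency cap like the one in the Bluebird version.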