Unable to complete promises due to out of memory


Question

I have a script to scrape ~1000 webpages. I'm using Promise.all to fire them together, and it returns when all pages are done:

Promise.all(urls.map(url => scrap(url)))
    .then(results => console.log('all done!', results));

This is sweet and correct, except for one thing - the machine runs out of memory because of the concurrent requests. I'm using jsdom for scraping, and it quickly takes up a few GB of memory, which is understandable considering it instantiates hundreds of window objects.

I have an idea for a fix, but I don't like it: change the control flow to not use Promise.all, and instead chain my promises:

let results = {};
urls.reduce((prev, cur) =>
    prev
        .then(() => scrap(cur))
        .then(result => results[cur] = result)
        // ^ not so nice. 
, Promise.resolve())
    .then(() => console.log('all done!', results));

This is not as good as Promise.all... It's not performant since everything is chained, and the returned values have to be stored for later processing.

Any suggestions? Should I improve the control flow, improve memory usage in scrap(), or is there a way to let Node throttle memory allocation?

Answer

You are trying to run 1000 web scrapes in parallel. You will need to pick some number significantly less than 1000 and run only N at a time so you consume less memory while doing so. You can still use a promise to keep track of when they are all done.

Bluebird's Promise.map() can do that for you by just passing a concurrency value as an option. Or, you could write it yourself.

I have an idea to fix but I don't like it. That is, change control flow to not use Promise.all, but chain my promises:

What you want is N operations in flight at the same time. Sequencing is the special case where N = 1, which is often much slower than running some of them in parallel (perhaps with N = 10).

This is not as good as Promise.all... Not performant as it's chained, and returned values have to be stored for later processing.

If the stored values are part of your memory problem, you may have to store them out of memory somewhere anyway. You will have to analyze how much memory the stored results are using.

Any suggestions? Should I improve the control flow or should I improve mem usage in scrap(), or is there a way to let node throttle mem allocation?

Use Bluebird's Promise.map() or write something similar yourself. Writing something that runs up to N operations in parallel and keeps all the results in order is not rocket science, but it is a bit of work to get it right. I've presented it before in another answer, but can't seem to find it right now. I will keep looking.

Found my prior related answer here: Make several requests to an API that can only handle 20 request a minute
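A hand-rolled limiter along those lines might look like the sketch below. mapConcurrent is an illustrative name; fn is any function returning a promise (for example the scrap(url) from the question), and results come back in input order:

```javascript
// Run at most `limit` promise-returning tasks at once, resolving with
// all results in the original input order.
function mapConcurrent(items, limit, fn) {
    return new Promise((resolve, reject) => {
        const results = new Array(items.length);
        let inFlight = 0;   // tasks currently pending
        let nextIndex = 0;  // next item to launch
        let remaining = items.length;
        if (remaining === 0) return resolve(results);

        function launchNext() {
            // Top the pool back up to the limit whenever a slot frees.
            while (inFlight < limit && nextIndex < items.length) {
                const i = nextIndex++;
                inFlight++;
                Promise.resolve(fn(items[i], i)).then(value => {
                    results[i] = value; // index keeps results in order
                    inFlight--;
                    if (--remaining === 0) {
                        resolve(results);
                    } else {
                        launchNext();
                    }
                }, reject);
            }
        }
        launchNext();
    });
}
```

For the question's case this would be called as mapConcurrent(urls, 10, url => scrap(url)).then(results => ...), so only about 10 jsdom windows exist at any moment instead of 1000.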

