CoGbkResult has more than 10000 elements, reiteration (which may be slow) is required


Question

I'm seeing this message in a job that does run more slowly than similar jobs with slightly different inputs.

What does it mean that reiteration is required? Does it only affect performance, or does it mean that my code could run twice on the same inputs? (My code does occasionally have side effects.)

Thanks! G

Recommended answer

This means that the joined PCollection is too large to keep in memory, so fetching elements from it is less efficient than it would be if the entire collection fit in memory. We reiterate over the materialized input to the CoGroupByKey, but your code is not re-run, so this only affects performance.
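To make the distinction between re-reading the input and re-running user code concrete, here is a toy model in plain Python. It is only a sketch: `CachedPrefixIterable`, `make_input`, and the small `cache_limit` are hypothetical names chosen for illustration, not the Dataflow implementation. It assumes the behaviour described above: elements up to a limit are served from memory, and anything past the limit is fetched again from the stored input on every pass.

```python
import itertools
from typing import Callable, Iterator, List


class CachedPrefixIterable:
    """Toy model only (hypothetical class, not the Dataflow implementation)."""

    def __init__(self, open_source: Callable[[], Iterator[int]], cache_limit: int):
        self._open_source = open_source
        self._cache_limit = cache_limit
        self.source_reads = 0  # how many times the stored input was opened

        first_pass = self._open()
        # Keep at most cache_limit elements in memory.
        self._head: List[int] = list(itertools.islice(first_pass, cache_limit))
        # Probe for one more element to learn whether the input overflowed.
        overflow_marker = object()
        self.needs_reiteration = next(first_pass, overflow_marker) is not overflow_marker

    def _open(self) -> Iterator[int]:
        self.source_reads += 1
        return self._open_source()

    def __iter__(self) -> Iterator[int]:
        # The cached prefix is always served from memory.
        yield from self._head
        if self.needs_reiteration:
            # The tail did not fit, so each pass re-reads the stored input
            # and skips the part that is already cached.
            yield from itertools.islice(self._open(), self._cache_limit, None)


def make_input(n: int) -> Callable[[], Iterator[int]]:
    """Hypothetical stand-in for a materialized input of n elements."""
    return lambda: iter(range(n))


small = CachedPrefixIterable(make_input(5), cache_limit=10)
large = CachedPrefixIterable(make_input(25), cache_limit=10)

# Iterate each collection twice, as user code that loops more than once would.
small_passes = [list(small), list(small)]
large_passes = [list(large), list(large)]
print(small.source_reads, large.source_reads)
```

In this model both collections return the same elements on every pass. The only difference is `source_reads`: the small input is opened once, while the large one is opened again for each pass, which is where the extra cost comes from.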

It's worth noting that code with side effects may be run more than once in the case of worker failure.
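One common way to live with such retries is to make the side effect idempotent, so that running it twice leaves the same result as running it once. The following is a minimal sketch under stated assumptions: `IdempotentSink` and `process_bundle` are hypothetical names, the sink stands in for an external system that supports writes keyed by an ID, and the retry is simulated by simply calling the function twice.

```python
from typing import Dict, List, Tuple


class IdempotentSink:
    """Hypothetical stand-in for an external system that supports keyed writes."""

    def __init__(self) -> None:
        self.rows: Dict[str, str] = {}
        self.write_attempts = 0

    def write(self, record_id: str, payload: str) -> None:
        # Writing the same record_id again overwrites the row with identical
        # data instead of appending a duplicate.
        self.write_attempts += 1
        self.rows[record_id] = payload


def process_bundle(bundle: List[Tuple[str, str]], sink: IdempotentSink) -> None:
    for key, value in bundle:
        # The ID is derived only from the input, so a retry produces the same ID.
        record_id = f"{key}:{value}"
        sink.write(record_id, value.upper())


sink = IdempotentSink()
bundle = [("user1", "a"), ("user2", "b")]

process_bundle(bundle, sink)
process_bundle(bundle, sink)  # simulated retry after a worker failure

print(sink.write_attempts, len(sink.rows))
```

The writes are attempted twice, but the sink ends up with one row per input element. Whether this pattern applies depends on the external system accepting a caller-supplied ID, which is an assumption here.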
