Multiple CoGroupByKey with the same key in Apache Beam


Question


I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key I use for the CoGroupByKey is the same for both. Is there a way to avoid reshuffling the left side of the join after the first one completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second groupByKey still seems to take a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can skip grouping one side, but I would feel safer not going down to that level at this point.

This is from before the Iterables were kept grouped after the first join.

Recommended answer


Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.
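A minimal sketch of what that could look like with the Beam Java SDK is below. The class name `SideInputJoinSketch`, the helper `lookup`, the `String` key and value types, and the output format are all assumptions made for illustration, not taken from the question's pipeline. The idea is that the main input is never grouped: each element does a keyed lookup into the two multimap side inputs inside a single `ParDo`. Whether a 17.35GB side input performs well enough here is something you would have to measure on your runner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// Hypothetical sketch: joins a large keyed collection against two smaller
// ones through multimap side inputs instead of two CoGroupByKey steps.
public class SideInputJoinSketch {

  // Collects the values stored under `key`, or an empty list if the key is absent.
  public static List<String> lookup(Map<String, Iterable<String>> side, String key) {
    List<String> out = new ArrayList<>();
    Iterable<String> matches = side.get(key);
    if (matches != null) {
      for (String m : matches) {
        out.add(m);
      }
    }
    return out;
  }

  // `main` stands in for the 1.5TB stream; `smallA` and `smallB` stand in for
  // the two smaller datasets. All three are assumed to share the same key.
  public static PCollection<KV<String, String>> joinWithSideInputs(
      PCollection<KV<String, String>> main,
      PCollection<KV<String, String>> smallA,
      PCollection<KV<String, String>> smallB) {

    // Materialize each smaller dataset as a key -> values multimap view.
    final PCollectionView<Map<String, Iterable<String>>> viewA =
        smallA.apply("SmallAAsMultimap", View.<String, String>asMultimap());
    final PCollectionView<Map<String, Iterable<String>>> viewB =
        smallB.apply("SmallBAsMultimap", View.<String, String>asMultimap());

    return main.apply(
        "LookupJoin",
        ParDo.of(
                new DoFn<KV<String, String>, KV<String, String>>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    String key = c.element().getKey();
                    // Keyed lookups only; the main input is not shuffled here.
                    List<String> a = lookup(c.sideInput(viewA), key);
                    List<String> b = lookup(c.sideInput(viewB), key);
                    // Placeholder output format, chosen only for the sketch.
                    c.output(KV.of(key, c.element().getValue() + " | " + a + " | " + b));
                  }
                })
            .withSideInputs(viewA, viewB));
  }
}
```

If each key maps to at most one value in a smaller dataset, `View.asMap()` could be used in place of `View.asMultimap()`, with the view type becoming `PCollectionView<Map<String, String>>`.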
