Multiple CoGroupByKey with the same key in Apache Beam

Question

I need to join the main data stream in my pipeline (1.5TB) to 2 different datasets (4.92GB and 17.35GB). The key I use for the CoGroupByKey is the same for both joins. Is there a way to avoid reshuffling the left side of the join after the first one completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second GroupByKey still takes a lot longer than I would expect. I was going to start pulling apart CoGroupByKey to see if I can skip grouping one side, but I would feel safer not going down to that level at this point.

This is from before I kept the iterable grouped after the first join.

Recommended answer

Have you considered accessing the smaller datasets as View.asMap() or View.asMultimap() side inputs when processing the main input? The Dataflow runner has an optimized implementation of map and multimap side inputs which performs key lookups efficiently without loading the whole data into memory.
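
To make the idea concrete, here is a minimal plain-Java sketch of the lookup pattern. It is not Beam code and not the asker's pipeline. In Beam's Java SDK a View.asMultimap() side input is read inside a DoFn as a Map<K, Iterable<V>>, so the two maps below stand in for the two smaller datasets. The class name LookupJoinSketch, the Joined type, and the sample keys and values are all hypothetical. The point it illustrates is that the main stream is traversed once and is never grouped or shuffled by key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of a side-input lookup join (assumed names, not the Beam API).
class LookupJoinSketch {

    // One output record: a main-stream element plus its matches from both small datasets.
    static final class Joined {
        final String key;
        final String mainValue;
        final List<String> fromA;
        final List<String> fromB;

        Joined(String key, String mainValue, List<String> fromA, List<String> fromB) {
            this.key = key;
            this.mainValue = mainValue;
            this.fromA = fromA;
            this.fromB = fromB;
        }

        @Override
        public String toString() {
            return key + ": " + mainValue + " A=" + fromA + " B=" + fromB;
        }
    }

    // Single pass over the main input; each element does two key lookups.
    // A key missing from a small dataset yields an empty list (outer-join behavior).
    static List<Joined> join(Iterable<Map.Entry<String, String>> main,
                             Map<String, List<String>> smallA,
                             Map<String, List<String>> smallB) {
        List<Joined> out = new ArrayList<>();
        for (Map.Entry<String, String> element : main) {
            String key = element.getKey();
            out.add(new Joined(
                    key,
                    element.getValue(),
                    smallA.getOrDefault(key, List.of()),
                    smallB.getOrDefault(key, List.of())));
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny stand-ins for the 1.5TB main stream and the two smaller datasets.
        List<Map.Entry<String, String>> main =
                List.of(Map.entry("k1", "m1"), Map.entry("k2", "m2"));
        Map<String, List<String>> smallA = Map.of("k1", List.of("a1", "a2"));
        Map<String, List<String>> smallB = Map.of("k2", List.of("b1"));

        for (Joined j : join(main, smallA, smallB)) {
            System.out.println(j);
        }
    }
}
```

In an actual pipeline the loop body would sit in a DoFn's process method and the two maps would come from the side-input views attached with withSideInputs. Whether this beats two CoGroupByKey steps depends on the runner and on how the side inputs are materialized, so it is worth measuring on real data.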
