Multiple CoGroupByKey with the same key in Apache Beam

Views: 26  Published: 2021/11/11 22:41:45  Tags: google-cloud-dataflow, dataflow, apache-beam


Problem description

I need to join the main data stream in my pipeline (1.5TB) to 2 different datasets (4.92GB and 17.35GB). The key I use for the CoGroupByKey is the same for both joins. Is there a way to avoid reshuffling the left side of the join after the first join completes?

Currently I am just leaving the output of the first join as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second groupByKey still takes a lot longer than I would expect. I was going to start pulling apart CoGroupByKey to see whether I can skip grouping one side, but I would feel safer not going down to that level at this point.
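Because both joins use the same key, one direction worth trying is to pass all three inputs to a single CoGroupByKey, so the main stream is grouped once rather than twice. The snippet below is only a plain-Python model of what such a three-input co-group computes. It is not Beam code, and `co_group_by_key`, the tag names, and the toy data are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict


def co_group_by_key(**tagged_inputs):
    """Group any number of tagged (key, value) collections by key.

    Plain-Python model of a multi-input co-group; NOT the Beam API.
    Each input is visited exactly once.
    """
    # Every key gets one (possibly empty) list per input tag.
    grouped = defaultdict(lambda: {tag: [] for tag in tagged_inputs})
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


# Hypothetical toy data standing in for the 1.5TB stream and the two datasets.
main = [("k1", "m1"), ("k1", "m2"), ("k2", "m3")]
small_a = [("k1", "a1")]
small_b = [("k2", "b1"), ("k1", "b2")]

# One grouping step covers all three inputs.
joined = co_group_by_key(main=main, a=small_a, b=small_b)

for key, groups in sorted(joined.items()):
    print(key, groups)
```

In the Beam Java SDK the corresponding shape would be `KeyedPCollectionTuple.of(mainTag, main).and(aTag, a).and(bTag, b).apply(CoGroupByKey.create())`, and in the Python SDK a dict of tagged PCollections piped into `beam.CoGroupByKey()`. Whether this actually shortens the job for data of this size is not established by the question and would need to be measured.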

This is before keeping the iterables grouped after the first join:
