Multiple CoGroupByKey with the same key in Apache Beam

Views: 98 Published: 2020/9/3 5:21:53 Tags: google-cloud-dataflow dataflow apache-beam


Problem description

I have a situation where I need to join the main data stream (1.5TB) in my pipeline to 2 different datasets (4.92GB and 17.35GB). The key I use for the CoGroupByKey is the same for both joins. Is there a way to avoid reshuffling the left side of the join after the first one completes? Currently I am just leaving the output as a KV>. This seems to be better than emitting each element piecewise after the first join, but the second GroupByKey still seems to be taking a lot longer than I would expect. I was going to start looking into pulling apart CoGroupByKey to see if I can skip grouping one side, but I would really feel safer not going down to that level at this point.
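To make the shape of the problem concrete, here is a minimal sketch of what co-grouping all three inputs on the shared key in a single pass would produce, rather than two successive joins. It is a plain-Python model with no Beam dependency, and everything in it is an assumption for illustration: the helper `co_group_by_key`, the tags `main`, `small_a`, `small_b`, and the sample data are all hypothetical. In the Beam Python SDK the corresponding construct would be a dict of tagged PCollections piped into `beam.CoGroupByKey()`; whether a single three-way grouping actually performs better for these data sizes is not something this sketch can show.

```python
from collections import defaultdict


def co_group_by_key(**tagged_inputs):
    """Group several keyed inputs in one pass (hypothetical helper).

    Each keyword argument is a tag mapped to an iterable of (key, value)
    pairs. Returns {key: {tag: [values...]}}, with an empty list for any
    tag that has no values under a given key.
    """
    tags = list(tagged_inputs)
    grouped = defaultdict(lambda: {tag: [] for tag in tags})
    for tag, pairs in tagged_inputs.items():
        for key, value in pairs:
            grouped[key][tag].append(value)
    return dict(grouped)


# Hypothetical sample data standing in for the main stream and the
# two smaller datasets, all keyed the same way.
main = [("k1", "m1"), ("k1", "m2"), ("k2", "m3")]
small_a = [("k1", "a1")]
small_b = [("k2", "b1"), ("k3", "b2")]

result = co_group_by_key(main=main, small_a=small_a, small_b=small_b)
```

Each key ends up with one grouped record holding the values from every input, so the main stream's values are collected once per key instead of being regrouped for the second join.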

This is before grouping the Iterables after the first join
