How to output all values from a <key,value> pair, grouped by key, using Google Dataflow

Question

I'm trying to do something that seems relatively straightforward but am running into some difficulty.

I have a bunch of text, and each line is a value. I analyze each line of text, create the appropriate key, then emit KV pairs. I then use the GroupByKey transform. Finally, I want to output all the text now grouped by key (bonus points if I can get one text file for each key, but I'm not sure that's possible).

Here's what the pipeline's apply looks like:

    public PCollection<String> apply(PCollection<String> generator) {

        // Returns individual lines of text as <String,String> KV pairs
        PCollection<KV<String,String>> generatedTextKV = generator.apply(
                ParDo.of(new GeneratorByLineFn()));

        // Groups the <String,String> KV pairs by key
        PCollection<KV<String, Iterable<String>>> groupedText = generatedTextKV.apply(
            GroupByKey.<String, String>create());

        // Hopefully returns output where all of each key's values are together
        PCollection<String> results = groupedText.apply(ParDo.of(new FormatOutputFn()));

        return results;
    }
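
For context, GeneratorByLineFn itself is straightforward; a rough sketch (the key-extraction logic here is only a placeholder) looks like this:

    // Emits each line keyed by whatever key the analysis derives from it.
    static class GeneratorByLineFn extends DoFn<String, KV<String, String>> {
        @Override
        public void processElement(ProcessContext c) {
            String line = c.element();
            String key = line.split(",", 2)[0];  // placeholder: derive the real key here
            c.output(KV.of(key, line));
        }
    }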

Unfortunately, I cannot get the FormatOutputFn() to work as desired.

Iterating over the Iterable<String> and outputting each value doesn't guarantee the key,value grouping (please correct me if I'm wrong about this, then my problem is solved). I then tried using StringBuilder(), which works with small datasets but unsurprisingly generates java.lang.OutOfMemoryError: Java heap space errors in the log on larger data. I also tried the Flatten.FlattenIterables transform, but that doesn't work either since the value in the K,V pair is not a PCollection, but just a regular Iterable.

I've seen this question on analysis by common key, but from the answer it is not entirely clear to me exactly what I should do with my situation. I think I have to use Combine.PerKey, but I'm not exactly sure how to use it. I'm also assuming there has to be a pre-baked way to do this, but I can't find that pre-baked way in the docs. I'm sure I'm just not looking in the right place.

And, as mentioned above, if there is a way to get text file output where the name of the text file is the key and the values are all in the file, that would be amazing. But I don't think Dataflow can do this (yet?).

Thanks for reading.

Answer

Dataflow doesn't currently support any notion of ordering on PCollections. You are correct that there is no guarantee that 'results' has an ordering, including key grouping. We would like to add ordering properties for PCollections at some point, but the timeline for that is not yet known.

Certain runners may appear to have ordering in certain situations, due to underlying implementation details. For example, if FormatOutputFn is fusing with a Write step, then it's likely you will see grouping because each KV<K, Iterable<V>> is processed into multiple <K,V>s which are written to the file before the next KV<K, Iterable<V>> is processed. But again this is just an artifact of how Dataflow chooses to optimize this particular case and should not be relied on generally.

As you already figured out, if a single element could fit in memory, you could have FormatOutputFn convert each KV<K, Iterable<V>> into a single String which contains multiple newlines.
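
For the in-memory case, a minimal sketch of such a FormatOutputFn (assuming the Dataflow SDK 1.x DoFn API; prepending the key as a header line is just one way to label each group) might look like:

    // Joins each grouped Iterable into one newline-separated String.
    // Only safe when all of a single key's values fit in memory.
    static class FormatOutputFn extends DoFn<KV<String, Iterable<String>>, String> {
        @Override
        public void processElement(ProcessContext c) {
            StringBuilder sb = new StringBuilder();
            sb.append(c.element().getKey()).append(":\n");  // optional header naming the group
            for (String value : c.element().getValue()) {
                sb.append(value).append("\n");
            }
            c.output(sb.toString());
        }
    }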

Given that is not the case here, the best solution I can think of is to write the files by hand -- so FormatOutputFn takes each KV<K, Iterable<V>> and uses standard GCS libraries to open a file named K and write the Iterable<V> to it. The bad news is this gets a little tricky because you need to be aware of how our fault tolerance semantics might retry elements. But the good news is that we're currently working on libraries to help make these types of custom sinks easier.
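
A rough sketch of that hand-written sink is below. It assumes the google-cloud-storage client library and a hypothetical bucket name, neither of which comes from the original answer. It streams the values so the whole Iterable never has to be held in memory, and it derives the object name only from the key so that a retried element overwrites the same object instead of creating a duplicate.

    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.io.Writer;
    import java.nio.channels.Channels;
    import java.nio.charset.StandardCharsets;

    // Writes each key's values to gs://<bucket>/<key>.txt by hand.
    static class WritePerKeyFn extends DoFn<KV<String, Iterable<String>>, String> {
        private static final String BUCKET = "my-output-bucket";  // hypothetical bucket name

        @Override
        public void processElement(ProcessContext c) throws Exception {
            Storage storage = StorageOptions.getDefaultInstance().getService();
            String objectName = c.element().getKey() + ".txt";
            BlobInfo blob = BlobInfo.newBuilder(BUCKET, objectName).build();
            // Stream the values into the object so they never all sit in memory at once.
            try (Writer writer = Channels.newWriter(storage.writer(blob), StandardCharsets.UTF_8.name())) {
                for (String value : c.element().getValue()) {
                    writer.write(value);
                    writer.write("\n");
                }
            }
            c.output("gs://" + BUCKET + "/" + objectName);
        }
    }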

As for the zero-length files, there's an awesome answer here: Why are zero byte files written to GCS when running a pipeline?
