How to count the total number of rows in a file using Google Dataflow

Question

I would like to know whether there is a way to find the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method like this:

int getCount(String fileName) {}

The method above should return the total row count, and its implementation should be Dataflow code.

Thanks.

Recommended answer

It seems your use case does not require distributed processing: the file is compressed and therefore cannot be read in parallel. However, you may still find the Dataflow APIs useful for their easy access to GCS and automatic decompression.

Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner. It runs in-process, without talking to the Dataflow service or doing any distributed processing, but in return it lets you extract the contents of a PCollection into Java objects.

Something like this:

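import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.values.PCollection;

PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))   // one String element per line of the file
     .apply(Count.<String>globally());       // a single Long element: the total count
DirectPipelineRunner.EvaluationResults results = runner.run(p);  // executes in-process
long count = results.getPCollection(countPC).get(0);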
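
For comparison, since the answer notes that this use case does not need distributed processing, the same count can be obtained for a local copy of the file with plain Java and no Dataflow at all. The following is only a sketch under stated assumptions: the class name LocalLineCounter and the method countLines are hypothetical, the file is assumed to be on the local filesystem (it does not read gs:// paths), UTF-8 text is assumed, and a ".gz" suffix is assumed to mean gzip compression.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of any Dataflow API.
final class LocalLineCounter {

    // Counts the lines of a local text file, decompressing it first
    // when the file name ends in ".gz" (an assumption of this sketch).
    static long countLines(Path file) throws IOException {
        boolean gzip = file.getFileName().toString().endsWith(".gz");
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = gzip ? new GZIPInputStream(raw) : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // lines() streams the file line by line, so the whole file
            // is never held in memory at once.
            return reader.lines().count();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LocalLineCounter <file>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

Calling LocalLineCounter.countLines(Paths.get("rows.txt.gz")), where rows.txt.gz is a placeholder file name, returns the count as a long. The Dataflow version above is still the more convenient choice when the file lives in GCS.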
