How to count the total number of rows in a file using Google Dataflow
Question
I would like to know whether there is a way to find the total number of rows in a file using Google Dataflow. Any code samples or pointers would be a great help. Basically, I have a method like this:
int getCount(String fileName) {}
The method above should return the total row count, and its implementation would be Dataflow code.
Thanks
Recommended answer
It seems your use case does not require distributed processing: the file is compressed, so it cannot be read in parallel. However, you may still find the Dataflow APIs useful for their easy access to GCS and their automatic decompression.
Since you also want to get the result out of the pipeline as an actual Java object, you need to use the direct runner. It runs in-process, without talking to the Dataflow service or doing any distributed processing, but in return it lets you extract the contents of a PCollection into Java objects.
Something like this:
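// Classes come from the Dataflow SDK 1.x for Java (package com.google.cloud.dataflow.sdk).
PipelineOptions options = ...;
// The direct runner executes the pipeline in-process.
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
// Read the file as lines of text, then count all elements globally.
PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());
// Run the pipeline and pull the single-element result back into a Java value.
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);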
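The first point in the answer, that a single compressed file gains nothing from distributed processing, can be illustrated without Dataflow at all. The following is a minimal sketch in plain Java, assuming the file has already been copied to local disk (it does not read from GCS) and that gzip compression is signalled by a ".gz" suffix. The class name LocalLineCounter and the method countLines are hypothetical helpers for illustration, not part of any Dataflow API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the Dataflow SDK.
final class LocalLineCounter {

    // Counts the lines in a local file, reading it sequentially.
    // Assumption: a name ending in ".gz" means the file is gzip-compressed.
    static long countLines(Path file) throws IOException {
        try (InputStream raw = Files.newInputStream(file);
             InputStream in = file.toString().endsWith(".gz")
                     ? new GZIPInputStream(raw)
                     : raw;
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LocalLineCounter <file>");
            return;
        }
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

This version trades the GCS access and automatic decompression that the Dataflow APIs provide for having no SDK dependency, so it only fits cases where the file is available locally.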