Cloud Dataflow: reading entire text files rather than line by line
Question
I'm looking for a way to read ENTIRE files, so that every file is read completely into a single String. I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and have a ParDo process each and every file in its entirety.
What's the best approach to it?
Answer
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(&lt;source&gt;). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
- FileBasedSource#isSplittable(), which you will want to override to return false. This will indicate that there is no intra-file splitting.
- FileBasedSource#createForSubrangeOfFile(String, long, long), which you will override to return a sub-source for just the file specified.
- FileBasedSource#createSingleFileReader(), which you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
- FileBasedReader#startReading(...), which you will override to do nothing; the framework will already have opened the file for you, and it will close it.
- FileBasedReader#readNextRecord(), which you will override to read the entire file as a single element.
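Putting those override points together, a rough sketch of such a source and reader might look like the following. This is written against the Dataflow Java SDK; the class names (WholeFileSource, WholeFileReader) and the buffer size are my own, and exact method signatures can differ between SDK versions, so treat it as an outline rather than a drop-in implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSource;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

/** Reads each file matched by the pattern as one String element. */
class WholeFileSource extends FileBasedSource<String> {

  public WholeFileSource(String fileOrPattern) {
    // A huge minBundleSize further discourages splitting.
    super(fileOrPattern, Long.MAX_VALUE);
  }

  private WholeFileSource(String fileName, long start, long end) {
    super(fileName, Long.MAX_VALUE, start, end);
  }

  @Override
  public boolean isSplittable() {
    return false; // no intra-file splitting
  }

  @Override
  protected FileBasedSource<String> createForSubrangeOfFile(
      String fileName, long start, long end) {
    // Called once per matched file; returns a single-file sub-source.
    return new WholeFileSource(fileName, start, end);
  }

  @Override
  protected FileBasedReader<String> createSingleFileReader(PipelineOptions options) {
    return new WholeFileReader(this);
  }

  @Override
  public Coder<String> getDefaultOutputCoder() {
    return StringUtf8Coder.of();
  }

  private static class WholeFileReader extends FileBasedReader<String> {
    private ReadableByteChannel channel;
    private String current;

    WholeFileReader(WholeFileSource source) {
      super(source);
    }

    @Override
    protected void startReading(ReadableByteChannel channel) {
      // The framework opens (and later closes) the file; just keep the channel.
      this.channel = channel;
    }

    @Override
    protected boolean readNextRecord() throws IOException {
      if (current != null) {
        return false; // the one record per file has already been produced
      }
      // Drain the whole channel into a single String.
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      while (channel.read(buffer) != -1) {
        buffer.flip();
        bytes.write(buffer.array(), 0, buffer.limit());
        buffer.clear();
      }
      current = bytes.toString("UTF-8");
      return true;
    }

    @Override
    public String getCurrent() {
      return current;
    }
  }
}
```

The pipeline would then read with something along the lines of p.apply(Read.from(new WholeFileSource("gs://my_bucket/*/*.json"))).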
[1] One easy special case is when you actually have a small number of files, you can expand them prior to job submission, and they all take the same amount of time to process. Then you can just use Create.of(expand(&lt;glob&gt;)) followed by ParDo(&lt;read a file&gt;).
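For reference, that special case might be sketched like this (again against the Dataflow Java SDK; expandGlob and readWholeFile are hypothetical helpers you would supply yourself, e.g. on top of a storage-listing API):

```java
// Sketch of the special case: expand the glob client-side before job
// submission, then read each file inside a DoFn.
// expandGlob and readWholeFile are hypothetical helpers, not SDK calls.
List<String> files = expandGlob("gs://my_bucket/*/*.json");

PCollection<String> fileContents = pipeline
    .apply(Create.of(files))
    .apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) throws IOException {
        // Each element is a fully-specified file path; emit its entire contents.
        c.output(readWholeFile(c.element()));
      }
    }));
```

This avoids writing a custom source entirely, at the cost of the caveats above: the file list is fixed at submission time, and per-file work should be roughly uniform.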