Cloud Dataflow: reading entire text files rather than line by line
Question
I'm looking for a way to read ENTIRE files, so that every file is read completely into a single String. I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json and have a ParDo process each and every file in its entirety.
What's the best approach to it?
Answer
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(&lt;source&gt;). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
- FileBasedSource#isSplittable(), which you will want to override to return false. This will indicate that there is no intra-file splitting.
- FileBasedSource#createForSubrangeOfFile(String, long, long), which you will override to return a sub-source for just the file specified.
- FileBasedSource#createSingleFileReader(), which you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
- FileBasedReader#startReading(...), which you will override to do nothing; the framework will already have opened the file for you, and it will close it.
- FileBasedReader#readNextRecord(), which you will override to read the entire file as a single element.
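Putting those override points together, a rough sketch of such a source and reader might look like the following. This is written against the Dataflow Java SDK; the class names (WholeFileSource, WholeFileReader) and the buffer size are my own, and exact method signatures can differ between SDK versions, so treat it as an outline rather than a drop-in implementation:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

import com.google.cloud.dataflow.sdk.coders.Coder;
import com.google.cloud.dataflow.sdk.coders.StringUtf8Coder;
import com.google.cloud.dataflow.sdk.io.FileBasedSource;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;

/** Reads each file matched by the pattern as one String element. */
class WholeFileSource extends FileBasedSource<String> {

  public WholeFileSource(String fileOrPattern) {
    // A huge minBundleSize further discourages splitting.
    super(fileOrPattern, Long.MAX_VALUE);
  }

  private WholeFileSource(String fileName, long start, long end) {
    super(fileName, Long.MAX_VALUE, start, end);
  }

  @Override
  public boolean isSplittable() {
    return false; // no intra-file splitting
  }

  @Override
  protected FileBasedSource<String> createForSubrangeOfFile(
      String fileName, long start, long end) {
    // Called once per matched file; returns a single-file sub-source.
    return new WholeFileSource(fileName, start, end);
  }

  @Override
  protected FileBasedReader<String> createSingleFileReader(PipelineOptions options) {
    return new WholeFileReader(this);
  }

  @Override
  public Coder<String> getDefaultOutputCoder() {
    return StringUtf8Coder.of();
  }

  private static class WholeFileReader extends FileBasedReader<String> {
    private ReadableByteChannel channel;
    private String current;

    WholeFileReader(WholeFileSource source) {
      super(source);
    }

    @Override
    protected void startReading(ReadableByteChannel channel) {
      // The framework opens (and later closes) the file; just keep the channel.
      this.channel = channel;
    }

    @Override
    protected boolean readNextRecord() throws IOException {
      if (current != null) {
        return false; // the one record per file has already been produced
      }
      // Drain the whole channel into a single String.
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      while (channel.read(buffer) != -1) {
        buffer.flip();
        bytes.write(buffer.array(), 0, buffer.limit());
        buffer.clear();
      }
      current = bytes.toString("UTF-8");
      return true;
    }

    @Override
    public String getCurrent() {
      return current;
    }
  }
}
```

The pipeline would then read with something along the lines of p.apply(Read.from(new WholeFileSource("gs://my_bucket/*/*.json"))).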
[1] One easy special case is when you actually have a small number of files, you can expand them prior to job submission, and they all take the same amount of time to process. Then you can just use Create.of(expand(&lt;glob&gt;)) followed by ParDo(&lt;read a file&gt;).
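For reference, that special case might be sketched like this (again against the Dataflow Java SDK; expandGlob and readWholeFile are hypothetical helpers you would supply yourself, e.g. on top of a storage-listing API):

```java
// Sketch of the special case: expand the glob client-side before job
// submission, then read each file inside a DoFn.
// expandGlob and readWholeFile are hypothetical helpers, not SDK calls.
List<String> files = expandGlob("gs://my_bucket/*/*.json");

PCollection<String> fileContents = pipeline
    .apply(Create.of(files))
    .apply(ParDo.of(new DoFn<String, String>() {
      @Override
      public void processElement(ProcessContext c) throws IOException {
        // Each element is a fully-specified file path; emit its entire contents.
        c.output(readWholeFile(c.element()));
      }
    }));
```

This avoids writing a custom source entirely, at the cost of the caveats above: the file list is fixed at submission time, and per-file work should be roughly uniform.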