Cloud Dataflow: reading entire text files rather than line by line

Question

I'm looking for a way to read entire files, so that each file is read completely into a single String. I want to pass a pattern of JSON text files such as gs://my_bucket/*/*.json and have a ParDo process each file as a whole.

What is the best way to do this?

Recommended answer

I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.

I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.

I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:

  • FileBasedSource#isSplittable() you will want to override and return false. This will indicate that there is no intra-file splitting.
  • FileBasedSource#createForSubrangeOfFile(String, long, long) you will override to return a sub-source for just the file specified.
  • FileBasedSource#createSingleFileReader() you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).

To implement the reader:

  • FileBasedReader#startReading(...) you will override to do nothing; the framework will already have opened the file for you, and it will close it.
  • FileBasedReader#readNextRecord() you will override to read the entire file as a single element.
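
The "one file, one element" behaviour of readNextRecord() can be sketched without the SDK on the classpath. The class below is a hypothetical stand-in, not a real FileBasedReader subclass: the name WholeFileReader, the constructor that takes the open file as a java.nio ReadableByteChannel, and the UTF-8 decoding are all assumptions made for illustration. It only shows the logic a real reader would need: return true exactly once with the whole file as the current record, then return false.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a FileBasedReader subclass; it does NOT extend the SDK class. */
final class WholeFileReader {
    private final ReadableByteChannel channel; // assumption: the already-open file
    private String current;                    // the single record, once it has been read
    private boolean done;                      // true after the file has been emitted

    WholeFileReader(ReadableByteChannel channel) {
        this.channel = channel;
    }

    /** Mirrors readNextRecord(): true exactly once (the whole file), then false. */
    boolean readNextRecord() throws IOException {
        if (done) {
            current = null;
            return false;
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ByteBuffer buffer = ByteBuffer.allocate(8192);
        // Drain the channel until end of stream, copying each chunk.
        while (channel.read(buffer) != -1) {
            buffer.flip();
            bytes.write(buffer.array(), 0, buffer.limit());
            buffer.clear();
        }
        // Assumption: the files are UTF-8 encoded text.
        current = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        done = true;
        return true;
    }

    /** Returns the record produced by the last successful readNextRecord(). */
    String getCurrent() {
        if (current == null) {
            throw new NoSuchElementException("no current record");
        }
        return current;
    }

    public static void main(String[] args) throws IOException {
        byte[] json = "{\"a\": 1,\n \"b\": 2}".getBytes(StandardCharsets.UTF_8);
        WholeFileReader reader =
            new WholeFileReader(Channels.newChannel(new ByteArrayInputStream(json)));
        while (reader.readNextRecord()) {
            System.out.println(reader.getCurrent()); // the whole document, newline included
        }
    }
}
```

Because the whole file is buffered in memory as one String, this approach only suits files that comfortably fit in a worker's heap.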

[1] One simple special case is when you have only a small number of files, you can expand the glob before job submission, and the files all take about the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
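
As a rough illustration of that special case, the sketch below performs the two steps against the local filesystem using only java.nio.file. It is a stand-in under stated assumptions, not Dataflow code: the class ExpandThenRead and both of its methods are hypothetical, expand(...) here walks a local directory rather than a gs:// bucket, and the files are assumed to be UTF-8. In a pipeline, the list returned by expand would be what is handed to Create.of, and readWholeFile would be the body of the ParDo.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical local-filesystem stand-in for "expand the glob, then read each file". */
final class ExpandThenRead {

    /** Stand-in for expand(<glob>): files under root whose relative path matches the glob. */
    static List<Path> expand(Path root, String glob) throws IOException {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> matcher.matches(root.relativize(p)))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Stand-in for the ParDo body: one file in, one String out (UTF-8 assumed). */
    static String readWholeFile(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny directory tree so the demo does not depend on existing files.
        Path root = Files.createTempDirectory("expand-demo");
        Path dir = Files.createDirectories(root.resolve("a"));
        Files.write(dir.resolve("x.json"), "{\"k\": 1}".getBytes(StandardCharsets.UTF_8));

        for (Path file : expand(root, "*/*.json")) {
            System.out.println(file + " -> " + readWholeFile(file));
        }
    }
}
```

The glob is expanded once, before any processing starts, which is why this shortcut fits only when the file list is small and known at submission time.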
