Read file in order in Google Cloud Dataflow


Question


I'm using Spotify Scio to read logs that are exported from Stackdriver to Google Cloud Storage. They are JSON files where every line is a single entry. Looking at the worker logs it seems like the file is split into chunks, which are then read in any order. I've already limited my job to exactly 1 worker in this case. Is there a way to force these chunks to be read and processed in order?

As an example (textFile is basically a TextIO.Read):

import com.spotify.scio._

val sc = ScioContext(myOptions)
sc.textFile(myFile).map(line => logger.info(line))

Would produce output similar to this based on the worker logs:

line 5
line 6
line 7
line 8
<Some other work>
line 1
line 2
line 3
line 4
<Some other work>
line 9
line 10
line 11
line 12

What I want to know is if there's a way to force it to read lines 1-12 in order. I've found that gzipping the file and reading it with the CompressionType specified is a workaround but I'm wondering if there are any ways to do this that don't involve zipping or changing the original file.
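The gzip workaround works because a gzip stream has no split points: a reader must decode it sequentially from the start, so the runner treats the file as a single unsplittable shard and the lines come out in file order. The following is a minimal stdlib-only illustration of that sequential property (plain `java.util.zip`, no Beam or Scio involved); the in-memory stream and the twelve sample lines are invented for the demo.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}
import scala.io.Source

// Write twelve log lines into an in-memory gzip stream.
val raw = new ByteArrayOutputStream()
val gz = new GZIPOutputStream(raw)
gz.write((1 to 12).map(i => s"line $i").mkString("\n").getBytes("UTF-8"))
gz.close()

// A gzip stream can only be decoded sequentially from the start --
// there is no way to seek to the middle and resume decompression.
// This is why a runner reads a gzipped file as one shard, in order.
val lines = Source.fromInputStream(
  new GZIPInputStream(new ByteArrayInputStream(raw.toByteArray))
).getLines().toList
```

The trade-off is the same one the answer below describes: a single sequential reader cannot be parallelized, so this only makes sense for files small enough for one worker.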

Solution

Google Cloud Dataflow / Apache Beam currently does not support sorting or order preservation in processing pipelines. The drawback of allowing sorted output is that producing such a result for a large dataset eventually bottlenecks on a single machine, which does not scale.
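Since the runner will not preserve order for you, a common pattern (not from the answer above, just a sketch) is to make ordering explicit in the data itself: attach a sequence number or timestamp to each line when it is written, and sort by that key after reading. The re-sorting step below is plain Scala for illustration; the `(index, line)` pairs are hypothetical stand-ins for what a reader that records line offsets would emit, mirroring the chunk order from the question.

```scala
// Chunks arrive in arbitrary order, but each line carries its original index.
val chunks = Seq(
  (5, "line 5"), (6, "line 6"), (7, "line 7"), (8, "line 8"),
  (1, "line 1"), (2, "line 2"), (3, "line 3"), (4, "line 4"),
  (9, "line 9"), (10, "line 10"), (11, "line 11"), (12, "line 12")
)

// Restoring order is then a sort on the explicit key,
// independent of the order in which the chunks were read.
val ordered = chunks.sortBy(_._1).map(_._2)
```

In a real pipeline the sort would happen wherever the data is small enough to fit in one place, e.g. in a downstream consumer or after a `groupBy` per window, since a global sort reintroduces the single-machine bottleneck the answer describes.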

