Why are zero byte files written to GCS when running a pipeline?

Question

Our job/pipeline writes the results of a ParDo transformation back out to GCS, i.e. using TextIO.Write.to("gs://...").

We've noticed that when the job/pipeline completes, it leaves numerous 0 byte files in the output bucket.

The input to the pipeline comes from multiple files in GCS, so I'm assuming the results are sharded, which is fine.

But why do we get empty files?

Recommended answer

It is likely that these empty shards are the results of an intermediate pipeline step which turned out to be somewhat sparse and some pre-partitioned shards had no records in them.

E.g. if there was a GroupByKey right before the TextIO.Write and, say, the keyspace was sharded into ranges [00, 01), [01, 02), ..., [fe, ff) (255 shards total), but all the actual keys arriving at this GroupByKey fell in the ranges [34, 81) and [a3, b5), then 255 output files would be produced, and most of them would turn out empty. (This is a hypothetical partitioning scheme, just to give you the idea.)
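To make the hypothetical scheme above concrete, here is a minimal sketch in plain Python (no pipeline SDK involved). It assumes keys are hex strings whose first two digits pick the range shard; the names `shard_for_key`, `count_records_per_shard`, and `sample_keys` are made up for illustration and do not describe how the service actually partitions data.

```python
from collections import Counter

NUM_SHARDS = 255  # ranges [00, 01), [01, 02), ..., [fe, ff)


def shard_for_key(key: str) -> int:
    """Map a hex-string key to the index of the range shard that contains it."""
    index = int(key[:2], 16)
    if index >= NUM_SHARDS:
        raise ValueError(f"key {key!r} falls outside the hypothetical keyspace")
    return index


def count_records_per_shard(keys):
    """Return a list with the number of records that landed in each shard."""
    counts = Counter(shard_for_key(k) for k in keys)
    return [counts.get(i, 0) for i in range(NUM_SHARDS)]


def sample_keys():
    """Hypothetical keys: one per two-digit prefix in [34, 81) and [a3, b5)."""
    prefixes = list(range(0x34, 0x81)) + list(range(0xA3, 0xB5))
    return [f"{p:02x}-record" for p in prefixes]


if __name__ == "__main__":
    per_shard = count_records_per_shard(sample_keys())
    empty = sum(1 for n in per_shard if n == 0)
    print(f"{empty} of {NUM_SHARDS} shards are empty")
```

With one key per prefix in the two ranges, 95 shards receive data and the remaining 160 would each become an empty output file.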

The rest of my answer will be in the form of Q&A.

Why produce empty files at all? If there's nothing to output, don't create the file!
It's true that it would be technically possible to avoid producing them, e.g. by opening each file lazily when its first element is written. AFAIK we normally don't do this because empty output files are usually not an issue, and an empty file is easier to understand than a missing one: it would be pretty confusing if, say, only the first of 50 shards turned out non-empty and you had a single output file named 00001-of-000050. You'd wonder what happened to the other 49.
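The lazy-opening idea can be sketched as follows. This is only an illustration on the local filesystem, not what TextIO does: `LazyShardWriter`, `write_shards`, and the `output-xxxxx-of-yyyyy` naming are assumptions chosen for the example.

```python
import os
import tempfile


class LazyShardWriter:
    """Writes lines to a shard file, creating the file only on the first write."""

    def __init__(self, path):
        self.path = path
        self._file = None  # nothing is opened until there is data to write

    def write(self, line):
        if self._file is None:
            self._file = open(self.path, "w", encoding="utf-8")
        self._file.write(line + "\n")

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


def write_shards(directory, shards):
    """Write each shard (a list of lines) lazily; return the files that exist."""
    total = len(shards)
    for index, lines in enumerate(shards):
        name = f"output-{index:05d}-of-{total:05d}"
        writer = LazyShardWriter(os.path.join(directory, name))
        try:
            for line in lines:
                writer.write(line)
        finally:
            writer.close()
    return sorted(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Only the first of three shards has records.
        print(write_shards(tmp, [["a", "b"], [], []]))
```

Run on three shards where only the first has data, this leaves a lone `output-00000-of-00003`, which is exactly the "where did the others go?" situation described above.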

But why not add a post-processing step to delete the empty files?
In principle we could add a post-processing step that deletes the empty outputs and renames the rest (to stay consistent with the xxxxx-of-yyyyy filepattern) if empty outputs became a big issue.
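A do-it-yourself version of such a post-processing step might look like the sketch below. It works on a local directory for simplicity; against a real bucket you would do the equivalent with your storage client of choice. The function name `compact_shards` and the `output` prefix are hypothetical.

```python
import os
import tempfile


def compact_shards(directory, prefix="output"):
    """Delete empty shard files and renumber the rest as prefix-xxxxx-of-yyyyy."""
    names = sorted(n for n in os.listdir(directory) if n.startswith(prefix + "-"))

    # Pass 1: drop zero-byte files, remember the survivors in order.
    non_empty = []
    for name in names:
        path = os.path.join(directory, name)
        if os.path.getsize(path) == 0:
            os.remove(path)
        else:
            non_empty.append(path)

    # Pass 2: move survivors to temporary names so that the final renames
    # can never overwrite a file that has not been moved yet.
    staged = []
    for path in non_empty:
        tmp_path = path + ".tmp"
        os.rename(path, tmp_path)
        staged.append(tmp_path)

    # Pass 3: assign consecutive shard numbers out of the new total.
    total = len(staged)
    final_names = []
    for index, tmp_path in enumerate(staged):
        new_name = f"{prefix}-{index:05d}-of-{total:05d}"
        os.rename(tmp_path, os.path.join(directory, new_name))
        final_names.append(new_name)
    return final_names


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for i, body in enumerate(["", "a\n", "", "b\n"]):
            with open(os.path.join(tmp, f"output-{i:05d}-of-00004"), "w") as f:
                f.write(body)
        print(compact_shards(tmp))
```

Note that renumbering changes file names after the job has finished, so anything that already recorded the original names would need to be updated too.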

Does the existence of empty shards signal a problem in my pipeline?
A lot of empty shards might mean that the system-chosen sharding was suboptimal/uneven and we should have split the computation into fewer, more uniform shards. If this is a problem for you, could you give more details about your pipeline's output? E.g. your screenshot shows that the non-empty outputs are also pretty small: do they contain just a handful of records? (If so, it may be difficult to achieve uniform sharding without knowing the data in advance.)

But the shards of my original input are not empty, doesn't sharding of output mirror sharding of input?
If your pipeline has GroupByKey (or derived) operations, there will be intermediate steps where the number of shards in the input and in the output differ: e.g. an operation may consume 30 shards of input but produce 50 shards of output, or vice versa. A different number of shards in input and output is also possible in some other cases not involving GroupByKey.
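A toy model of this effect, under the assumption that grouping redistributes records by a hash of the key (the real service may partition differently): 30 non-empty input shards are regrouped into 50 output shards, and because there are only a few distinct keys, most output shards receive nothing. All names here (`shard_of`, `regroup`, `sample_input`) are illustrative.

```python
import zlib


def shard_of(key: str, num_shards: int) -> int:
    """Pick an output shard from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def regroup(input_shards, num_output_shards):
    """Group (key, value) records by key into num_output_shards buckets."""
    output = [dict() for _ in range(num_output_shards)]
    for shard in input_shards:
        for key, value in shard:
            bucket = output[shard_of(key, num_output_shards)]
            bucket.setdefault(key, []).append(value)
    return output


def sample_input(num_input_shards=30, distinct_keys=8):
    """Hypothetical input: every input shard holds exactly one record."""
    return [[(f"user-{i % distinct_keys}", i)] for i in range(num_input_shards)]


if __name__ == "__main__":
    out = regroup(sample_input(), 50)
    non_empty = sum(1 for bucket in out if bucket)
    print(f"{non_empty} non-empty and {50 - non_empty} empty output shards")
```

Since all records for one key end up in the same bucket, at most 8 of the 50 output shards can be non-empty here, even though none of the 30 input shards was empty.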

TL;DR If your overall output is correct, it's not a bug, but tell us if it is a problem for you :)
