Is Dataflow making use of Google Cloud Storage's gzip transcoding?

Question

I am trying to process JSON files (10 GB uncompressed/2 GB compressed) and I want to optimize my pipeline.

According to the official docs, Google Cloud Storage (GCS) has the option to transcode gzip files, which means the application receives them uncompressed when they are tagged correctly. Google Cloud Dataflow (GCDF) has better parallelism when dealing with uncompressed files, so I was wondering whether setting this metadata on GCS has a positive effect on performance.
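
For reference, the "meta tag" in question is the object's `Content-Encoding` header. A minimal sketch of setting it with the `google-cloud-storage` Python client (the bucket and object names here are hypothetical):

```python
from google.cloud import storage

# Hypothetical bucket and object names, for illustration only.
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("input/data.json.gz")

# Mark the object so GCS decompresses it on the fly when serving it
# (decompressive transcoding).
blob.content_encoding = "gzip"
blob.content_type = "application/json"
blob.patch()
```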

Since my input files are relatively large, does it make sense to unzip them so that Dataflow can split them into smaller chunks?
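
If you do decide to decompress up front, a rough sketch of a one-off pre-processing step (names are hypothetical; for archives this size you would stream the data rather than buffer it in memory):

```python
import gzip

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # hypothetical bucket

compressed = bucket.blob("input/data.json.gz")
uncompressed = bucket.blob("input/data.json")

# Download, decompress, and re-upload as a plain object. For a 2 GB
# archive, stream chunk by chunk instead of buffering it all at once.
data = gzip.decompress(compressed.download_as_bytes())
uncompressed.upload_from_string(data, content_type="application/json")
```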

Answer

You should not use this metadata. It's dangerous, as GCS would report the size of your file incorrectly (e.g., report the compressed size while Dataflow/Beam reads the uncompressed data).

In any case, the splitting of uncompressed files relies on reading in parallel from different segments of a file, and this is not possible if the file is originally compressed: a gzip stream must be decompressed sequentially from the start, so there is no way to begin reading at an arbitrary offset.
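
To illustrate the difference in Beam's Python SDK (the bucket paths are hypothetical): a gzip file must be read end to end by a single worker, while a plain-text file can be split into byte ranges that workers read in parallel.

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes

with beam.Pipeline() as p:
    # Gzip input: each matched file is an unsplittable unit, so one
    # worker reads the whole file sequentially.
    gzipped = p | "ReadGzip" >> beam.io.ReadFromText(
        "gs://my-bucket/input/*.json.gz",
        compression_type=CompressionTypes.GZIP,
    )

    # Plain-text input: Beam can split each file into ranges and read
    # them in parallel across workers.
    plain = p | "ReadPlain" >> beam.io.ReadFromText(
        "gs://my-bucket/input/*.json",
        compression_type=CompressionTypes.UNCOMPRESSED,
    )
```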
