相对于使用Google数据流保存在Google存储中的普通文本文件读取压缩文件时,性能相对较差 [英] Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow

查看:91
本文介绍了相对于使用Google数据流保存在Google存储中的普通文本文件读取压缩文件时,性能相对较差的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用Google数据流从云存储中读取了11.57GB的文件,并将其写入了Google BigQuery. 30名工人花了大约12分钟的时间.

I used google dataflow to read an 11.57GB file from cloud storage and wrote them to google BigQuery. It took around 12 mins with 30 workers.

然后我压缩了相同的文件(大小现在变为1.06GB),然后再次使用google dataflow从google存储中读取它们并将其写入BigQuery.现在,与30名工人一起花了大约31分钟.

I then compressed the same file(size now became 1.06GB) and then again read them from google storage using google dataflow and wrote them to BigQuery. It now took around 31 mins with same 30 workers.

这两个数据流作业都具有相同的管道选项,除了第一个数据流作业中的输入文件未压缩,而第二个数据流作业中的输入文件已压缩.

Both the dataflow jobs had same pipeline options except the input file in first dataflow job was uncompressed but the input file was compressed in the second datatflow job.

当Google数据流读取压缩文件时,性能似乎会出现大幅下降.

It seems there is a huge drop in performance when google dataflow reads compressed files.

读取压缩文件时,ParDo转换和BigQueryIO转换的速度降低了50%以上.

The speed of ParDo transform and BigQueryIO transform drops by more 50% when reading compressed files.

即使我将工人的数量增加到200,也似乎并没有改善,因为读取相同的压缩文件并写入bigquery仍然需要28分钟.

It does not seem to improve even when I increase the number of workers to 200 as it still took 28mins to read the same compressed file and write to bigquery

读取压缩文件时是否可以加快整个过程?

Is there a way to speed the entire process when reading compressed files?

推荐答案

从压缩数据读取时,每个文件只能由一个工作人员处理;从未压缩的数据读取时,可以更好地并行化工作.由于只有一个文件,因此可以解释您所看到的性能差异.

When reading from compressed data, each file can only be processed by one worker; when reading from uncompressed data the work can be parallelized much better. Since you have only one file, that explains the performance difference you are seeing.

加快速度的最佳选择是使用未压缩的输入,或使用多个较小的文件.另外,为降低成本,在读取一个压缩文件时,您可以减少工作人员的工作量.

The best options for speeding this up are to use uncompressed input, or using multiple smaller files. Alternatively, to reduce costs, you could run on fewer workers when reading the one compressed file.

这篇关于相对于使用Google数据流保存在Google存储中的普通文本文件读取压缩文件时,性能相对较差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆