Google Dataflow write to BigQuery table performance


Question

I compared the performance of processing data and writing the output to BigQuery tables versus files, and the difference is significant:

Input: 1.5M records from about 600 files. Transform: construct/convert a few fields in each record, construct a key, and emit key/value pairs; eventually the records for each key go to one target, either a file or a table.

It took 7 minutes to write to 13 files, and over 60 minutes to write to 13 BigQuery tables.

I'm trying to understand whether this is the expected outcome or whether I did something wrong. What factors should be considered when writing to BigQuery tables?

Please help; this could be a showstopper for what I'm trying to do.

Recommended answer

For batch jobs, Dataflow imports data into BigQuery by writing it to GCS and then running BigQuery jobs to import that data into BigQuery. If you want to know how long the BigQuery jobs are taking, I think you can look at the BigQuery jobs run in your project.

You can try the following command to get information about your BigQuery import jobs.

  bq ls -j <PROJECT ID>:

The above command should show you a list of jobs along with details such as their duration. (Note the colon at the end of the project ID; I think the colon is required.)

Then you can try

  bq show -j <JOB ID>

to get additional information about the job.
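To compare the time spent in the BigQuery import jobs with the overall pipeline time, the job details can also be read programmatically. The following is only a minimal sketch, assuming the job is fetched as JSON (for example with `bq --format=json show -j <JOB ID>`) and that it has already finished, so its `statistics` block carries `creationTime`, `startTime` and `endTime` as millisecond timestamps. The function name `job_timing_seconds` and the sample values are made up for illustration.

```python
import json


def job_timing_seconds(job):
    """Return (queue_wait, run_time) in seconds for a finished BigQuery job.

    Assumes `job` is a job resource dict whose statistics hold
    millisecond-since-epoch timestamps encoded as strings.
    """
    stats = job["statistics"]
    created = int(stats["creationTime"])
    started = int(stats["startTime"])
    ended = int(stats["endTime"])  # absent while the job is still running
    return (started - created) / 1000.0, (ended - started) / 1000.0


# Made-up sample standing in for the JSON printed by `bq show`.
sample = json.loads("""
{
  "configuration": {"jobType": "LOAD"},
  "statistics": {
    "creationTime": "1500000000000",
    "startTime": "1500000002000",
    "endTime": "1500000047000"
  }
}
""")

wait_s, run_s = job_timing_seconds(sample)
print("queued for %.1f s, ran for %.1f s" % (wait_s, run_s))
```

Summing the run times of the load jobs started by a pipeline gives a rough idea of how much of the 60 minutes is spent inside BigQuery rather than in Dataflow itself.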

Note that you must be an owner of the project in order to see jobs run by other users. This applies to BigQuery jobs run by Dataflow, because Dataflow uses a service account.
