Analyze Cloud Data Flow BigQuery Throughput / Pipelining


Problem Description



I'm trying to figure out how DataFlow scales certain operations and how to make them perform best. To start with, I created a simple flow that reads data from BigQuery (~25M rows, 30GB total), does a JSON extract, a simple group by key, then one aggregation over each group (~100 elements each), does another transform on each key, and puts the result back into a new table (~500k rows, 25GB total).
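
Roughly, the shape of the pipeline is the following (a simplified sketch using the Apache Beam Python SDK; the table names, field names, and aggregation are placeholders rather than my real code):

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; project, temp_location, etc. would point at real resources.
options = PipelineOptions(runner="DataflowRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        # ~25M rows, ~30GB read from the source table.
        | "ReadSource" >> beam.io.ReadFromBigQuery(
            table="my-project:my_dataset.source_table")
        # Pull the JSON payload apart and key each record.
        | "ExtractJson" >> beam.Map(
            lambda row: (row["key"], json.loads(row["payload"])))
        # Simple group by key (~100 elements per key).
        | "GroupByKey" >> beam.GroupByKey()
        # One aggregation per group, then reshape for output.
        | "Aggregate" >> beam.Map(
            lambda kv: {"key": kv[0], "value_count": sum(1 for _ in kv[1])})
        # ~500k rows written to the result table.
        | "WriteResult" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.result_table",
            schema="key:STRING,value_count:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )
```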

The total pipeline execution time for this is between 10 and 18 minutes depending on how many workers I assign, whether I use multi-core machines, etc., and I can't get it below that. I haven't determined the exact threshold, but basically the difference between 20 single-core and 10 quad-core workers is no longer reliably measurable.

So my question is how to investigate this further, figure out which step takes the most time, and how I can improve it. I assume DataFlow itself takes care of scaling the individual steps and balancing work between them. But for me it would be interesting to see, for example, when the first message is received and the last message is sent per step, and maybe the throughput over time per step. Is this available somewhere, or do I have to start instrumenting and logging it myself? With this information I would then start to optimize individual steps and maybe override DataFlow's scaling.
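
If nothing like that is exposed, my fallback would be something along these lines: wrap the interesting steps in a DoFn that updates a counter and a timestamp distribution via Beam's Metrics API (a rough sketch; the step and metric names are just illustrative):

```python
import time

import apache_beam as beam
from apache_beam.metrics import Metrics


class Instrumented(beam.DoFn):
    """Counts elements and records a wall-clock distribution so that the
    min/max of the distribution give the first/last time an element passed
    through this step, and the counter gives total throughput."""

    def __init__(self, step_name):
        self.elements = Metrics.counter(step_name, "elements")
        self.arrival_ms = Metrics.distribution(step_name, "arrival_ms")

    def process(self, element):
        self.elements.inc()
        self.arrival_ms.update(int(time.time() * 1000))
        yield element


# Usage: insert around the step to measure, e.g.
#   ... | "PreGroup" >> beam.ParDo(Instrumented("pre_group")) | beam.GroupByKey() | ...
```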

Is this the right approach, or are there better approaches available?

** My target is to get this down to 2 minutes.

Solution

Dataflow first executes a BigQuery export job to copy your data to GCS before it reads it in and processes it. Similarly, Dataflow writes the data to GCS and then does a BigQuery import job to load the table.

If you look at the Job Messages and/or the Logs you should be able to see when the export job finishes (and your code begins to read from the exported files) and when the import job starts (and your code has finished writing all output data). If the majority of the 10-18 minutes was spent in BigQuery import/export jobs, then tuning the pipeline won't have a significant impact on performance.
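
If you want to check this outside the job messages, one option is to list the project's recent BigQuery jobs and compare the extract/load durations with the pipeline's wall-clock time. A rough sketch with the google-cloud-bigquery Python client (the project ID is a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# The Dataflow-issued export shows up as an "extract" job and the final
# import as a "load" job; compare their durations to the total pipeline time.
for job in client.list_jobs(max_results=50):
    if job.job_type in ("extract", "load") and job.started and job.ended:
        duration = (job.ended - job.started).total_seconds()
        print(f"{job.job_type:8} {job.job_id}  {duration:.0f}s")
```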

Another common issue is having one or two keys which are particularly hot (e.g., containing most of the data). From your description it sounds like this isn't the case (you mention that each key has about 100 elements).
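
If you do want to rule hot keys out empirically, one cheap check is to count elements per key just before the GroupByKey and log anything far above the expected ~100 elements. A sketch (the `keyed` PCollection, threshold, and step names are illustrative):

```python
import logging

import apache_beam as beam


def _log_hot_key(key_and_count):
    key, count = key_and_count
    logging.warning("Hot key %s has %d elements", key, count)
    return key_and_count


# `keyed` is the (key, value) PCollection feeding the GroupByKey.
hot_keys = (
    keyed
    | "CountPerKey" >> beam.combiners.Count.PerKey()
    | "KeepHot" >> beam.Filter(lambda kc: kc[1] > 10_000)  # arbitrary threshold
    | "LogHot" >> beam.Map(_log_hot_key)
)
```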
