BigQueryIO - Write performance with streaming and FILE_LOADS


Problem description

My pipeline : Kafka -> Dataflow streaming (Beam v2.3) -> BigQuery

Given that low latency isn't important in my case, I use FILE_LOADS to reduce the costs, like this:

BigQueryIO.writeTableRows()
  .withJsonSchema(schema)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND)
  .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
  .withMethod(Method.FILE_LOADS)
  .withTriggeringFrequency(triggeringFrequency)      
  .withCustomGcsTempLocation(gcsTempLocation)
  .withNumFileShards(numFileShards) 
  .withoutValidation()
  .to(new SerializableFunction[ValueInSingleWindow[TableRow], TableDestination]() {
    def apply(element: ValueInSingleWindow[TableRow]): TableDestination = {
      ...
    }
  })

This Dataflow step introduces an ever-growing delay in the pipeline, so that it can't keep up with the Kafka throughput (less than 50k events/s), even with 40 n1-standard-4 workers. As shown in the screenshot below, the system lag for this step is very large (close to the pipeline's up-time), whereas the Kafka step's system lag is only a few seconds.

If I understand correctly, Dataflow writes the elements into numFileShards files in gcsTempLocation, and every triggeringFrequency a load job is started to insert them into BigQuery. For instance, if I choose a triggeringFrequency of 5 minutes, I can see (with bq ls -a -j) that all the load jobs take less than 1 minute to complete. But the step still introduces more and more delay, resulting in Kafka consuming fewer and fewer elements (due to backpressure). Increasing/decreasing numFileShards and triggeringFrequency doesn't fix the problem.
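As a side note, one thing worth sanity-checking with FILE_LOADS is how many load jobs a given triggering frequency produces per table per day, since BigQuery enforces a daily per-table load-job quota. The helper below is a minimal illustrative sketch, not part of the original pipeline; the quota constant is an assumption (the documented limit has been on the order of 1,500 per table per day), so check the current BigQuery quotas before relying on it:

```scala
// Illustrative helper (not from the question): estimate how many BigQuery
// load jobs per table per day a FILE_LOADS triggering frequency produces.
object LoadJobEstimate {
  // Assumed quota; BigQuery's documented per-table daily load-job limit
  // has been on the order of 1,500. Verify against current documentation.
  val AssumedDailyLoadQuotaPerTable = 1500

  // A load job fires once per triggering period, so the daily count is
  // simply minutes-per-day divided by the triggering frequency.
  def loadJobsPerDay(triggeringFrequencyMinutes: Int): Int =
    (24 * 60) / triggeringFrequencyMinutes

  def main(args: Array[String]): Unit = {
    val jobs = loadJobsPerDay(5) // 5-minute frequency -> 288 jobs/day
    println(s"$jobs load jobs/day, quota headroom: ${AssumedDailyLoadQuotaPerTable - jobs}")
  }
}
```

With a 5-minute triggeringFrequency this gives 288 load jobs per table per day, comfortably under the assumed quota, so the quota itself is unlikely to be the bottleneck here.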

I don't manually specify any window; I just use the default one. Files are not accumulating in gcsTempLocation.

Any idea what's going wrong here?

Answer

You mention that you don't explicitly specify a Window, which means that by default Dataflow will use the "Global window". The windowing documentation contains this warning:

Caution: Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must set a non-global windowing function. See Setting Your PCollection's Windowing Function.

If you don't set a non-global windowing function for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your Dataflow job will fail.

You can alternatively set a non-default Trigger for a PCollection to allow the global window to emit "early" results under some other conditions.
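That suggestion could be sketched as follows. This is a hedged example rather than the asker's code: it keeps the global window but adds an explicit processing-time trigger before the BigQuery write, using the Beam 2.x Java API from Scala. The `rows` collection and the 5-minute delay are assumptions for illustration:

```scala
import org.apache.beam.sdk.transforms.windowing.{
  AfterProcessingTime, GlobalWindows, Repeatedly, Window
}
import org.apache.beam.sdk.values.PCollection
import com.google.api.services.bigquery.model.TableRow
import org.joda.time.Duration

// `rows` is assumed to be the unbounded PCollection[TableRow] read from Kafka.
// Keep the global window, but fire an early pane every 5 minutes of
// processing time; discard fired panes so state does not accumulate.
val triggered: PCollection[TableRow] =
  rows.apply(
    Window.into[TableRow](new GlobalWindows())
      .triggering(
        Repeatedly.forever(
          AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
      .withAllowedLateness(Duration.ZERO) // required when setting a trigger
      .discardingFiredPanes())

// `triggered` can then be passed to the BigQueryIO.writeTableRows()
// transform shown in the question.
```

Note that Beam requires `withAllowedLateness` whenever a custom trigger is set, which is why it appears here even though the global window never closes.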

It appears that your pipeline doesn't do any explicit grouping, but I wonder if the internal grouping performed by the BigQuery write is causing the issue.

Can you see in the UI whether your downstream DropInputs step has received any elements? If not, that's an indication that data is getting held up in the upstream BigQuery step.
