BigQueryIO - Write performance with streaming and FILE_LOADS


Problem description

My pipeline : Kafka -> Dataflow streaming (Beam v2.3) -> BigQuery

Given that low latency isn't important in my case, I use FILE_LOADS to reduce the costs, like this:

BigQueryIO.writeTableRows()
  .withJsonSchema(schema)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND)
  .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
  .withMethod(Method.FILE_LOADS)
  .withTriggeringFrequency(triggeringFrequency)      
  .withCustomGcsTempLocation(gcsTempLocation)
  .withNumFileShards(numFileShards) 
  .withoutValidation()
  .to(new SerializableFunction[ValueInSingleWindow[TableRow], TableDestination]() {
    def apply(element: ValueInSingleWindow[TableRow]): TableDestination = {
      ...
    }
  })

This Dataflow step introduces an ever-growing delay in the pipeline, so that it can't keep up with the Kafka throughput (less than 50k events/s), even with 40 n1-standard-4 workers. As shown in the screenshot below, the system lag for this step is very large (close to the pipeline's up-time), whereas the Kafka step's system lag is only a few seconds.

If I understand correctly, Dataflow writes the elements into numFileShards files in gcsTempLocation, and every triggeringFrequency a load job is started to insert them into BigQuery. For instance, if I choose a triggeringFrequency of 5 minutes, I can see (with bq ls -a -j) that all the load jobs take less than 1 minute to complete. But the step still introduces more and more delay, resulting in Kafka consuming fewer and fewer elements (due to backpressure). Increasing/decreasing numFileShards and triggeringFrequency doesn't fix the problem.
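As a side note, one thing worth sanity-checking with FILE_LOADS is how many load jobs a given triggering frequency produces per table per day, since BigQuery enforces a daily per-table load-job quota. The helper below is a minimal illustrative sketch, not part of the original pipeline; the quota constant is an assumption (the documented limit has been on the order of 1,500 per table per day), so check the current BigQuery quotas before relying on it:

```scala
// Illustrative helper (not from the question): estimate how many BigQuery
// load jobs per table per day a FILE_LOADS triggering frequency produces.
object LoadJobEstimate {
  // Assumed quota; BigQuery's documented per-table daily load-job limit
  // has been on the order of 1,500. Verify against current documentation.
  val AssumedDailyLoadQuotaPerTable = 1500

  // A load job fires once per triggering period, so the daily count is
  // simply minutes-per-day divided by the triggering frequency.
  def loadJobsPerDay(triggeringFrequencyMinutes: Int): Int =
    (24 * 60) / triggeringFrequencyMinutes

  def main(args: Array[String]): Unit = {
    val jobs = loadJobsPerDay(5) // 5-minute frequency -> 288 jobs/day
    println(s"$jobs load jobs/day, quota headroom: ${AssumedDailyLoadQuotaPerTable - jobs}")
  }
}
```

With a 5-minute triggeringFrequency this gives 288 load jobs per table per day, comfortably under the assumed quota, so the quota itself is unlikely to be the bottleneck here.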

I don't manually specify any window; I just use the default one. Files are not accumulating in gcsTempLocation.

Any idea what's going wrong here?

Answer

You mention that you don't explicitly specify a Window, which means that by default Dataflow will use the "Global window". The windowing documentation contains this warning:

Caution: Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must set a non-global windowing function. See Setting Your PCollection's Windowing Function.

If you don't set a non-global windowing function for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your Dataflow job will fail.

You can alternatively set a non-default Trigger for a PCollection to allow the global window to emit "early" results under some other conditions.
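That suggestion could be sketched as follows. This is a hedged example rather than the asker's code: it keeps the global window but adds an explicit processing-time trigger before the BigQuery write, using the Beam 2.x Java API from Scala. The `rows` collection and the 5-minute delay are assumptions for illustration:

```scala
import org.apache.beam.sdk.transforms.windowing.{
  AfterProcessingTime, GlobalWindows, Repeatedly, Window
}
import org.apache.beam.sdk.values.PCollection
import com.google.api.services.bigquery.model.TableRow
import org.joda.time.Duration

// `rows` is assumed to be the unbounded PCollection[TableRow] read from Kafka.
// Keep the global window, but fire an early pane every 5 minutes of
// processing time; discard fired panes so state does not accumulate.
val triggered: PCollection[TableRow] =
  rows.apply(
    Window.into[TableRow](new GlobalWindows())
      .triggering(
        Repeatedly.forever(
          AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(5))))
      .withAllowedLateness(Duration.ZERO) // required when setting a trigger
      .discardingFiredPanes())

// `triggered` can then be passed to the BigQueryIO.writeTableRows()
// transform shown in the question.
```

Note that Beam requires `withAllowedLateness` whenever a custom trigger is set, which is why it appears here even though the global window never closes.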

It appears that your pipeline doesn't do any explicit grouping, but I wonder if the internal grouping performed by the BigQuery write is causing the issue.

Can you see in the UI whether your downstream DropInputs step has received any elements? If not, that's an indication that data is getting held up in the upstream BigQuery step.
