Kafka to Google Cloud Platform Dataflow ingestion

Question

What are the possible options for streaming, consuming, and ingesting the Kafka data from topics into BigQuery/Cloud Storage?

As per "Is it possible to use Kafka with Google Cloud Dataflow":

GCP comes with Dataflow, which is built on top of the Apache Beam programming model. Is using KafkaIO with a Beam pipeline the recommended way to perform real-time transformations on the incoming data?

https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

Kafka data can be pushed to Cloud Pub/Sub and then on to a BigQuery table. A Kafka Streams/Spark job that would sit outside of GCP can also be used.
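If the Pub/Sub route is chosen, the Pub/Sub-to-BigQuery leg could itself be a small Beam pipeline. A minimal sketch, assuming the Kafka records are already mirrored into a Pub/Sub topic (e.g. via the CloudPubSubConnector for Kafka Connect); the topic, table, schema, and MessageToTableRowFn names below are illustrative assumptions:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.ParDo;

Pipeline p = Pipeline.create();

p.apply("ReadFromPubSub", PubsubIO.readStrings()
                                  .fromTopic("projects/my-project/topics/kafka-mirror"))
 .apply("ToTableRow", ParDo.of(new MessageToTableRowFn()))   // hypothetical DoFn: message string -> TableRow
 .apply(BigQueryIO.writeTableRows()
                  .to("my-project:my_dataset.my_table")
                  .withSchema(tableSchema));                 // schema defined elsewhere

p.run();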

What are the factors to consider during the design decision, given that the data is hosted entirely on Google Cloud Platform (GCP)?

Answer

Kafka support was added to Apache Beam in 2016, with the KafkaIO set of transforms. This means that Dataflow supports it as well.
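For reference, a KafkaIO source is configured with the broker address, topic, and key/value deserializers. A minimal read sketch (the broker and topic names here are illustrative assumptions):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.common.serialization.StringDeserializer;

// Read string key/value records from Kafka into an unbounded PCollection.
PCollection<KafkaRecord<String, String>> records =
    p.apply("ReadFromKafka",
            KafkaIO.<String, String>read()
                   .withBootstrapServers("kafka-broker:9092")   // assumed broker address
                   .withTopic("my-topic")                       // assumed topic name
                   .withKeyDeserializer(StringDeserializer.class)
                   .withValueDeserializer(StringDeserializer.class));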

The easiest way to load the data into BigQuery would be with an Apache Beam pipeline running on Dataflow. Your pipeline would look something like this:

Pipeline p = Pipeline.create();

p.apply("ReadFromKafka", KafkaIO.read()                 // read records from the Kafka topic
                                .withTopic(myTopic)...)
 .apply("TransformData",                                // convert each record into a TableRow
        ParDo.of(new FormatKafkaDataToBigQueryTableRow(mySchema)))
 .apply(BigQueryIO.writeTableRows()                     // stream the rows into BigQuery
                  .to(myTableName)
                  .withSchema(mySchema));

p.run().waitUntilFinish();
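The FormatKafkaDataToBigQueryTableRow transform above is not a Beam built-in; it stands for a user-defined DoFn that turns each Kafka record into a BigQuery TableRow. One possible shape for it is sketched below (the column names are illustrative assumptions, and the schema-driven formatting implied by the mySchema constructor argument is omitted):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical sketch: a static nested class inside your pipeline class
// that maps each Kafka record onto a two-column TableRow.
static class FormatKafkaDataToBigQueryTableRow
    extends DoFn<KafkaRecord<String, String>, TableRow> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KafkaRecord<String, String> record = c.element();
    c.output(new TableRow()
        .set("key", record.getKV().getKey())       // illustrative column names
        .set("value", record.getKV().getValue()));
  }
}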

The advantages of using a Beam pipeline on Dataflow are that you would not have to manage offsets, state, and consistency of data reads (vs. a custom-written process that reads from Kafka->BQ), nor a cluster (vs. a Spark job).

Finally, here is an example of a pipeline that uses KafkaIO.
