Kafka to Google Cloud Platform Dataflow ingestion

Question

What are the possible options for streaming, consuming, and ingesting Kafka data from topics into BigQuery/Cloud Storage?

Following on from "Is it possible to use Kafka with Google Cloud Dataflow":

GCP comes with Dataflow, which is built on top of the Apache Beam programming model. Is using KafkaIO with a Beam pipeline the recommended way to perform real-time transformations on the incoming data?

https://beam.apache.org/releases/javadoc/2.5.0/org/apache/beam/sdk/io/kafka/KafkaIO.html

Kafka data can also be pushed to Cloud Pub/Sub and from there into a BigQuery table. Alternatively, Kafka Streams or Spark jobs sitting outside GCP could be used; a minimal sketch of the Pub/Sub route follows below.
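For illustration, the Pub/Sub route might look something like the following sketch. It assumes the Kafka topic is already mirrored into a Pub/Sub topic (for example via a Kafka Connect connector); the project, topic, and table names, the class name, and the single-column schema are all hypothetical placeholders:

import java.util.Collections;
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class PubSubToBigQuery {
  public static void main(String[] args) {
    // Hypothetical names; replace with your own project, topic, and table.
    String topic = "projects/my-project/topics/my-kafka-mirror";
    String table = "my-project:my_dataset.my_table";
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    Pipeline p = Pipeline.create();

    p.apply("ReadFromPubSub", PubsubIO.readStrings().fromTopic(topic))
     // Trivial mapping for illustration; parse your real payload here.
     .apply("ToTableRow", MapElements.into(TypeDescriptor.of(TableRow.class))
                                     .via(payload -> new TableRow().set("payload", payload)))
     .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                                         .to(table)
                                         .withSchema(schema));

    p.run();
  }
}

The trade-off of this route is the extra hop and the mirroring component that has to be operated between Kafka and Pub/Sub.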

What factors should be considered in the design decision, given that the data is hosted entirely on Google Cloud Platform (GCP)?

Answer

Kafka support was added to Apache Beam in 2016, with the KafkaIO set of transforms. This means that Dataflow supports it as well.

The easiest way to load the data into BigQuery would be an Apache Beam pipeline running on Dataflow. Your pipeline would look something like this:

Pipeline p = Pipeline.create();

p.apply("ReadFromKafka", KafkaIO.<String, String>read()
                                .withTopic(myTopic)...)
 .apply("TransformData", ParDo.of(new FormatKafkaDataToBigQueryTableRow(mySchema)))
 .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
                  .to(myTableName)
                  .withSchema(mySchema));

p.run().waitUntilFinish();
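The FormatKafkaDataToBigQueryTableRow DoFn is left undefined in the snippet above. One possible shape for it, assuming string keys and values, could be the following sketch (not a definitive implementation; the column names are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;

public class FormatKafkaDataToBigQueryTableRow
    extends DoFn<KafkaRecord<String, String>, TableRow> {

  public FormatKafkaDataToBigQueryTableRow(TableSchema schema) {
    // TableSchema is not Serializable, so a real DoFn would keep a
    // serializable form of it (e.g. its JSON string) if it needed the
    // schema at processing time; this sketch does not use it.
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Map the Kafka key/value pair onto BigQuery columns; adapt the
    // column names and parsing to your actual schema.
    c.output(new TableRow()
        .set("key", c.element().getKV().getKey())
        .set("value", c.element().getKV().getValue()));
  }
}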

The advantages of using a Beam pipeline on Dataflow are that you would not have to manage the offsets, state, and consistency of data reads (compared to a custom-written process that reads from Kafka into BQ), nor a cluster (compared to a Spark job).
