Google Cloud Dataflow - From PubSub to Parquet

Problem Description

I'm trying to write Google PubSub messages to Google Cloud Storage using Google Cloud Dataflow. The PubSub messages come in JSON format, and the only operation I want to perform is a transformation from JSON to Parquet files.

In the official documentation I found a Google-provided template that reads data from a Pub/Sub topic and writes Avro files to a specified Cloud Storage bucket (https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#pubsub-to-cloud-storage-avro). The problem is that the template source code is written in Java, while I would prefer to use the Python SDK.

These are my first tests with Dataflow and Beam in general, and there isn't a lot of material online to take hints from. Any suggestions, links, guidance, or code snippets would be greatly appreciated.

Recommended Answer

In order to further contribute to the community, I am summarizing our discussion as an answer.

Since you are starting with Dataflow, I can point out some useful topics and advice:

  1. The built-in WriteToParquet() PTransform in Apache Beam is very useful: it writes a PCollection of records to Parquet files. To use it, you need to specify a schema, as indicated in the documentation. In addition, this article will help you better understand how to use this method and how to write the output to a Google Cloud Storage (GCS) bucket; a minimal sketch follows below.
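
For example, here is a minimal sketch of WriteToParquet() with a pyarrow schema on a bounded, in-memory input. The schema fields ('id', 'value') and the output bucket path are hypothetical placeholders, not part of the question:

```python
import apache_beam as beam
import pyarrow
from apache_beam.io.parquetio import WriteToParquet

# The schema must describe every field of the records being written.
schema = pyarrow.schema([
    ('id', pyarrow.string()),
    ('value', pyarrow.int64()),
])

with beam.Pipeline() as pipeline:
    (
        pipeline
        # In-memory records standing in for real data.
        | 'CreateRecords' >> beam.Create([{'id': 'a', 'value': 1},
                                          {'id': 'b', 'value': 2}])
        # Each dict becomes one row; its keys must match the schema.
        | 'WriteToParquet' >> WriteToParquet(
            file_path_prefix='gs://your-bucket/output/records',
            schema=schema,
            file_name_suffix='.parquet')
    )
```

With the default shard naming, this should produce files like records-00000-of-00001.parquet under the given prefix.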

Google provides this code explaining how to read messages from PubSub and write them to Google Cloud Storage. This quickstart reads messages from PubSub and writes the messages from each window to a bucket; a sketch of that reading-and-windowing part is shown below.
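
Here is a minimal sketch of reading from PubSub and grouping the stream into fixed windows, in the spirit of that quickstart. The topic path is a hypothetical placeholder:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# PubSub sources are unbounded, so the pipeline must run in streaming mode.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            topic='projects/your-project/topics/your-topic')
        # Group the unbounded stream into 60-second fixed windows.
        | 'Window' >> beam.WindowInto(FixedWindows(60))
        # Messages arrive as bytes; decode them before further processing.
        | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
        | 'Print' >> beam.Map(print)
    )
```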

Since you want to read from PubSub, convert the messages to Parquet, and store the files in a GCS bucket, I would advise you to use the following steps in your pipeline: read your messages, parse the JSON, write Parquet files, and store them in GCS. A sketch combining these steps is shown below.
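
Here is a hedged end-to-end sketch combining the pieces above: read JSON messages from PubSub, window the stream, parse each message, and write Parquet files to GCS. The subscription path, bucket, and schema fields are hypothetical placeholders:

```python
import json

import apache_beam as beam
import pyarrow
from apache_beam.io.parquetio import WriteToParquet
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Hypothetical schema: adapt the fields to your actual JSON payload.
schema = pyarrow.schema([
    ('id', pyarrow.string()),
    ('value', pyarrow.int64()),
])

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
            subscription='projects/your-project/subscriptions/your-sub')
        # Windowing is needed before a file sink can emit output on an
        # unbounded (streaming) collection.
        | 'Window' >> beam.WindowInto(FixedWindows(60))
        # Each PubSub message is a JSON payload delivered as bytes.
        | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
        | 'WriteToParquet' >> WriteToParquet(
            file_path_prefix='gs://your-bucket/output/messages',
            schema=schema,
            file_name_suffix='.parquet',
            num_shards=1)
    )
```

One caveat: support for file sinks on unbounded collections in the Python SDK has historically been limited, so if WriteToParquet does not behave as expected in streaming mode, apache_beam.io.fileio.WriteToFiles is an alternative worth investigating.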

I encourage you to read the links above. Then, if you have any other questions, you can post another thread in order to get more specific help.
