ETL approaches to bulk load data in Cloud SQL


Problem Description

I need to ETL data into my Cloud SQL instance. This data comes from API calls. Currently, I'm running custom Java ETL code in Kubernetes via CronJobs; it makes requests to collect this data and loads it into Cloud SQL. The problem is managing the ETL code and monitoring the ETL jobs. The current solution may not scale well as more ETL processes are incorporated. In this context, I need an ETL tool.
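The job each CronJob runs can be sketched roughly as follows. This is a minimal illustration, not my actual code: the API endpoint, table, and field names are hypothetical, and sqlite3 stands in for the Cloud SQL connection (in production this would be a MySQL/PostgreSQL driver):

```python
import json
import sqlite3
from urllib.request import urlopen

API_URL = "https://api.example.com/metrics"  # hypothetical endpoint

def fetch_records(url=API_URL):
    """Collect raw records from the API (one request per run)."""
    with urlopen(url) as resp:
        return json.load(resp)

def transform(record):
    """Simple per-record transformation before loading."""
    return (record["id"], record["name"].strip().lower(), float(record["value"]))

def load(conn, rows):
    """Bulk-insert the transformed rows into the target table."""
    conn.executemany(
        "INSERT INTO api_metrics (id, name, value) VALUES (?, ?, ?)", rows
    )
    conn.commit()

# In the real job this would connect to Cloud SQL; sqlite3 is a stand-in here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_metrics (id INTEGER, name TEXT, value REAL)")
sample = [{"id": 1, "name": " CPU ", "value": "0.75"}]  # stubbed API response
load(conn, [transform(r) for r in sample])
```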

My Cloud SQL instance contains two types of tables: common transactional tables and tables that contain data coming from the API. The second type is mostly read-only from an operational-database perspective, and a large portion of those tables is bulk updated every hour (in batches) to discard the old data and refresh the values.
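One common way to implement that hourly "discard and refresh" is a staging-table swap, so readers never observe a half-loaded table. A minimal sketch of the pattern, with sqlite3 as a stand-in for Cloud SQL and a hypothetical table name (in MySQL an atomic `RENAME TABLE` would do the swap; in PostgreSQL the whole thing can run in one transaction):

```python
import sqlite3

def refresh_table(conn, table, rows):
    """Replace the contents of `table`: load into a staging table with the
    same schema, then swap it in place of the original."""
    staging = f"{table}_staging"
    cur = conn.cursor()
    cur.execute(f"DROP TABLE IF EXISTS {staging}")
    # Empty copy of the target table's schema.
    cur.execute(f"CREATE TABLE {staging} AS SELECT * FROM {table} WHERE 0")
    cur.executemany(f"INSERT INTO {staging} VALUES (?, ?)", rows)
    # Swap: drop the old table and rename the staging table into place.
    cur.execute(f"DROP TABLE {table}")
    cur.execute(f"ALTER TABLE {staging} RENAME TO {table}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_data (id INTEGER, value REAL)")
conn.execute("INSERT INTO api_data VALUES (1, 0.1)")  # stale data
refresh_table(conn, "api_data", [(1, 0.9), (2, 0.4)])  # hourly batch
```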

Considering this context, I noticed that Cloud Dataflow is the ETL tool provided by GCP. However, this tool seems more suitable for big data applications that need to do complex transformations and ingest data in multiple formats. Also, in Dataflow the data is processed in parallel and worker nodes are scaled up as needed. Since Dataflow is a distributed system, the ETL process might incur overhead allocating resources just to do a simple bulk load. On top of that, I noticed that Dataflow doesn't have a dedicated sink for Cloud SQL. This probably means Dataflow isn't the right tool for simple bulk-load operations into a Cloud SQL database.

For my current needs, I only have to do simple transformations and bulk load the data. However, in the future we might want to handle other sources of data (PNG, JSON, CSV files) and sinks (Cloud Storage and maybe BigQuery). Also, in the future we might want to ingest streaming data and store it in Cloud SQL. In this sense, the underlying Apache Beam model is really interesting, since it offers a unified model for batch and streaming.

Given all this context, I can see two approaches:

1) Use an ETL tool like Talend in the cloud to help with monitoring and maintaining ETL jobs.

2) Use Cloud Dataflow, since we may need streaming capabilities and integration with all kinds of sources and sinks.

The problem with the first approach is that I may end up using Cloud Dataflow anyway when future requirements arrive, and that would be bad for my project in terms of infrastructure costs, since I would be paying for two tools.

The problem with the second approach is that Dataflow doesn't seem suitable for simple bulk-load operations into a Cloud SQL database.

Is there something I am getting wrong here? Can someone enlighten me?

Answer

You can use Cloud Dataflow just for loading operations. Here is a tutorial on how to perform ETL operations with Dataflow. It uses BigQuery but you can adapt it to connect to your Cloud SQL or other JDBC sources.
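In Beam's Java SDK that adaptation would typically go through `JdbcIO`; at its core, any such sink writes rows in batched statements. A stand-alone sketch of that batching pattern (sqlite3 standing in for the JDBC connection; the table name and batch size are illustrative):

```python
import sqlite3

def write_in_batches(conn, sql, rows, batch_size=500):
    """Write rows in fixed-size batches, committing once per batch --
    roughly what a JdbcIO-style sink does with its write statements."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        cur.executemany(sql, rows[start:start + batch_size])
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
rows = [(i, f"event-{i}") for i in range(1200)]
write_in_batches(conn, "INSERT INTO events VALUES (?, ?)", rows, batch_size=500)
```

Batching keeps each statement and transaction bounded, which matters when bulk loading into a managed instance like Cloud SQL.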

More examples can be found on the official Google Cloud Platform GitHub page, for instance Dataflow analysis of user-generated content.

You can also have a look at this GCP ETL architecture example that automates the tasks of extracting data from operational databases.

For simpler ETL operations, Dataprep is an easy-to-use tool and provides flow scheduling as well.
