How to integrate Google Cloud SQL with Google BigQuery

Question

I am designing a solution in which Google Cloud SQL will store all data from the regular functioning of the app (essentially OLTP data). The data is expected to grow quite large over time. The data itself is relational in nature, which is why we chose Cloud SQL over Cloud Datastore.

This data needs to be fed into BigQuery for analytics, ideally in near real time, although realistically some lag can be expected. I am trying to design a solution that keeps this lag to a minimum.

My question has 3 parts:

  1. Should I use Cloud SQL for storing the data and then move it to BigQuery, or change the basic design and store the data in BigQuery from the start? Is BigQuery suitable for regular, low-latency OLTP workloads? (I don't think so. Is my assumption correct?)

  2. What is the recommended/best practice for loading Cloud SQL data into BigQuery and having this integration work in near real time?

  3. Is Cloud Dataflow a good option? If I connect Cloud SQL to Cloud Dataflow and then on to BigQuery, will it work? Or is there a better way to achieve this (as asked in question 2)?

Recommended answer

Take a look at how WePay does this:

The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).

With Airflow, they manage to keep BigQuery synchronized with their MySQL database every 15 minutes.
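
The high-watermark extraction described in the quote can be sketched roughly as follows. This is only an illustration, not WePay's actual operator: an in-memory `sqlite3` database stands in for MySQL/Cloud SQL, a plain dict stands in for the BigQuery table, and the `payments` table, its columns, and the helper names `extract_since` and `merge_rows` are all hypothetical.

```python
import sqlite3


def extract_since(conn, table, watermark_col, last_watermark, lookback=0):
    """Fetch rows at or above (last_watermark - lookback).

    Returns the rows as dicts plus the new high watermark.
    Table and column names are assumed to be trusted constants,
    because SQL identifiers cannot be bound as parameters.
    """
    lower_bound = last_watermark - lookback
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} >= ? ORDER BY {watermark_col}"
    )
    cursor = conn.execute(query, (lower_bound,))
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, values)) for values in cursor.fetchall()]
    new_watermark = max([last_watermark] + [row[watermark_col] for row in rows])
    return rows, new_watermark


def merge_rows(destination, rows, key="id"):
    """Upsert by primary key so rows re-read by the lookback do not duplicate."""
    for row in rows:
        destination[row[key]] = row


# Hypothetical source table that receives updates, so the watermark is a
# modification timestamp (an integer here, to keep the example simple).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, amount INTEGER, modified_at INTEGER)"
)
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", [(1, 100, 10), (2, 250, 20)])

warehouse = {}  # stand-in for the destination table

# First sync: pull everything from the beginning.
rows, watermark = extract_since(conn, "payments", "modified_at", last_watermark=0)
merge_rows(warehouse, rows)

# The source changes between syncs: one update and one insert.
conn.execute("UPDATE payments SET amount = 120, modified_at = 30 WHERE id = 1")
conn.execute("INSERT INTO payments VALUES (3, 75, 35)")

# Second sync: start slightly before the last watermark to catch stragglers.
rows, watermark = extract_since(conn, "payments", "modified_at", watermark, lookback=5)
merge_rows(warehouse, rows)
```

Because the lookback window deliberately re-reads some rows, the load step has to be idempotent. Here that is done by keying on `id`; in a real pipeline the equivalent would be some form of deduplication or merge on the destination side.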
